
How-To Tutorials - Data


Disaster Recovery in MySQL for Python

Packt
25 Sep 2010
10 min read
MySQL for Python
Integrate the flexibility of Python and the power of MySQL to boost the productivity of your Python applications:
- Implement the outstanding features of Python's MySQL library to their full potential
- See how to make MySQL take the processing burden from your programs
- Learn how to employ Python with MySQL to power your websites and desktop applications
- Apply your knowledge of MySQL and Python to real-world problems instead of hypothetical scenarios
- A manual packed with step-by-step exercises to integrate your Python applications with the MySQL database server

Read more about this book

(For more resources on Python, see here.)

The purpose of the archiving methods covered in this article is to allow you, as the developer, to back up the databases that you use for your work without having to rely on the database administrator. As noted later in the article, there are more sophisticated backup methods than those we cover here, but they involve system-administration tasks that lie beyond the remit of any development post and are thus beyond the scope of this article.

Every database needs a backup plan

When archiving a database, one of the critical questions is how to take a snapshot backup without users changing the data in the process. If data changes in the midst of the backup, the result is an inconsistent backup that compromises the integrity of the archive. There are two strategic approaches to backing up a database system:

- Offline backups
- Live backups

Which you use depends on the dynamics of the system in question and the importance of the data being stored. In this article, we will look at each in turn and at the way to implement them.

Offline backups

Offline backups are done by shutting down the server so the records can be archived without the fear of them being changed by a user. Shutting down also helps to ensure that the server stops gracefully and that errors are avoided. The problem with using this method on most production systems is that it requires a temporary loss of access to the service. For most service providers, such a consequence is anathema to the business model.

The value of this method is that you can be certain the database has not changed at all while the backup is run. Further, in many cases the backup runs faster because the processor is not simultaneously serving data. For this reason, offline backups are usually performed in controlled environments or in situations where disruption is not critical to the user. These include internal databases, where administrators can inform all users about the disruption ahead of time, and small business websites that do not receive a lot of traffic. Offline backups also have the benefit that the backup is usually held in a single file, which can then be used to copy a database across hosts with relative ease.

Shutting down a server obviously requires system administrator-like authority, so creating an offline backup relies on the system administrator shutting down the server. If your responsibilities include database administration, you will also have sufficient permission to shut down the server.

Live backups

Live backups occur while the server continues to accept queries from users, that is, while it is still online. They work by locking down the tables so no new data may be written to them. Users usually do not lose access to the data, and the integrity of the archive for a particular point in time is assured.
Live backups are used by large, data-intensive sites such as Nokia's Ovi services and Google's web services. However, because they do not always require administrator access to the server itself, they tend to suit the backup needs of a development project.

Choosing a backup method

Having determined whether a database can be stopped for the backup, a developer can choose from three methods of archiving:

- Copying the data files (including administrative files such as logs and tablespaces)
- Exporting delimited text files
- Backing up with command-line programs

Which you choose depends on what permissions you have on the server and how you are accessing the data.

MySQL also allows for two other forms of backup: using the binary log and setting up replication (using master and slave servers). To be sure, these are the best ways to back up a MySQL database, but both are administrative tasks that require system-administrator authority; they are not typically available to a developer. However, you can read more about them in the MySQL documentation.

Use of the binary log for incremental backups is documented at:
http://dev.mysql.com/doc/refman/5.5/en/point-in-time-recovery.html

Setting up replication is dealt with further at:
http://dev.mysql.com/doc/refman/5.5/en/replication-solutions-backups.html

Copying the table files

The most direct way to back up database files is to copy them from where MySQL stores the database itself. The location naturally varies by platform. If you are unsure which directory holds the MySQL database files, you can query MySQL itself to check:

mysql> SHOW VARIABLES LIKE 'datadir';

Alternatively, the following shell command sequence will give you the same information:

$ mysqladmin variables | grep datadir
| datadir | /var/lib/mysql/ |

Note that the location of administrative files, such as binary logs and InnoDB tablespaces, is customizable and may not be in the data directory.

If you do not have direct access to the MySQL server, you can also write a simple Python program to get the information:

#!/usr/bin/env python
import MySQLdb

mydb = MySQLdb.connect('<hostname>', '<user>', '<password>')
cursor = mydb.cursor()
cursor.execute("SHOW VARIABLES LIKE 'datadir'")
results = cursor.fetchall()
for variable, value in results:
    print "%s: %s" % (variable, value)

A slight alteration of this program will also allow you to query several servers automatically. Simply change the login details and adapt the output to clarify which data is associated with which server.

Locking and flushing

If you are backing up an offline MyISAM system, you can copy any of the files once the server has been stopped. Before backing up a live system, however, you must lock the tables and flush the log files in order to get a consistent backup at a specific point in time. These tasks are handled by the LOCK TABLES and FLUSH commands, respectively. When you use MySQL and its ancillary programs (such as mysqldump) to perform a backup, these tasks are performed automatically. When copying files directly, you must ensure both are done. How you apply them depends on whether you are backing up an entire database or a single table.

LOCK TABLES

The LOCK TABLES command secures a specified table in a designated way. Tables can be referenced with aliases using AS and can be locked for reading or writing. For our purposes, we need only a read lock to create a backup. The syntax looks like this:

LOCK TABLES <tablename> READ;

This command requires two privileges: LOCK TABLES and SELECT.
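The same statement can be issued from a Python program with MySQLdb, the library used throughout this article. This is only a sketch: the connection details are placeholders and the table name mytable is hypothetical, not taken from the book.

#!/usr/bin/env python
import MySQLdb

# Placeholder connection details and table name -- adjust for your own server.
mydb = MySQLdb.connect('<hostname>', '<user>', '<password>', '<database>')
cursor = mydb.cursor()

cursor.execute("LOCK TABLES mytable READ")   # acquire a read lock on one table
# ... copy the relevant table files while the lock is held ...
cursor.execute("UNLOCK TABLES")              # release the lock (see below)

The lock belongs to the session that issued it, so the copy must happen while this connection stays open.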
It must be noted that LOCK TABLES does not lock all tables in a database, but only one. This is useful for performing smaller backups that will not interrupt services or put too severe a strain on the server. However, unless you automate the process, manually locking and unlocking tables as you back up data can be ridiculously inefficient.

FLUSH

The FLUSH command is used to reset MySQL's caches. By re-initiating the cache at the point of backup, we get a clear point of demarcation for the database backup, both in the database itself and in the logs. The basic syntax is straightforward:

FLUSH <the object to be reset>;

Use of FLUSH presupposes the RELOAD privilege for all relevant databases. What we reload depends on the process we are performing. For the purpose of backing up, we will always be flushing tables:

FLUSH TABLES;

How we "flush" the tables will depend on whether we have already used the LOCK TABLES command to lock the table. If we have already locked a given table, we can call FLUSH for that specific table:

FLUSH TABLES <tablename>;

However, if we want to copy an entire database, we can bypass the LOCK TABLES command by incorporating the same call into FLUSH:

FLUSH TABLES WITH READ LOCK;

This use of FLUSH applies across the database, and all tables will be subject to the read lock. If the account accessing the database does not have sufficient privileges for all databases, an error will be thrown.

Unlocking the tables

Once you have copied the files for a backup, you need to remove the read lock you imposed earlier. This is done by releasing all locks for the current session:

UNLOCK TABLES;

Restoring the data

Restoring copies of the actual storage files is as simple as copying them back into place. This is best done while MySQL is stopped, lest you risk corruption. Similarly, if you have a separate MySQL server and want to transfer a database, you simply need to copy the directory structure from one server to the other. On restarting, MySQL will see the new database and treat it as if it had been created natively. When restoring the original data files, it is critical to ensure that the permissions on the files and directories are appropriate and match those of the other MySQL databases.

Delimited backups within MySQL

MySQL allows data to be exported from the MySQL command line. To do so, we simply direct the output from a SELECT statement to an output file.

Using SELECT INTO OUTFILE to export data

Using sakila, we can save the data from film to a file called film.data as follows:

SELECT * INTO OUTFILE 'film.data' FROM film;

This results in the data being written in a tab-delimited format. The file will be written to the directory in which MySQL stores the sakila data. Therefore, the account under which the SELECT statement is executed must have the FILE privilege for writing the file, as well as login access on the server to view or retrieve it.

The OUTFILE option on SELECT can be used to write to any location on the server to which MySQL has write permission. One simply needs to prepend that directory location to the file name. For example, to write the same file to the /tmp directory on a Unix system, use:

SELECT * INTO OUTFILE '/tmp/film.data' FROM film;

Windows simply requires adjusting the directory structure accordingly.
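If you prefer to drive the export from Python rather than from the MySQL command line, the same statement can be executed through MySQLdb. This is a hedged sketch: the connection details are placeholders, and it assumes the connecting account has the FILE privilege and the sakila sample database installed.

#!/usr/bin/env python
import MySQLdb

# Placeholder connection details -- adjust for your own server.
mydb = MySQLdb.connect('<hostname>', '<user>', '<password>', 'sakila')
cursor = mydb.cursor()

# Write a tab-delimited copy of the film table to /tmp on the server.
cursor.execute("SELECT * INTO OUTFILE '/tmp/film.data' FROM film")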
Using LOAD DATA INFILE to import data

If you have an output file or a similar tab-delimited file and want to load it into MySQL, use the LOAD DATA INFILE command. The basic syntax is:

LOAD DATA INFILE '<filename>' INTO TABLE <tablename>;

For example, to import the film.data file from the /tmp directory into another table called film2, we would issue this command:

LOAD DATA INFILE '/tmp/film.data' INTO TABLE film2;

Note that LOAD DATA INFILE presupposes the creation of the table into which the data is being loaded. In the preceding example, if film2 had not been created, we would receive an error. If you are trying to mirror a table, remember to use the SHOW CREATE TABLE query to save yourself time in formulating the CREATE statement.

This discussion only touches on how to use LOAD DATA INFILE for inputting data created with the OUTFILE option of SELECT, but the command handles text files with just about any set of delimiters. To read more on how to use it for other file formats, see the MySQL documentation at:
http://dev.mysql.com/doc/refman/5.5/en/load-data.html
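The import can be scripted in the same way. Again, this is only a sketch with placeholder connection details, and it assumes film2 has already been created with the same structure as film.

#!/usr/bin/env python
import MySQLdb

# Placeholder connection details -- adjust for your own server.
mydb = MySQLdb.connect('<hostname>', '<user>', '<password>', 'sakila')
cursor = mydb.cursor()

# film2 must already exist, as noted above.
cursor.execute("LOAD DATA INFILE '/tmp/film.data' INTO TABLE film2")
mydb.commit()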


Understanding Model-based Clustering

Packt
14 Sep 2015
10 min read
In this article by Ashish Gupta, author of the book Apache Mahout Clustering Designs, we will discuss a model-based clustering algorithm. Model-based clustering is used to overcome some of the deficiencies that can occur with the K-means or Fuzzy K-means algorithms. We will discuss the following topics in this article:

- Learning model-based clustering
- Understanding Dirichlet clustering
- Understanding topic modeling

(For more resources related to this topic, see here.)

Learning model-based clustering

In model-based clustering, we assume that the data is generated by a model and try to recover that model from the data. The right model will fit the data better than other models. In the K-means algorithm, we provide the initial set of clusters, and K-means gives us the data points belonging to each cluster. Consider a case where the clusters are not normally distributed; K-means will not improve the clusters much. In this scenario, a model-based clustering algorithm will do the job. Another situation to think about when dividing clusters is hierarchical clustering, where we need to find the overlapping information. This situation is also covered by model-based clustering algorithms. If the components are not all well separated, a cluster can consist of multiple mixture components.

In simple terms, in model-based clustering the data is a mixture of two or more components. Each component has an associated probability and is described by a density function. Model-based clustering can capture the hierarchy and the overlap of the clusters at the same time. Partitions are determined by an EM (expectation-maximization) algorithm for maximum likelihood. The generated models are compared by the Bayesian Information Criterion (BIC), and the model with the lowest BIC is preferred. In the equation BIC = -2 log(L) + m log(n), L is the likelihood function, m is the number of free parameters to be estimated, and n is the number of data points.
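To make the BIC comparison concrete, here is a small sketch with made-up log-likelihood values; they are illustrative only and not derived from any real dataset.

from math import log

def bic(log_likelihood, m, n):
    # BIC = -2 log(L) + m log(n): m free parameters, n data points
    return -2 * log_likelihood + m * log(n)

# Hypothetical fits of two candidate models to the same 500 data points.
print(bic(-1200.0, m=4, n=500))    # simpler model
print(bic(-1150.0, m=12, n=500))   # richer model; the lower BIC is preferred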
Understanding Dirichlet clustering

Dirichlet clustering is a model-based clustering method. This algorithm is used to understand the data and to cluster it. Dirichlet clustering is a nonparametric, Bayesian modeling process; it is nonparametric because it can have an infinite number of parameters. Dirichlet clustering is based on the Dirichlet distribution. For this algorithm, we have a probabilistic mixture of a number of models that are used to explain the data, and each data point comes from one of the available models. The models are taken from a sample of a prior distribution of models, and points are assigned to these models iteratively. In each iteration, the probability that a point was generated by a particular model is calculated. After the points are assigned to a model, new parameters for each of the models are sampled. This sample is drawn from the posterior distribution of the model parameters, and it considers all the observed data points assigned to the model. This sampling provides more information than normal clustering:

- As we assign points to different models, we can find out how many models are supported by the data.
- We can also see how well the data is described by a model and how well two points are explained by the same model.

Topic modeling

In machine learning, topic modeling means discovering the topics of a text document using a statistical model. A document on a particular topic contains some characteristic words. For example, if you are reading an article on sports, there is a high chance that you will encounter words such as football, baseball, Formula One, and Olympics. So a topic model actually uncovers the hidden sense of an article or a document. Topic models are algorithms that can discover the main themes from a large set of unstructured documents; they uncover the semantic structure of the text. Topic modeling enables us to organize large-scale electronic archives.

Mahout has an implementation of one of the topic modeling algorithms: Latent Dirichlet Allocation (LDA). LDA is a statistical model of document collections that tries to capture the intuition behind the documents. In normal clustering algorithms, if words having the same meaning don't occur together, the algorithm will not associate them, but LDA can find out which two words are used in a similar context; LDA is better than other algorithms at finding such associations.

LDA is a generative, probabilistic model. It is generative because the model is tweaked to fit the data, and, using the parameters of the model, we can generate the data on which it fits. It is probabilistic because each topic is modeled as an infinite mixture over an underlying set of topic probabilities. The topic probabilities provide an explicit representation of a document. Graphically, an LDA model can be represented as follows (figure not reproduced here). The notation used in the model represents the following:

- M, N, and K represent the number of documents, the number of words in a document, and the number of topics respectively.
- α is the prior weight of topic k in a document.
- β is the prior weight of word w in a topic.
- φ is the probability of a word occurring in a topic.
- Θ is the topic distribution.
- z is the identity of a topic of all the words in all the documents.
- w is the identity of all the words in all the documents.

How does LDA work in map-reduce mode? These are the steps that LDA follows in the mapper and reducer:

- Mapper phase: The program starts with an empty topic model. All the documents are read by different mappers, and the probabilities of each topic for each word in the document are calculated.
- Reducer phase: The reducer receives the counts of probabilities. These counts are summed and the model is normalized.

This process is iterative; in each iteration the sum of the probabilities is calculated, and the process stops when it stops changing. A parameter, similar to the convergence threshold in K-means, is set to check for the changes. In the end, LDA estimates how well the model fits the data.

In Mahout, the Collapsed Variational Bayes (CVB) algorithm is implemented for LDA. LDA takes a term frequency vector as its input, not tf-idf vectors. We need to take care of two parameters while running the LDA algorithm: the number of topics and the number of words in the documents. A higher number of topics will produce very low-level topics, while a lower number will produce generalized, high-level topics such as sports. In Mahout, mean field variational inference is used to estimate the model. It is similar to expectation-maximization for hierarchical Bayesian models. The expectation step reads each document and calculates the probability of each topic for each word in every document. The maximization step takes the counts, sums all the probabilities, and normalizes them.

Running LDA using Mahout

To run LDA using Mahout, we will use the 20 Newsgroups dataset.
We will convert the corpus to vectors, run LDA on these vectors, and get the resultant topics. Let's run this example to see how topic modeling works in Mahout.

Dataset selection

We will use the 20 Newsgroups dataset for this exercise. Download the 20news-bydate.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/.

Steps to execute CVB (LDA)

Perform the following steps to execute the CVB algorithm:

1. Create a 20newsdata directory and unzip the data there:

mkdir /tmp/20newsdata
cd /tmp/20newsdata
tar -xzvf /tmp/20news-bydate.tar.gz

There are two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train. Now create another directory, 20newsdataall, and merge both the training and test data of the group.

2. Move to the home directory and execute the following commands:

mkdir /tmp/20newsdataall
cp -R /tmp/20newsdata/*/* /tmp/20newsdataall

3. Create a directory in Hadoop and save this data in HDFS:

hadoop fs -mkdir /usr/hue/20newsdata
hadoop fs -put /tmp/20newsdataall /usr/hue/20newsdata

4. Mahout CVB accepts the data in vector format. For this, first generate a sequence file from the directory as follows:

bin/mahout seqdirectory -i /user/hue/20newsdata/20newsdataall -o /user/hue/20newsdataseq-out

5. Convert the sequence file to a sparse vector using, as discussed earlier, the term frequency weight:

bin/mahout seq2sparse -i /user/hue/20newsdataseq-out/part-m-00000 -o /user/hue/20newsdatavec -lnorm -nv -wt tf

6. Convert the sparse vector to the input form required by the CVB algorithm:

bin/mahout rowid -i /user/hue/20newsdatavec/tf-vectors -o /user/hue/20newsmatrix

7. Run the CVB algorithm:

bin/mahout cvb -i /user/hue/20newsmatrix/matrix -o /user/hue/ldaoutput -k 10 -x 20 -dict /user/hue/20newsdatavec/dictionary.file-0 -dt /user/hue/ldatopics -mt /user/hue/ldamodel

The parameters used in the preceding command can be explained as follows:

- -i: the input path of the document vector
- -o: the output path of the topic-term distribution
- -k: the number of latent topics
- -x: the maximum number of iterations
- -dict: the term dictionary files
- -dt: the output path of the document-topic distribution
- -mt: the model state path after each iteration

Once the command finishes, you will see a summary of the run on the screen (screenshot not reproduced here).

8. To view the output, run the following command:

bin/mahout vectordump -i /user/hue/ldaoutput/ -d /user/hue/20newsdatavec/dictionary.file-0 -dt sequencefile -vs 10 -sort true -o /tmp/lda-output.txt

The parameters used in the preceding command can be explained as follows:

- -i: the input location of the CVB output
- -d: the dictionary file location created during vector creation
- -dt: the dictionary file type (sequence or text)
- -vs: the vector size
- -sort: a flag set to true or false
- -o: the output location on the local filesystem

Your output will now be saved in the local filesystem. Open the file and you will see each term together with its probability for the discovered topics (screenshot not reproduced here).

Summary

In this article, we learned about model-based clustering, the Dirichlet process, and topic modeling. In model-based clustering, we try to obtain the model from the data, while the Dirichlet process is used to understand the data.
Topic modeling helps us to identify the topics in an article or in a set of documents. We discussed how Mahout implements topic modeling using Latent Dirichlet Allocation, how it is implemented in map-reduce, and how to use Mahout to find the topic distribution over a set of documents.

Resources for Article:

Further resources on this subject:
- Learning Random Forest Using Mahout [article]
- Implementing the Naïve Bayes classifier in Mahout [article]
- Clustering [article]


Regular Expressions in Python 2.6 Text Processing

Packt
13 Jan 2011
17 min read
Python 2.6 Text Processing: Beginners Guide

Simple string matching

Regular expressions are notoriously hard to read, especially if you're not familiar with the obscure syntax. For that reason, let's start simple and look at some easy regular expressions at the most basic level. Before we begin, remember that Python raw strings allow us to include backslashes without the need for additional escaping. Whenever you define regular expressions, you should do so using the raw string syntax.

Time for action – testing an HTTP URL

In this example, we'll check values as they're entered via the command line as a means to introduce the technology. We'll dive deeper into regular expressions as we move forward. We'll be scanning URLs to ensure our end users inputted valid data.

Create a new file and name it url_regex.py. Enter the following code:

import sys
import re

# Make sure we have a single URL argument.
if len(sys.argv) != 2:
    print >>sys.stderr, "URL Required"
    sys.exit(-1)

# Easier access.
url = sys.argv[1]

# Ensure we were passed a somewhat valid URL.
# This is a superficial test.
if re.match(r'^https?:/{2}\w.+$', url):
    print "This looks valid"
else:
    print "This looks invalid"

Now, run the example script on the command line a few times, passing various different values to it:

(text_processing)$ python url_regex.py http://www.jmcneil.net
This looks valid
(text_processing)$ python url_regex.py http://intranet
This looks valid
(text_processing)$ python url_regex.py http://www.packtpub.com
This looks valid
(text_processing)$ python url_regex.py https://store
This looks valid
(text_processing)$ python url_regex.py httpsstore
This looks invalid
(text_processing)$ python url_regex.py https:??store
This looks invalid
(text_processing)$

What just happened?

We took a look at a very simple pattern and introduced you to the plumbing needed to perform a match test. Let's walk through this little example, skipping the boilerplate code.

First of all, we imported the re module. The re module, as you probably inferred from the name, contains all of Python's regular expression support. Any time you need to work with regular expressions, you'll need to import the re module. Next, we read a URL from the command line and bind a temporary attribute, which makes for cleaner code. Directly below that, you should notice a line that reads re.match(r'^https?:/{2}\w.+$', url). This line checks whether the string referenced by the url attribute matches the ^https?:/{2}\w.+$ pattern. If a match is found, we print a success message; otherwise, the end user receives some negative feedback indicating that the input value is incorrect.

This example leaves out a lot of details regarding HTTP URL formats. If you were performing validation on user input, one place to look would be http://formencode.org/. FormEncode is an HTML form-processing and data-validation framework written by Ian Bicking.
Understanding the match function

The most basic method of testing for a match is via the re.match function, as we did in the previous example. The match function takes a regular expression pattern and a string value. For example, consider the following snippet of code:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'pattern', 'pattern')
<_sre.SRE_Match object at 0x1004811d0>
>>>

Here, we simply passed a regular expression of "pattern" and a string literal of "pattern" to the re.match function. As they were identical, the result was a match. The returned Match object indicates the match was successful. The re.match function returns None otherwise.

>>> re.match(r'pattern', 'failure')
>>>

Learning basic syntax

A regular expression is generally a collection of literal string data and special metacharacters that represents a pattern of text. The simplest regular expression is just literal text that only matches itself. In addition to literal text, there are a series of special characters that can be used to convey additional meaning, such as repetition, sets, wildcards, and anchors. Generally, the punctuation characters field this responsibility.

Detecting repetition

When building up expressions, it's useful to be able to match certain repeating patterns without needing to duplicate values. It's also beneficial to perform conditional matches. This lets us check for content such as "match the letter a, followed by the number one at least three times, but no more than seven times." For example, the code below does just that:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'^a1{3,7}$', 'a1111111')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^a1{3,7}$', '1111111')
>>>

If the repetition operator follows a valid regular expression enclosed in parentheses, it will perform repetition on that entire expression. For example:

>>> re.match(r'^(a1){3,7}$', 'a1a1a1')
<_sre.SRE_Match object at 0x100493918>
>>> re.match(r'^(a1){3,7}$', 'a11111')
>>>

The following table details all of the special characters that can be used for marking repeating values within a regular expression (table not reproduced here).
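Before moving on, here is a quick sketch exercising the most common repetition characters; the example strings are arbitrary:

import re

print bool(re.match(r'^ab?c$', 'ac'))       # ?     : zero or one 'b'
print bool(re.match(r'^ab*c$', 'abbbc'))    # *     : zero or more 'b'
print bool(re.match(r'^ab+c$', 'ac'))       # +     : one or more 'b' -- False here
print bool(re.match(r'^a1{3,7}$', 'a11'))   # {m,n} : too few repetitions -- False here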
Specifying character sets and classes

In some circumstances, it's useful to collect groups of characters into a set such that any of the values in the set will trigger a match. It's also useful to match any character at all. The dot operator does just that.

A character set is enclosed within standard square brackets. A set defines a series of alternating (or) entities that will match a given text value. If the first character within a set is a caret (^), then a negation is performed: all characters not defined by that set would then match. There are a couple of additional interesting set properties. For ranged values, it's possible to specify an entire selection using a hyphen. For example, '[0-6a-d]' would match all values between 0 and 6, and a and d. Special characters listed within brackets lose their special meaning. The exceptions to this rule are the hyphen and the closing bracket. If you need to include a closing bracket or a hyphen within a regular expression, you can either place them as the first elements in the set or escape them by preceding them with a backslash.

As an example, consider the following snippet, which matches a string containing a hexadecimal number:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'^0x[a-f0-9]+$', '0xff')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^0x[a-f0-9]+$', '0x01')
<_sre.SRE_Match object at 0x1004816b0>
>>> re.match(r'^0x[a-f0-9]+$', '0xz')
>>>

In addition to the bracket notation, Python ships with some predefined classes. Generally, these are letter values prefixed with a backslash escape. When they appear within a set, the set includes all values for which they'll match. The \d escape matches all digit values. It would have been possible to write the above example in a slightly more compact manner:

>>> re.match(r'^0x[a-f\d]+$', '0x33')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^0x[a-f\d]+$', '0x3f')
<_sre.SRE_Match object at 0x1004816b0>
>>>

The following table outlines all of the character sets and classes available (table not reproduced here). One thing that should become apparent is that the lowercase classes are matches, whereas their uppercase counterparts are the inverse.

Applying anchors to restrict matches

There are times when it's important that patterns match at a certain position within a string of text. Why is this important? Consider a simple number validation test. If a user enters a digit but mistakenly includes a trailing letter, an expression checking for the existence of a digit alone will pass.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'\d', '1f')
<_sre.SRE_Match object at 0x1004811d0>
>>>

Well, that's unexpected. The regular expression engine sees the leading '1' and considers it a match. It disregards the rest of the string, as we've not instructed it to do anything else with it. To fix the problem that we have just seen, we need to apply anchors.

>>> re.match(r'^\d$', '6')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^\d$', '6f')
>>>

Now, attempting to sneak in a non-digit character results in no match. By preceding our expression with a caret (^) and terminating it with a dollar sign ($), we effectively said "between the start and the end of this string, there can only be one digit." Anchors, among various other metacharacters, are considered zero-width matches. Basically, this means that a match doesn't advance the regular expression engine within the test string. We're not limited to either end of a string, either. Here's a collection of all of the available anchors provided by Python (table not reproduced here).
Wrapping it up

Now that we've covered the basics of regular expression syntax, let's double back and take a look at the expression we used in our first example. It might be a bit easier if we break it down with a diagram (not reproduced here). With a bit of background, this pattern should now make sense.

We begin the regular expression with a caret, which matches the beginning of the string. The very next element is the literal http. As our caret matches the start of a string and must be immediately followed by http, this is equivalent to saying that our string must start with http. Next, we include a question mark after the s in https. The question mark states that the previous entity should be matched either zero or one time. By default, the evaluation engine is looking character-by-character, so the previous entity in this case is simply "s". We do this so our test passes for both secure and non-secure addresses.

As we advance forward in our string, the next special term we run into is {2}, and it follows a simple forward slash. This says that the forward slash should appear exactly two times. Now, in the real world, it would probably make more sense to simply type the second slash. Using the repetition check like this not only requires more typing, but it also causes the regular expression engine to work harder.

Immediately after the repetition match, we include a \w. The \w, if you'll remember from the previous tables, expands to [0-9a-zA-Z_], or any word character. This is to ensure that our URL doesn't begin with a special character. The dot character after the \w matches anything except a newline. Essentially, we're saying "match anything else; we don't much care what." The plus sign states that the preceding wildcard should match at least once. Finally, we're anchoring the end of the string; however, in this example, this isn't really necessary.

Have a go hero – tidying up our URL test

There are a few intentional inconsistencies and problems with this regular expression as designed. To name a few:

- Properly formatted URLs should only contain a few special characters. Other values should be URL-encoded using percent escapes. This regular expression doesn't check for that.
- It's possible to include newline characters towards the end of the URL, which is clearly not supported by any browser!
- The \w followed by the .+ implicitly sets a minimum limit of two characters after the protocol specification. A single letter is perfectly valid.

You guessed it. Using what we've covered thus far, it should be possible for you to backtrack and update our regular expression in order to fix these flaws. For more information on what characters are allowed, have a look at http://www.w3schools.com/tags/ref_urlencode.asp. One possible rewrite is sketched below.
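As a starting point for the exercise, here is one way the pattern could be tightened and annotated using re.VERBOSE. The character classes chosen here are an assumption for illustration, not the book's official solution:

import re

URL_RE = re.compile(r"""
    ^https?          # protocol: http or https
    ://              # literal separator, typed out instead of /{2}
    [0-9a-zA-Z]      # host must start with a word character
    [0-9a-zA-Z.-]*   # remaining host characters (an assumed subset)
    (/[^\s]*)?       # optional path containing no whitespace or newlines
    \Z               # end of string; stricter than $, which allows a trailing newline
    """, re.VERBOSE)

for url in ('http://www.jmcneil.net', 'https://store', 'https:??store'):
    print "%s -> %s" % (url, 'valid' if URL_RE.match(url) else 'invalid')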
""" line = line.strip() match = re.match(self._re, line) if not match: raise ParsingError("Malformed line: " + line) return { 'size': 0 if match.group(6) == '-' else int(match.group(6)), 'status': match.group('rcode'), 'file_requested': match.group(5).split()[1] } Running the logscan application should now produce the same output as it did when we were using a more basic, split-based approach. (text_processing)$ cat example3.log | logscan -c logscan.cfg What just happened? First of all, we imported the re module so that we have access to Python's regular expression services. Next, at the LogProcessor class level, we defined a regular expression. Though, this time we did so via re.compile rather than a simple string. Regular expressions that are used more than a handful of times should be "prepared" by running them through re.compile first. This eases the load placed on the system by frequently used patterns. The re.compile function returns a SRE_Pattern object that can be passed in just about anywhere you can pass in a regular expression. We then replace our split method to take advantage of regular expressions. As you can see, we simply pass self._re in as opposed to a string-based regular expression. If we don't have a match, we raise a ParsingError, which bubbles up and generates an appropriate error message, much like we would see on an invalid split case. Now, the end of the split method probably looks somewhat peculiar to you. Here, we've referenced our matched values via group identification mechanisms rather than by their list index into the split results. Regular expression components surrounded by parenthesis create a group, which can be accessed via the group method on the Match object later down the road. It's also possible to access a previously matched group from within the same regular expression. Let's look at a somewhat smaller example. >>> match = re.match(r'(0x[0-9a-f]+) (?P<two>1)', '0xff 0xff') >>> match.group(1) '0xff' >>> match.group(2) '0xff' >>> match.group('two') '0xff' >>> match.group('failure') Traceback (most recent call last): File "<stdin>", line 1, in <module> IndexError: no such group >>> Here, we surround two distinct regular expressions components with parenthesis, (0x[0-9a-f]+), and (?P&lttwo>1). The first regular expression matches a hexadecimal number. This becomes group ID 1. The second expression matches whatever was found by the first, via the use of the 1. The "backslash-one" syntax references the first match. So, this entire regular expression only matches when we repeat the same hexadecimal number twice, separated with a space. The ?P&lttwo> syntax is detailed below. As you can see, the match is referenced after-the-fact using the match.group method, which takes a numeric index as its argument. Using standard regular expressions, you'll need to refer to a matched group using its index number. However, if you'll look at the second group, we added a (?P&ltname>) construct. This is a Python extension that lets us refer to groupings by name, rather than by numeric group ID. The result is that we can reference groups of this type by name as opposed to simple numbers. Finally, if an invalid group ID is passed in, an IndexError exception is thrown. The following table outlines the characters used for building groups within a Python regular expression: Finally, it's worth pointing out that parenthesis can also be used to alter priority as well. For example, consider this code. 
Finally, it's worth pointing out that parentheses can also be used to alter priority. For example, consider this code:

>>> re.match(r'abc{2}', 'abcc')
<_sre.SRE_Match object at 0x1004818b8>
>>> re.match(r'a(bc){2}', 'abcc')
>>> re.match(r'a(bc){2}', 'abcbc')
<_sre.SRE_Match object at 0x1004937b0>
>>>

Whereas the first example matches c exactly two times, the second and third lines require us to repeat bc twice. This changes the meaning of the regular expression from "repeat the previous character twice" to "repeat the previous match within parentheses twice." The value within the group could have been its own complex regular expression, such as a([b-c]){2}.

Have a go hero – updating our stats processor to use named groups

Spend a couple of minutes and update our statistics processor to use named groups rather than integer-based references. This makes it slightly easier to read the assignment code in the split method. You do not need to create names for all of the groups; naming the ones we're actually using will do.

Using greedy versus non-greedy operators

Regular expressions generally like to match as much text as possible before giving up or yielding to the next token in a pattern string. If that behavior is unexpected and not fully understood, it can be difficult to get your regular expression right. Let's take a look at a small code sample to illustrate the point.

Suppose that, with your newfound knowledge of regular expressions, you decided to write a small script to remove the angled brackets surrounding HTML tags. You might be tempted to do it like this:

>>> match = re.match(r'(?P<tag><.+>)', '<title>Web Page</title>')
>>> match.group('tag')
'<title>Web Page</title>'
>>>

The result is probably not what you expected. The reason we got this result is that regular expressions are greedy by nature. That is, they'll attempt to match as much as possible. If you look closely, '<title>' is a match for the supplied regular expression, as is the entire '<title>Web Page</title>' string. Both start with an angled bracket, contain at least one character, and both end with an angled bracket. The fix is to insert the question mark character, or the non-greedy operator, directly after the repetition specification. So, the following code snippet fixes the problem:

>>> match = re.match(r'(?P<tag><.+?>)', '<title>Web Page</title>')
>>> match.group('tag')
'<title>'
>>>

The question mark changes our meaning from "match as much as you possibly can" to "match only the minimum required to actually match."
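The same distinction matters when pulling several tags out of one string. Here is a short sketch using re.findall, which collects every non-overlapping match; the HTML snippet is arbitrary:

import re

html = '<b>one</b> and <i>two</i>'
print re.findall(r'<.+>', html)    # greedy: ['<b>one</b> and <i>two</i>']
print re.findall(r'<.+?>', html)   # non-greedy: ['<b>', '</b>', '<i>', '</i>']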


Make Things Pretty with ggplot2

Packt
24 Feb 2016
30 min read
The objective of this article is to provide you with a general overview of the plotting environments in R and of the most efficient way of coding your graphs in it. We will go through the most important Integrated Development Environments (IDEs) available for R as well as the most important packages available for plotting data; this will help you get an overview of what is available in R and of how those packages compare with ggplot2. Finally, we will dig deeper into the grammar of graphics, which represents the basic concept on which ggplot2 was designed. But first, let's make sure that you have a working version of R on your computer.

(For more resources related to this topic, see here.)

Getting ggplot2 up and running

You can download the most up-to-date version of R from the R project website (http://www.r-project.org/). There, you will find a direct connection to the Comprehensive R Archive Network (CRAN), a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. In addition to access to the CRAN servers, on the website of the R project you may also find information about R, a few technical manuals, the R Journal, and details about the packages developed for R and stored in the CRAN repositories. At the time of writing, the current version of R is 3.1.2.

If you have already installed R on your computer, you can check the actual version with the R.Version() code, or, for a more concise result, you can use the R.version.string code, which recalls only part of the output of the previous function.

Packages in R

In the next few pages of this article, we will quickly go through the most important visualization packages available in R, so in order to try the code, you will also need to have additional packages as well as ggplot2 up and running in your R installation. In the basic R installation, you will already have the graphics package available and loaded in the session; the lattice package is already available among the standard packages delivered with the basic installation, but it is not loaded by default. ggplot2, on the other hand, will need to be installed. You can install and load a package with the following code:

> install.packages("ggplot2")
> library(ggplot2)

Keep in mind that every time R is started, you will need to load the package you need with the library(name_of_the_package) command to be able to use the functions contained in the package. In order to get a list of all the packages installed on your computer, you can call the library() function without arguments. If, on the other hand, you would like to have a list of the packages currently loaded in the workspace, you can use the search() command. One more function that can turn out to be useful when managing your library of packages is .libPaths(), which provides you with the location of your R libraries. This function is very useful for tracing back the package libraries you are currently using, if any, in addition to the standard library of packages, which on Windows is located by default in a path of the kind C:/Program Files/R/R-3.1.2/library.
The following list is a short recap of the functions just discussed:

.libPaths()   # get library location
library()     # see all the packages installed
search()      # see the packages currently loaded

Integrated Development Environments (IDEs)

You will definitely be able to run the code and the examples explained in this article directly from the standard R Graphical User Interface (GUI). However, if you frequently work with R on more complex projects, or simply if you like to keep an eye on the different components of your code, such as scripts, plots, and help pages, you may well consider using an IDE. The number of IDEs that integrate with R is still limited, but some of them are quite efficient, well designed, and open source.

RStudio

RStudio (http://www.rstudio.com/) is a very nice and advanced programming environment developed specifically for R, and it would be my recommended choice of IDE as the R programming environment in most cases. It is available for all the major platforms (Windows, Linux, and Mac OS X), and it can be run on a local machine, such as your computer, or even over the Web using RStudio Server. With RStudio Server, you can connect a browser-based interface (the RStudio IDE) to a version of R running on a remote Linux server.

RStudio allows you to integrate several useful functionalities, particularly if you use R for more complex projects. The way the software interface is organized allows you to keep an eye on the different activities you very often deal with in R, such as working on different scripts, overviewing the installed packages, as well as having easy access to the help pages and the plots generated. This last feature is particularly interesting for ggplot2 since, in RStudio, you will be able to easily access the history of the plots created instead of visualizing only the last created plot, as is the case in the default R GUI.

One other very useful feature of RStudio is code completion. You can, in fact, start typing a command, and upon pressing the Tab key, the interface will provide you with functions matching what you have written. This feature will turn out to be very useful in ggplot2, so you will not necessarily need to remember all the functions, and you will also have guidance for the arguments of the functions. In Figure 1.1, you can see a screenshot from the current version of RStudio (v0.98.1091):

Figure 1.1: This is a screenshot of RStudio on Windows 8

The environment is composed of four different areas:

- Scripting area: In this area, you can open, create, and write scripts.
- Console area: This is the actual R console in which the commands are executed. It is possible to type commands directly here in the console or to write them in a script and then run them on the console (I would recommend the latter option).
- Workspace/History area: In this area, you can find a practical summary of all the objects created in the workspace in which you are working, as well as the history of the typed commands.
- Visualization area: Here, you can easily load packages, open R help files, and, even more importantly, visualize plots.

The RStudio website provides a lot of material on how to use the program, such as manuals, tutorials, and videos, so if you are interested, refer to the website for more details.

Eclipse and StatET

Eclipse (http://www.eclipse.org/) is a very powerful IDE that was mainly developed in Java and initially intended for Java programming.
Subsequently, several extension packages were developed to optimize the programming environment for other languages, such as C++ and Python. Thanks to its original objective of being a tool for advanced programming, this IDE is particularly suited to dealing with very complex programming projects, for instance if you are working on a big project folder with many different scripts. In these circumstances, Eclipse can help you keep your programming scripts in order and have easy access to them. One drawback of such a development environment is probably its large size (around 200 MB) and a slightly slow start-up.

Eclipse does not support interaction with R natively, so in order to be able to write your code and execute it directly in the R console, you need to add StatET to your basic Eclipse installation. StatET (http://www.walware.de/goto/statet) is a plugin for the Eclipse IDE, and it offers a set of tools for R coding and package building. More detailed information on how to install Eclipse and StatET and how to configure the connections between R and Eclipse/StatET can be found on the websites of the related projects.

Emacs and ESS

Emacs (http://www.gnu.org/software/emacs/) is a customizable text editor and is very popular, particularly in the Linux environment. Although this text editor comes with a very simple GUI, it is an extremely powerful environment, particularly thanks to the numerous keyboard shortcuts that allow very efficient interaction with the editor once you have had some practice. Even though the user interface of a typical IDE, such as RStudio, is more sophisticated and advanced, Emacs may be useful if you need to work with R on systems with a poor graphical interface, such as servers and terminal windows. Like Eclipse, Emacs does not support interfacing with R by default, so you will need to install an add-on package that enables such a connection: Emacs Speaks Statistics (ESS).

ESS (http://ess.r-project.org/) is designed to support the editing of scripts and interaction with various statistical analysis programs, including R. The objective of the ESS project is to provide efficient text editor support to statistical software, which in some cases comes with a more or less defined GUI, but for which the real power of the language is only accessible through the original scripting language.

The plotting environments in R

R provides a complete series of options for producing graphics, which makes it quite advanced with regard to data visualization. In the next few sections of this article, we will go through the most important R packages for data visualization, quickly discussing some high-level differences and analogies. If you already have some experience with other R packages for data visualization, in particular graphics or lattice, the following sections will provide you with some references and examples of how the code used in those packages compares with that used in ggplot2. Moreover, you will also get an idea of the typical layout of the plots created with a certain package, so you will be able to identify the tool used to create the plots you come across.

The core of graphics visualization in R is in the grDevices package, which provides the basic structure of data plotting, such as the colors and fonts used in the plots.
This graphics engine was then used as the starting point for the development of more advanced and sophisticated packages for data visualization, the most commonly used being graphics and grid. The graphics package is often referred to as the base or traditional graphics environment since, historically, it was the first package for data visualization available in R, and it provides functions that allow the generation of complete plots. The grid package, on the other hand, provides an alternative set of graphics tools. This package does not directly provide functions that generate complete plots, so it is not frequently used directly to generate graphics, but it is used in the development of advanced data visualization packages.

Among the grid-based packages, the most widely used are lattice and ggplot2, although they implement different visualization approaches: Trellis plots in the case of lattice and the grammar of graphics in the case of ggplot2. We will describe these principles in more detail in the coming sections. A diagram representing the connections between the tools just mentioned is shown in Figure 1.2. Just keep in mind that this is not a complete overview of the packages available but simply a small snapshot of the packages we will discuss. Many other packages are built on top of the tools just mentioned, but in the following sections we will focus on the most relevant packages used in data visualization, namely graphics, lattice and, of course, ggplot2. If you would like to get a more complete overview of the graphics tools available in R, you can have a look at the web page of the R project summarizing such tools, http://cran.r-project.org/web/views/Graphics.html.

Figure 1.2: This is an overview of the most widely used R packages for graphics

In order to see some examples of plots in graphics, lattice and ggplot2, we will go through a few examples of different plots over the following pages. The objective of providing these examples is not to make an exhaustive comparison of the three packages but simply to give you an idea of how the code, as well as the default plot layout, looks for these different plotting tools. For these examples, we will use the Orange dataset available in R; to load it into the workspace, simply write the following code:

> data(Orange)

This dataset contains records of the growth of orange trees. You can have a look at the data by recalling its first lines with the following code:

> head(Orange)

You will see that the dataset contains three columns. The first one, Tree, is an ID number indicating the tree on which the measurement was taken, while age and circumference refer to the age in days and the size of the tree in millimeters, respectively. If you want more information about this data, you can have a look at the help page of the dataset by typing the following code:

?Orange

Here, you will find the reference for the data as well as a more detailed description of the variables included.

Standard graphics and grid-based graphics

The existence of these two different graphics environments brings a question to most users' minds: which package should be used, and under which circumstances? For simple and basic plots, where the data simply needs to be represented in a standard plot type (such as a scatter plot, histogram, or boxplot) without any additional manipulation, all the plotting environments are fairly equivalent.
In fact, it would probably be possible to produce the same type of plot with graphics as well as with lattice or ggplot2. Nevertheless, in general, the default graphical output of ggplot2 or lattice will most likely be superior to that of graphics, since both of these packages are designed with the principles of human perception deeply in mind, in order to make the evaluation of the data contained in plots easier. When more complex data needs to be analyzed, the grid-based packages, lattice and ggplot2, offer more sophisticated support for the analysis of multivariate data. On the other hand, these tools require greater effort to become proficient with because of their flexibility and advanced functionality. In both cases, lattice and ggplot2 provide a full set of tools for data visualization, so you will not need to use grid directly in most cases; you will be able to do all your work directly with one of those packages.

Graphics and standard plots

The graphics package was originally developed based on the experience of the graphics environment in R. The approach implemented in this package is based on the principle of the pen-on-paper model, where the plot is drawn in the first function call and, once content is added, it cannot be deleted or modified. In general, the functions available in this package can be divided into high-level and low-level functions. High-level functions are capable of drawing the actual plot, while low-level functions are used to add content to a graph that was already created with a high-level function.

Let's assume that we would like to have a look at how age is related to the circumference of the trees in our Orange dataset; we could simply plot the data on a scatter plot using the high-level function plot(), as shown in the following code:

plot(age~circumference, data=Orange)

This code creates the graph in Figure 1.3. As you will have noticed, we obtained the graph directly with a call to a function that contains the variables to plot in the form y~x and the dataset in which to locate them. As an alternative, instead of using a formula expression, you can use a direct reference to x and y, using code in the form plot(x,y). In this case, you will have to use a direct reference to the data instead of using the data argument of the function. Type in the following code:

plot(Orange$circumference, Orange$age)

Figure 1.3: Simple scatter plot of the Orange dataset using graphics

For the time being, we are not interested in the plot's details, such as the title or the axes, but we will simply focus on how to add elements to the plot we just created. For instance, if we want to include a regression line as well as a smooth line to get an idea of the relation between the data, we should use low-level functions, such as abline() and lines(), to add the additional lines to the plot:

plot(age~circumference, data=Orange)   # Create the basic plot
abline(lm(Orange$age~Orange$circumference), col="blue")
lines(loess.smooth(Orange$circumference,Orange$age), col="red")

The graph generated as the output of this code is shown in Figure 1.4:

Figure 1.4: This is a scatter plot of the Orange data with a regression line (in blue) and a smooth line (in red) realized with graphics

As illustrated, with this package we have built a graph by first calling one function, which draws the main plot frame, and then including additional elements using other functions.
With graphics, only additional elements can be included in the graph, without changing the overall plot frame defined by the plot() function. This ability to add several graphical elements together to create a complex plot is one of the fundamental elements of R, and you will notice how all the different graphical packages rely on this principle. If you are interested in other code examples of plots in graphics, there is also some demo code available in R for this package, which can be run with demo(graphics).
In the coming sections, you will find a quick reference to how you can generate a similar plot using graphics and ggplot2. As will be described in more detail later on, in ggplot2 there are two main functions to realize plots, ggplot() and qplot(). The qplot() function is a wrapper designed to easily create basic plots with ggplot2, and its syntax is similar to that of the plot() function of graphics. Due to its simplicity, this function is the easiest way to start working with ggplot2, so we will use it in the examples in the following sections. The code in these sections also uses our example dataset Orange; in this way, you can run the code directly in your console and see the resulting output.
Scatterplot with individual data points
To generate the plot using graphics, use the following code:
plot(age~circumference, data=Orange)
The preceding code results in the following output:
To generate the plot using ggplot2, use the following code:
qplot(circumference, age, data=Orange)
The preceding code results in the following output:
Scatterplots with the line of one tree
To generate the plot using graphics, use the following code:
plot(age~circumference, data=Orange[Orange$Tree==1,], type="l")
The preceding code results in the following output:
To generate the plot using ggplot2, use the following code:
qplot(circumference, age, data=Orange[Orange$Tree==1,], geom="line")
The preceding code results in the following output:
Scatterplots with the line and points of one tree
To generate the plot using graphics, use the following code:
plot(age~circumference, data=Orange[Orange$Tree==1,], type="b")
The preceding code results in the following output:
To generate the plot using ggplot2, use the following code:
qplot(circumference, age, data=Orange[Orange$Tree==1,], geom=c("line","point"))
The preceding code results in the following output:
Boxplot of orange dataset
To generate the plot using graphics, use the following code:
boxplot(circumference~Tree, data=Orange)
The preceding code results in the following output:
To generate the plot using ggplot2, use the following code:
qplot(Tree, circumference, data=Orange, geom="boxplot")
The preceding code results in the following output:
Boxplot with individual observations
To generate the plot using graphics, use the following code:
boxplot(circumference~Tree, data=Orange)
points(circumference~Tree, data=Orange)
The preceding code results in the following output:
To generate the plot using ggplot2, use the following code:
qplot(Tree, circumference, data=Orange, geom=c("boxplot","point"))
The preceding code results in the following output:
Histogram of orange dataset
To generate the plot using graphics, use the following code:
hist(Orange$circumference)
The preceding code results in the following output:
To generate the plot using ggplot2, use the following code:
qplot(circumference, data=Orange, geom="histogram")
The preceding code results in the following output:
Histogram with reference line at median value in red
To generate the plot using graphics, use the following code:
hist(Orange$circumference)
abline(v=median(Orange$circumference), col="red")
The preceding code results in the following output:
To generate the plot using ggplot2, use the following code:
qplot(circumference, data=Orange, geom="histogram") + geom_vline(xintercept = median(Orange$circumference), colour="red")
The preceding code results in the following output:
Lattice and the Trellis plots
Along with graphics, the base R installation also includes the lattice package. This package implements a family of techniques known as Trellis graphics, proposed by William Cleveland to visualize complex datasets with multiple variables. The objective of those design principles was to ensure the accurate and faithful communication of the information contained in the data. These principles are embedded in the package and are already evident in the default plot design settings. One interesting feature of Trellis plots is the option of multipanel conditioning, which creates multiple plots by splitting the data on the basis of one variable. A similar option is also available in ggplot2, but in that case, it is called faceting.
In lattice, we also have functions that are able to generate a plot with a single call, but once the plot is drawn, it is already final. Consequently, plot details, as well as any additional elements that need to be included in the graph, need to be specified within the call to the main function. This is done by including all the specifications in the panel function argument. These specifications can be included directly in the main body of the function or specified in an independent function, which is then called; this last option usually generates more readable code, so it will be the approach used in the following examples. For instance, if we want to draw the same plot we just generated in the previous section with graphics, containing the age and circumference of the trees as well as the regression and smooth lines, we need to specify such elements within the function call. You can see an example of the code here; remember that lattice needs to be loaded in the workspace:
require(lattice)               ## Load lattice if needed
myPanel <- function(x,y){
  panel.xyplot(x,y)              # Add the observations
  panel.lmline(x,y, col="blue")  # Add the regression line
  panel.loess(x,y, col="red")    # Add the smooth line
}
xyplot(age~circumference, data=Orange, panel=myPanel)
This code produces the plot in Figure 1.5:
Figure 1.5: This is a scatter plot of the Orange data with the regression line (in blue) and the smooth line (in red) realized with lattice
As you will have noticed, code differences aside, the plot generated does not look very different from the one obtained with graphics. This is because we are not using any special visualization feature of lattice. As mentioned earlier, with this package, we have the option of multipanel conditioning, so let's take a look at this. Let's assume that we want to have the same plot but for the different trees in the dataset. Of course, in this case, you would not need the regression or the smooth line, since there will only be one tree in each plot window, but it could be nice to have the different observations connected.
This is shown in the following code:
myPanel <- function(x,y){
  panel.xyplot(x,y, type="b")   # The observations
}
xyplot(age~circumference | Tree, data=Orange, panel=myPanel)
This code generates the graph shown in Figure 1.6:
Figure 1.6: This is a scatterplot of the Orange data realized with lattice, with one subpanel representing the individual data of each tree. The tree number of each panel is reported in the upper part of the plot area
As illustrated, using the vertical bar |, we are able to obtain the plot conditioned on the value of the variable Tree. In the upper part of the panels, you will notice the reference to the value of the conditioning variable, which, in this case, is the column Tree. As mentioned before, ggplot2 offers this option too, and we will see an example of it in the next section. There, you will find a quick reference to how to convert a typical plot type from lattice to ggplot2. In this case, the examples are adapted to the typical plotting style of the lattice plots.
Scatterplot with individual observations
To plot the graph using lattice, use the following code:
xyplot(age~circumference, data=Orange)
The preceding code results in the following output:
To plot the graph using ggplot2, use the following code:
qplot(circumference, age, data=Orange)
The preceding code results in the following output:
Scatterplot of orange dataset with faceting
To plot the graph using lattice, use the following code:
xyplot(age~circumference|Tree, data=Orange)
The preceding code results in the following output:
To plot the graph using ggplot2, use the following code:
qplot(circumference, age, data=Orange, facets=~Tree)
The preceding code results in the following output:
Faceting scatterplot with line and points
To plot the graph using lattice, use the following code:
xyplot(age~circumference|Tree, data=Orange, type="b")
The preceding code results in the following output:
To plot the graph using ggplot2, use the following code:
qplot(circumference, age, data=Orange, geom=c("line","point"), facets=~Tree)
The preceding code results in the following output:
Scatterplots with grouping data
To plot the graph using lattice, use the following code:
xyplot(age~circumference, data=Orange, groups=Tree, type="b")
The preceding code results in the following output:
To plot the graph using ggplot2, use the following code:
qplot(circumference, age, data=Orange, color=Tree, geom=c("line","point"))
The preceding code results in the following output:
Boxplot of orange dataset
To plot the graph using lattice, use the following code:
bwplot(circumference~Tree, data=Orange)
The preceding code results in the following output:
To plot the graph using ggplot2, use the following code:
qplot(Tree, circumference, data=Orange, geom="boxplot")
The preceding code results in the following output:
Histogram of orange dataset
To plot the graph using lattice, use the following code:
histogram(Orange$circumference, type = "count")
To plot the graph using ggplot2, use the following code:
qplot(circumference, data=Orange, geom="histogram")
The preceding code results in the following output:
Histogram with reference line at median value in red
To plot the graph using lattice, use the following code:
histogram(~circumference, data=Orange, type = "count",
  panel=function(x, ...){
    panel.histogram(x, ...)
    panel.abline(v=median(x), col="red")
  })
The preceding code results in the following output:
To plot the graph using ggplot2, use the following code:
qplot(circumference, data=Orange,
  geom="histogram") + geom_vline(xintercept = median(Orange$circumference), colour="red")
The preceding code results in the following output:
ggplot2 and the grammar of graphics
The ggplot2 package was developed by Hadley Wickham and implements a completely different approach to statistical plots. As is the case with lattice, this package is also based on grid, providing a series of high-level functions that allow the creation of complete plots. The package is an implementation of the ideas described in The Grammar of Graphics by Leland Wilkinson. Briefly, the grammar of graphics assumes that a statistical graphic is a mapping of data to the aesthetic attributes and geometric objects used to represent the data, such as points, lines, bars, and so on. Besides the aesthetic attributes, the plot can also contain statistical transformations or groupings of the data. As in lattice, in ggplot2 we have the possibility of splitting the data by a certain variable, obtaining a representation of each subset of the data in an independent subplot; such a representation in ggplot2 is called faceting. In a more formal way, the main components of the grammar of graphics are the data and its mapping, aesthetics, geometric objects, statistical transformations, scales, coordinates, and faceting:
- The data that must be visualized is mapped to aesthetic attributes, which define how the data should be perceived
- Geometric objects describe what is actually displayed on the plot, such as lines, points, or bars; the geometric objects basically define which kind of plot you are going to draw
- Statistical transformations are applied to the data to group them; examples would be the smooth and regression lines of the previous examples, or the binning of the histograms
- Scales represent the connection between the aesthetic spaces and the actual values that should be represented; scales may also be used to draw legends
- Coordinates represent the coordinate system in which the data is drawn
- Faceting, which we have already mentioned, is the grouping of the data into subsets defined by a value of one variable
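To see how these components translate into code, here is a minimal sketch (an illustration added here, assuming that ggplot2 is loaded and using the Orange data; the axis label is an arbitrary placeholder) in which each layer of the plot corresponds to one element of the grammar. The ggplot() syntax used in it is introduced just below:
ggplot(data=Orange,                               ## Data
  aes(x=circumference, y=age)) +                  ## Aesthetic mapping
  geom_point() +                                  ## Geometric object
  stat_smooth(method="lm", se=FALSE) +            ## Statistical transformation
  scale_x_continuous(name="Circumference (mm)") + ## Scale
  coord_cartesian() +                             ## Coordinate system (the default)
  facet_wrap(~Tree)                               ## Faceting, one subplot per tree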
In ggplot2, there are two main high-level functions capable of directly creating a plot: qplot() and ggplot(). qplot() stands for quick plot, and it is a simple function that serves a purpose similar to that served by the plot() function in graphics. The ggplot() function, on the other hand, is a much more advanced function that allows the user to have more control over the plot layout and details. In our journey into the world of ggplot2, we will see some examples of qplot(), in particular when we go through the different kinds of graphs, but we will dig a lot deeper into ggplot(), since this last function is more suited to advanced examples. If you have a look at the different forums on R programming, there is quite a bit of discussion as to which of these two functions is more convenient to use. My general recommendation would be that it depends on the type of graph you draw more frequently. For simple and standard plots, where only the data should be represented and only minor modifications of standard layouts are required, the qplot() function will do the job. On the other hand, if you need to apply particular transformations to the data, or if you would just like to keep the freedom of controlling and defining the different details of the plot layout, I would recommend that you focus on ggplot().
As you will see, the code between these functions is not completely different, since they are both based on the same underlying philosophy, but the way in which the options are set is quite different, so if you want to adapt a plot from one function to the other, you will essentially need to rewrite your code. If you just want to focus on learning only one of them, I would definitely recommend that you learn ggplot().
In the following code, you will see an example of a plot realized with ggplot2, where you can identify some of the components of the grammar of graphics. The example is realized with the ggplot() function, which allows a more direct comparison with the grammar of graphics, but just after it you will also find the corresponding qplot() code. Both codes generate the graph depicted in Figure 1.7:
require(ggplot2)                             ## Load ggplot2
data(Orange)                                 ## Load the data

ggplot(data=Orange,                          ## Data used
  aes(x=circumference, y=age, color=Tree)) + ## Aesthetic
  geom_point() +                             ## Geometry
  stat_smooth(method="lm", se=FALSE)         ## Statistics

### Corresponding code with qplot()
qplot(circumference, age, data=Orange,       ## Data used
  color=Tree,                                ## Aesthetic mapping
  geom=c("point","smooth"), method="lm", se=FALSE)
This simple example can give you an idea of the role of each portion of code in a ggplot2 graph; you have seen how the main function body creates the connection between the data and the aesthetics we are interested in representing and how, on top of this, you add the components of the plot, in this case the geometry element of points and the statistical element of regression. You can also notice how the components that need to be added to the main function call are included using the + sign. One more thing worth mentioning at this point is that if you run just the main body of the ggplot() function, you will get an error message. This is because this call is not able to generate an actual plot. The plot is actually created only when you include the geometric attribute, which, in this case, is geom_point(). This is perfectly in line with the grammar of graphics since, as we have seen, the geometry represents the actual connection between the data and what is represented on the plot. This is the stage where we specify that the data should be represented as points; before that, nothing was specified about which plot we were interested in drawing.
Figure 1.7: This is an example of plotting the Orange dataset with ggplot2
Summary
To learn more about similar technologies, the following books and videos published by Packt Publishing (https://www.packtpub.com/) are recommended:
ggplot2 Essentials (https://www.packtpub.com/big-data-and-business-intelligence/ggplot2-essentials)
Video: Building Interactive Graphs with ggplot2 and Shiny (https://www.packtpub.com/big-data-and-business-intelligence/building-interactive-graphs-ggplot2-and-shiny-video)
Resources for Article:
Further resources on this subject:
Refresher [article]
Interactive Documents [article]
Driving Visual Analyses Automobile Data (Python) [article]
Protecting Your Bitcoins

Packt
29 Oct 2015
32 min read
In this article by Richard Caetano author of the book Learning Bitcoin, we will explore ways to safely hold your own bitcoin. We will cover the following topics: Storing your bitcoins Working with brainwallet Understanding deterministic wallets Storing Bitcoins in cold storage Good housekeeping with Bitcoin (For more resources related to this topic, see here.) Storing your bitcoins The banking system has a legacy of offering various financial services to its customers. They offer convenient ways to spend money, such as cheques and credit cards, but the storage of money is their base service. For many centuries, banks have been a safe place to keep money. Customers rely on the interest paid on their deposits, as well as on the government insurance against theft and insolvency. Savings accounts have helped make preserving the wealth easy, and accessible to a large population in the western world. Yet, some people still save a portion of their wealth as cash or precious metals, usually in a personal safe at home or in a safety deposit box. They may be those who have, over the years, experienced or witnessed the downsides of banking: government confiscation, out of control inflation, or runs on the bank. Furthermore, a large population of the world does not have access to the western banking system. For those who live in remote areas or for those without credit, opening a bank account is virtually impossible. They must handle their own money properly to prevent loss or theft. In some places of the world, there can be great risk involved. These groups of people, who have little or no access to banking, are called the "underbanked". For the underbanked population, Bitcoin offers immediate access to a global financial system. Anyone with access to the internet or who carries a mobile phone with the ability to send and receive SMS messages, can hold his or her own bitcoin and make global payments. They can essentially become their own bank. However, you must understand that Bitcoin is still in its infancy as a technology. Similar to the Internet of circa 1995, it has demonstrated enormous potential, yet lacks usability for a mainstream audience. As a parallel, e-mail in its early days was a challenge for most users to set up and use, yet today it's as simple as entering your e-mail address and password on your smartphone. Bitcoin has yet to develop through these stages. Yet, with some simple guidance, we can already start realizing its potential. Let's discuss some general guidelines for understanding how to become your own bank. Bitcoin savings In most normal cases, we only keep a small amount of cash in our hand wallets to protect ourselves from theft or accidental loss. Much of our money is kept in checking or savings accounts with easy access to pay our bills. Checking accounts are used to cover our rent, utility bills, and other payments, while our savings accounts hold money for longer-term goals, such as a down payment on buying a house. It's highly advisable to develop a similar system for managing your Bitcoin money. Both local and online wallets provide a convenient way to access your bitcoins for day-to-day transactions. Yet there is the unlikely risk that one could lose his or her Bitcoin wallet due to an accidental computer crash or faulty backup. With online wallets, we run the risk of the website or the company becoming insolvent, or falling victim to cybercrime. 
By developing a reliable system, we can adopt our own personal 'Bitcoin Savings' account to hold our funds for long-term storage. Usually, these savings are kept offline to protect them from any kind of computer hacking. With protected access to our offline storage, we can periodically transfer money to and from our savings. Thus, we can arrange our Bitcoin funds much as we manage our money with our hand wallets and checking/savings accounts. Paper wallets As explained, a private key is a large random number that acts as the key to spend your bitcoins. A cryptographic algorithm is used to generate a private key and, from it, a public address. We can share the public address to receive bitcoins and, with the private key, spend the funds sent to the address. Generally, we rely on our Bitcoin wallet software to handle the creation and management of our private keys and public addresses. As these keys are stored on our computers and networks, they are vulnerable to hacking, hardware failures, and accidental loss. Private keys and public addresses are, in fact, just strings of letters and numbers. This format makes it easy to move the keys offline for physical storage. Keys printed on paper are called "paper wallets" and are highly portable and convenient to store in a physical safe or a bank safety deposit box. With the private key generated and stored offline, we can safely send bitcoin to its public address. A paper wallet must include at least one private key and its computed public address. Additionally, the paper wallet can include a QR code to make it convenient to retrieve the key and address. Figure 3.1 is an example of a paper wallet generated by Coinbase: Figure 3.1 - Paper wallet generated from Coinbase The paper wallet includes both the public address (labeled Public key) and the private key, both with QR codes to easily transfer them back to your online wallet. Also included on the paper wallet is a place for notes. This type of wallet is easy to print for safe storage. It is recommended that copies are stored securely in multiple locations in case the paper is destroyed. As the private key is shown in plain text, anyone who has access to this wallet has access to the funds. Do not store your paper wallet on your computer. Loss of the paper wallet due to hardware failure, hacking, spyware, or accidental loss can result in the complete loss of your bitcoin. Make sure you have multiple copies of your wallet printed and securely stored before transferring your money. One time use paper wallets Transactions from bitcoin addresses must include the full amount. When sending a partial amount to a recipient, the remaining balance must be sent to a change address. Paper wallets that include only one private key are considered to be "one time use" paper wallets. While you can always send multiple transfers of bitcoin to the wallet, it is highly recommended that you spend the coins only once. Therefore, you shouldn't move a large number of bitcoins to the wallet expecting to spend a partial amount. With this in mind, when using one-time use paper wallets, it's recommended that you only save a usable amount to each wallet. This amount could be a block of coins that you'd like to fully redeem to your online wallet. Creating a paper wallet To create a paper wallet in Coinbase, simply log in with your username and password. Click on the Tools link on the left-hand side menu. Next, click on the Paper Wallets link from the top menu.
Coinbase will prompt you to Generate a paper wallet and Import a paper wallet. Follow the links to generate a paper wallet. You can expect to see the paper wallet rendered, as shown in figure 3.2: Figure 3.2 - Creating a paper wallet with Coinbase Coinbase generates your paper wallet entirely within your browser, without sending the private key back to its server. This is important to protect your private key from exposure to the network. You are generating the only copy of your private key. Make sure that you print and securely store multiple copies of your paper wallet before transferring any money to it. Loss of your wallet and private key will result in the loss of your bitcoin. By clicking the Regenerate button, you can generate multiple paper wallets and store various amounts of bitcoin on each wallet. Each wallet is easily redeemable in full at Coinbase or with other bitcoin wallet services. Verifying your wallet's balance After generating and printing multiple copies of your paper wallet, you're ready to transfer your funds. Coinbase will prompt you with an easy option to transfer the funds from your Coinbase wallet to your paper wallet: Figure 3.3 - Transferring funds to your paper wallet Figure 3.3 shows Coinbase's prompt to transfer your funds. It provides options to enter your amount in BTC or USD. Simply specify your amount and click Send. Note that Coinbase only keeps a copy of your public address. You can continue to send additional amounts to your paper wallet using the same public address. For your first time working with paper wallets, it's advisable to send only small amounts of bitcoin, to learn and experiment with the process. Once you feel comfortable with creating and redeeming paper wallets, you can feel secure transferring larger amounts. To verify that the funds have been moved to your paper wallet, we can use a blockchain explorer to check that the transfer has been confirmed by the network. Blockchain explorers make all the transaction data from the Bitcoin network available for public review. We'll use a service called Blockchain.info to verify our paper wallet. Simply open www.blockchain.info in your browser and enter the public address from your paper wallet in the search box. If found, Blockchain.info will display a list of the transaction activity on that address: Figure 3.4 - Blockchain.info showing transaction activity Shown in figure 3.4 is the transaction activity for the address starting with 16p9Lt. You can quickly see the total bitcoin received and the current balance. Under the Transactions section, you can find the details of the transactions recorded by the network. Also listed are the public addresses that were combined by the wallet software, as well as the change address used to complete the transfer. Note that at least six confirmations are required before the transaction is considered confirmed. Importing versus sweeping When importing your private key, the wallet software will simply add the key to its list of private keys. Your bitcoin wallet will manage your list of private keys. When sending money, it will combine the balances from multiple addresses to make the transfer. Any remaining amount will be sent back to the change address. The wallet software will automatically manage your change addresses. Some Bitcoin wallets offer the ability to sweep your private key. This involves a second step.
After importing your private key, the wallet software will make a transaction to move the full balance of your funds to a new address. This process will empty your paper wallet completely. The step to transfer the funds may require additional time to allow the network to confirm your transaction. This process could take up to one hour. In addition to the confirmation time, a small miner's fee may be applied. This fee could be in the amount of 0.0001 BTC. If you are certain that you are the only one with access to the private key, it is safe to use the import feature. However, if you believe someone else may have access to the private key, sweeping is highly recommended. Listed in the following table are some common bitcoin wallets that support importing a private key:
Coinbase (https://www.coinbase.com/): provides direct integration between your online wallet and your paper wallet. Sweeping: No
Electrum (https://electrum.org): provides the ability to import and sweep your private key for easy access to your wallet's funds. Sweeping: Yes
Armory (https://bitcoinarmory.com/): provides the ability to import your private key or "sweep" the entire balance. Sweeping: Yes
Multibit (https://multibit.org/): directly imports your private key. It may use a built-in address generator for change addresses. Sweeping: No
Table 1 - Wallets that support importing private keys
Importing your paper wallet To import your wallet, simply log into your Coinbase account. Click on Tools from the left-hand side menu, followed by Paper Wallet from the top menu. Then, click on the Import a paper wallet button. You will be prompted to enter the private key of your paper wallet, as shown in figure 3.5: Figure 3.5 - Coinbase importing from a paper wallet Simply enter the private key from your paper wallet. Coinbase will validate the key and ask you to confirm your import. If accepted, Coinbase will import your key and sweep your balance. The full amount will be transferred to your bitcoin wallet and become available after six confirmations. Paper wallet guidelines Paper wallets display your public and private keys in plain text. Make sure that you keep these documents secure. While you can send funds to your wallet multiple times, it is highly recommended that you spend your balance only once and in full. Before sending large amounts of bitcoin to a paper wallet, make sure you test your ability to generate and import a paper wallet with small amounts of bitcoin. When you're comfortable with the process, you can rely on paper wallets for larger amounts. As paper is easily destroyed or ruined, make sure that you keep multiple copies of your paper wallet in different locations. Make sure each location is secure from unwanted access. Be careful with online wallet generators. A malicious site operator can obtain the private key from your web browser. Only use trusted paper wallet generators. You can test an online paper wallet generator by opening the page in your browser while online, and then disconnecting your computer from the network. You should be able to generate your paper wallet when completely disconnected from the network, ensuring that your private keys are never sent back to the network. Coinbase is an exception in that it only sends the public address back to the server for reference. This public address is saved to make it easy to transfer funds to your paper wallet. The private key is never saved by Coinbase when generating a paper wallet.
Paper wallet services In addition to the services mentioned, there are other services that make paper wallets easy to generate and print. Listed next in Table 2 are just a few:
BitAddress (bitaddress.org): offers the ability to generate single wallets, bulk wallets, brainwallets, and more.
Bitcoin Paper Wallet (bitcoinpaperwallet.com): offers a nice, stylish design and easy-to-use features. Users can purchase holographic stickers for securing the paper wallets.
Wallet Generator (walletgenerator.net): offers printable paper wallets that fold nicely to conceal the private keys.
Table 2 - Services for generating paper wallets and brainwallets
Brainwallets Storing our private keys offline by using a paper wallet is one way we can protect our coins from attacks on the network. Yet, having a physical copy of our keys is similar to holding a gold bar: it's still vulnerable to theft if the attacker can physically obtain the wallet. One way to protect bitcoins from online or offline theft is to have the codes recallable by memory. As holding long random private keys in memory is quite difficult, even for the best of minds, we'll have to use another method to generate our private keys. Creating a brainwallet A brainwallet is a way to create one or more private keys from a long phrase of random words. From the phrase, called a passphrase, we're able to generate a private key, along with its public address, to store bitcoin. We can create any passphrase we'd like. The longer the phrase and the more random the characters, the more secure it will be. Brainwallet phrases should contain at least 12 words. It is very important that the phrase never comes from anything published, such as a book or a song. Hackers actively search for possible brainwallets by performing brute force attacks on commonly-published phrases. Here is an example of a brainwallet passphrase: gently continue prepare history bowl shy dog accident forgive strain dirt consume Note that the phrase is composed of 12 seemingly random words. One could use an easy-to-remember sentence rather than 12 words. Regardless of whether you record your passphrase on paper or memorize it, the idea is to use a passphrase that's easy to recall and type, yet difficult to crack. Don't let this happen to you: "Just lost 4 BTC out of a hacked brain wallet. The pass phrase was a line from an obscure poem in Afrikaans. Somebody out there has a really comprehensive dictionary attack program running." Reddit Thread (http://redd.it/1ptuf3) Unfortunately, this user lost their bitcoin because they chose a published line from a poem. Make sure that you choose a passphrase that is composed of multiple components of non-published text. Sadly, although warned, some users may resort to simple phrases that are easy to crack. Simple passwords such as 123456, password1, and iloveyou are still commonly used for e-mail and login accounts, and they are routinely cracked. Do not use simple passwords for your brainwallet passphrase. Make sure that you use at least 12 words with additional characters and numbers. Using the preceding passphrase, we can generate our private key and public address using the many tools available online. We'll use an online service called BitAddress to generate the actual brainwallet from the passphrase. Simply open www.bitaddress.org in your browser. At first, BitAddress will ask you to move your mouse cursor around to collect enough random points to generate a seed for generating random numbers.
This process could take a minute or two. Once opened, select the option Brain Wallet from the top menu. In the form presented, enter the passphrase and then enter it again to confirm. Click on View to see your private key and public address. For the example shown in figure 3.6, we'll use the preceding passphrase example: Figure 3.6 - BitAddress's brainwallet feature From the page, you can easily copy and paste the public address and use it for receiving bitcoin. Later, when you're ready to spend the coins, enter the same exact passphrase to generate the same private key and public address. Referring to our Coinbase example from earlier in the article, we can then import the private key into our wallet. Increasing brainwallet security As an early attempt to give people a way to "memorize" their Bitcoin wallet, brainwallets have become a target for hackers. Some users have chosen phrases or sentences from common books as their brainwallet. Unfortunately, hackers who had access to large amounts of computing power were able to search for these phrases and were able to crack some brainwallets. To improve the security of brainwallets, other derivation methods have been developed. One service, called brainwallet.io, executes a time-intensive cryptographic function over the brainwallet phrase to create a seed that is very difficult to crack. It's important to know that the passphrases used with BitAddress are not compatible with brainwallet.io. To use brainwallet.io to generate a more secure brainwallet, open http://brainwallet.io: Figure 3.7 - brainwallet.io, a more secure brainwallet generator Brainwallet.io needs a sufficient amount of entropy to generate a private key that is difficult to reproduce. Entropy, in computer science, describes data in terms of its predictability. When data has high entropy, it is difficult to reproduce from known sources. When generating private keys, it's very important to use data that has high entropy. For generating brainwallet keys, we need data with high entropy, yet it should be easy for us to duplicate. To meet this requirement, brainwallet.io accepts your random passphrase, or can generate one from a list of random words. Additionally, it can use data from a file of your choice. Either way, the more entropy given, the stronger your passphrase will be. If you specify a passphrase, choosing at least 12 words is recommended. Next, brainwallet.io prompts you for a salt, available in several forms: login info, personal info, or generic. Salts are used to add additional entropy to the generation of your private key. Their purpose is to prevent standard dictionary attacks against your passphrase. While using brainwallet.io, this information is never sent to the server. When ready, click the generate button, and the page will begin computing a scrypt function over your passphrase. Scrypt is a cryptographic function that requires computing time to execute. Due to the time required for each pass, it makes brute force attacks very difficult. brainwallet.io makes many thousands of passes to ensure that a strong seed is generated for the private key. Once finished, your new private key and public address, along with their QR codes, will be displayed for easy printing. As an alternative, WarpWallet is available at https://keybase.io/warp. WarpWallet also computes a private key based on many thousands of scrypt passes over a passphrase and salt combination.
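To make the difference between the two approaches more concrete, here is a rough Python sketch (an illustration added here, not taken from the book; hashlib.scrypt requires Python 3.6+ built against OpenSSL). A classic brainwallet derives its candidate 256-bit private key from a single SHA-256 hash of the passphrase, while the strengthened services described above run a deliberately slow, memory-hard scrypt derivation over the passphrase plus a salt. The salt value and the scrypt parameters below are illustrative assumptions, not the settings of brainwallet.io or WarpWallet, and turning the resulting number into a spendable Bitcoin private key and address involves further encoding steps that those services handle for you.
import hashlib

passphrase = "gently continue prepare history bowl shy dog accident forgive strain dirt consume"
salt = "user@example.com"   # hypothetical salt (login info, personal info, and so on)

# Classic brainwallet: one fast SHA-256 pass over the passphrase
fast_key = hashlib.sha256(passphrase.encode("utf-8")).hexdigest()
print("SHA-256 brainwallet key:", fast_key)

# Strengthened variant: a slow, memory-hard scrypt derivation over passphrase + salt
# (n, r, p, and dklen are illustrative values, not those of any specific service)
slow_key = hashlib.scrypt(passphrase.encode("utf-8"),
                          salt=salt.encode("utf-8"),
                          n=2**14, r=8, p=1, dklen=32).hex()
print("scrypt-derived key:", slow_key)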
Remember that brainwallet.io passphrases are not compatible with WarpWallet passphrases. Deterministic wallets We have introduced brainwallets that yield one private key and public address. They are designed for one time use and are practical for holding a fixed amount of bitcoin for a period of time. Yet, if we're making lots of transactions, it would be convenient to have the ability to generate unlimited public addresses so that we can use them to receive bitcoin from different transactions or to generate change addresses. A Type 1 Deterministic Wallet is a simple wallet schema based on a passphrase with an index appended. By incrementing the index, an unlimited number of addresses can be created. Each new address is indexed so that its private key can be quickly retrieved. Creating a deterministic wallet To create a deterministic wallet, simply choose a strong passphrase, as previously described, and then append a number to represent an individual private key and public address. It's practical to do this with a spreadsheet so that you can keep a list of public addresses on file. Then, when you want to spend the bitcoin, you simply regenerate the private key using the index. Let's walk through an example. First, we choose the passphrase: "dress retreat save scratch decide simple army piece scent ocean hand become" Then, we append an index, sequential number, to the passphrase: "dress retreat save scratch decide simple army piece scent ocean hand become0" "dress retreat save scratch decide simple army piece scent ocean hand become1" "dress retreat save scratch decide simple army piece scent ocean hand become2" "dress retreat save scratch decide simple army piece scent ocean hand become3" "dress retreat save scratch decide simple army piece scent ocean hand become4" Then, we take each passphrase, with the corresponding index, and run it through brainwallet.io, or any other brainwallet service, to generate the public address. Using a table or a spreadsheet, we can pre-generate a list of public addresses to receive bitcoin. Additionally, we can add a balance column to help track our money: Index Public Address Balance 0 1Bc2WZ2tiodYwYZCXRRrvzivKmrGKg2Ub9 0.00 1 1PXRtWnNYTXKQqgcxPDpXEvEDpkPKvKB82 0.00 2 1KdRGNADn7ipGdKb8VNcsk4exrHZZ7FuF2 0.00 3 1DNfd491t3ABLzFkYNRv8BWh8suJC9k6n2 0.00 4 17pZHju3KL4vVd2KRDDcoRdCs2RjyahXwt 0.00  Table 3 - Using a spreadsheet to track deterministic wallet addresses Spending from a deterministic wallet When we have money available in our wallet to spend, we can simply regenerate the private key for the index matching the public address. For example, let's say we have received 2BTC on the address starting with 1KdRGN in the preceding table. Since we know it belongs to index #2, we can reopen the brainwallet from the passphrase: "dress retreat save scratch decide simple army piece scent ocean hand become2" Using brainwallet.io as our brainwallet service, we quickly regenerate the original private key and public address: Figure 3.8 - Private key re-generated from a deterministic wallet Finally, we import the private key into our Bitcoin wallet, as described earlier in the article. If we don't want to keep the change in our online wallet, we can simply send the change back to the next available public address in our deterministic wallet. Pre-generating public addresses with deterministic wallets can be useful in many situations. Perhaps you want to do business with a partner and want to receive 12 payments over the course of one year. 
You can simply regenerate the 12 addresses and keep track of each payment using a spreadsheet. Another example could apply to an e-commerce site. If you'd like to receive payment for the goods or services being sold, you can pre-generate a long list of addresses. Only storing the public addresses on your website protects you from malicious attack on your web server. While Type 1 deterministic wallets are very useful, we'll introduce a more advanced version called the Type 2 Hierarchical Deterministic Wallet next. Type 2 Hierarchical Deterministic wallets Type 2 Hierarchical Deterministic (HD) wallets function similarly to Type 1 deterministic wallets, as they are able to generate an unlimited amount of private keys from a single passphrase, but they offer more advanced features. HD wallets are used by desktop, mobile, and hardware wallets as a way of securing an unlimited number of keys by a single passphrase. HD wallets are secured by a root seed. The root seed, generated from entropy, can be a number up to 64 bytes long. To make the root seed easier to save and recover, a phrase consisting of a list of mnemonic code words is rendered. The following is an example of a root seed: 01bd4085622ab35e0cd934adbdcce6ca To render the mnemonic code words, the root seed number plus its checksum is combined and then divided into groups of 11 bits. Each group of bits represents an index between 0 and 2047. The index is then mapped to a list of 2,048 words. For each group of bits, one word is listed, as shown in the following example, which generates the following phrase: essence forehead possess embarrass giggle spirit further understand fade appreciate angel suffocate BIP-0039 details the specifications for creating mnemonic code words to generate a deterministic key, and is available at https://en.bitcoin.it/wiki/BIP_0039. In the HD wallet, the root seed is used to generate a master private key and a master chain code. The master private key is used to generate a master public key, as with normal Bitcoin private keys and public keys. These keys are then used to generate additional children keys in a tree-like structure. Figure 3.9 illustrates the process of creating the master keys and chain code from a root seed: Figure 3.9 - Generating an HD Wallet's root seed, code words, and master keys Using a child key derivation function, children keys can be generated from the master or parent keys. An index is then combined with the keys and the chain code to generate and organize parent/child relationships. From each parent, two billion children keys can be created, and from each child's private key, the public key and public address can be created. In addition to generating a private key and a public address, each child can be used as a parent to generate its own list of child keys. This allows the organization of the derived keys in a tree-like structure. Hierarchically, an unlimited amount of keys can be created in this way. Figure 3.10 - The relationship between master seed, parent/child chains, and public addresses HD wallets are very practical as thousands of keys and public addresses can be managed by one seed. The entire tree of keys can be backed up and restored simply by the passphrase. HD wallets can be organized and shared in various useful ways. For example, in a company or organization, a parent key and chain code could be issued to generate a list of keys for each department. Each department would then have the ability to render its own set of private/public keys. 
Alternatively, a public parent key can be given out to generate child public keys, but not the private keys. This can be useful, for example, in an audit. The organization may want the auditor to verify the balances of a set of public keys, but without access to the private keys for spending. Another use case for generating public keys from a parent public key is e-commerce. As in the example mentioned previously, you may have a website and would like to generate an unlimited amount of public addresses. By generating a public parent key for the website, the shopping cart can create new public addresses in real time. HD wallets are very useful for Bitcoin wallet applications. Next, we'll look at a software package called Electrum for setting up an HD wallet to protect your bitcoins. Installing an HD wallet HD wallets are very convenient and practical. To show how we can manage an unlimited number of addresses with a single passphrase, we'll install an HD wallet software package called Electrum. Electrum is an easy-to-use desktop wallet that runs on Windows, OS X, and Linux. It implements a secure HD wallet that is protected by a 12-word passphrase. It is able to synchronize with the blockchain, using servers that index all the Bitcoin transactions, to provide quick updates to your balances. Electrum has some nice features to help protect your bitcoins. It supports multi-signature transactions, that is, transactions that require more than one key to spend coins. Multi-signature transactions are useful when you want to share the responsibility of a Bitcoin address between two or more parties, or to add an extra layer of protection to your bitcoins. Additionally, Electrum has the ability to create a watching-only version of your wallet. This allows you to give access to your public keys to another party without releasing the private keys. This can be very useful for auditing or accounting purposes. To install Electrum, simply open the URL https://electrum.org/#download and follow the instructions for your operating system. On first installation, Electrum will create a new wallet for you, identified by a passphrase. Make sure that you protect this passphrase offline! Figure 3.11 - Recording the passphrase from an Electrum wallet Electrum will proceed by asking you to re-enter the passphrase to confirm you have it recorded. Finally, it will ask you for a password. This password is used to encrypt your wallet's seed and any private keys imported into your wallet on disk. You will need this password any time you send bitcoins from your account. Bitcoins in cold storage If you are responsible for a large amount of bitcoin that may be exposed to online hacking or hardware failure, it is important to minimize your risk. A common schema for minimizing the risk is to split your holdings between a hot wallet and cold storage. A hot wallet refers to the online wallet used for everyday deposits and withdrawals. Based on your customers' needs, you store in it only the minimum needed to cover the daily business. For example, Coinbase claims to hold approximately five percent of the total bitcoins on deposit in their hot wallet. The remaining amount is stored in cold storage. Cold storage is an offline wallet for bitcoin. Addresses are generated, typically from a deterministic wallet, with their passphrase and private keys stored offline. Periodically, depending on day-to-day needs, bitcoins are transferred to and from the cold storage. Additionally, bitcoins may be moved to deep cold storage.
These bitcoins are generally more difficult to retrieve. While cold storage transfer may easily be done to cover the needs of the hot wallet, a deep cold storage schema may involve physically accessing the passphrase / private keys from a safe, a safety deposit box, or a bank vault. The reasoning is to slow down the access as much as possible. Cold storage with Electrum We can use Electrum to create a hot wallet and a cold storage wallet. To exemplify, let's imagine a business owner who wants to accept bitcoin from his PC cash register. For security reasons, he may want to allow access to the generation of new addresses to receive Bitcoin, but not access to spending them. Spending bitcoins from this wallet will be secured by a protected computer. To start, create a normal Electrum wallet on the protected computer. Secure the passphrase and assign a strong password to the wallet. Then, from the menu, select Wallet | Master Public Keys. The key will be displayed as shown in figure 3.12. Copy this number and save it to a USB key. Figure 3.12 - Your Electrum wallet's public master key Your master public key can be used to generate new public keys, but without access to the private keys. As mentioned in the previous examples, this has many practical uses, as in our example with the cash register. Next, from your cash register, install Electrum. On setup, or from File | New/Restore, choose Restore a wallet or import keys and the Standard wallet type: Figure 3.13 - Setting up a cash register wallet with Electrum On the next screen, Electrum will prompt you to enter your public master key. Once accepted, Electrum will generate your wallet from the public master key. When ready, your new wallet will be ready to accept bitcoin without access to the private keys. WARNING: If you import private keys into your Electrum wallet, they cannot be restored from your passphrase or public master key. They have not been generated by the root seed and exist independently in the wallet. If you import private keys, make sure to back up the wallet file after every import. Verifying access to a private key When working with public addresses, it may be important to prove that you have access to a private key. By using Bitcoin's cryptographic ability to sign a message, you can verify that you have access to the key without revealing it. This can be offered as proof from a trustee that they control the keys. Using Electrum's built-in message signing feature, we can use the private key in our wallet to sign a message. The message, combined with the digital signature and public address, can later be used to verify that it was signed with the original private key. To begin, choose an address from your wallet. In Electrum, your addresses can be found under the Addresses tab. Next, right click on an address and choose Sign/verify Message. A dialog box allowing you to sign a message will appear: Figure 3.14 - Electrum's Sign/Verify Message features As shown in figure 3.14, you can enter any message you like and sign it with the private key of the address shown. This process will produce a digital signature that can be shared with others to prove that you have access to the private key. To verify the signature on another computer, simply open Electrum and choose Tools | Sign | Verify Message from the menu. You will be prompted with the same dialog as shown in figure 3.14. Copy and paste the message, the address, and the digital signature, and click Verify. The results will be displayed. 
By requesting a signed message from someone, you can verify that they do, in fact, have control of the private key. This is useful for making sure that the trustee of a cold storage wallet has access to the private keys without releasing or sharing them. Another good use of message signing is to prove that someone has control of some quantity of bitcoin. By signing a message that includes the public address holding the funds, one can show that the party is the owner of the funds. Finally, signing and verifying a message can be useful for testing your backups. You can test your private key and public address completely offline without actually sending bitcoin to the address. Good housekeeping with Bitcoin To ensure the safekeeping of your bitcoin, it's important to protect your private keys by following a short list of best practices: Never store your private keys unencrypted on your hard drive or in the cloud: Unencrypted wallets can easily be stolen by hackers, viruses, or malware. Make sure your keys are always encrypted before being saved to disk. Never send money to a Bitcoin address without a backup of the private keys: It's really important that you have a backup of your private key before sending money to its public address. There are stories of early adopters who have lost significant amounts of bitcoin because of hardware failures or inadvertent mistakes. Always test your backup process by repeating the recovery steps: When setting up a backup plan, make sure to test your plan by backing up your keys, sending a small amount to the address, and recovering the amount from the backup. Message signing and verification is also a useful way to test your private key backups offline. Ensure that you have a secure location for your paper wallets: Unauthorized access to your paper wallets can result in the loss of your bitcoin. Make sure that you keep your wallets in a secure safe, in a bank safety deposit box, or in a vault. It's advisable to keep copies of your wallets in multiple locations. Keep multiple copies of your paper wallets: Paper can easily be damaged by water or direct sunlight. Make sure that you keep multiple copies of your paper wallets in plastic bags, protected from direct light with a cover. Consider writing a testament or will for your Bitcoin wallets: The testament should name who has access to the bitcoin and how they will be distributed. Make sure that you include instructions on how to recover the coins. Never forget your wallet's password or passphrase: This sounds obvious, but it must be emphasized. There is no way to recover a lost password or passphrase. Always use a strong passphrase: A strong passphrase should meet the following requirements: it should be long and difficult to guess; it should not come from a famous publication (literature, holy books, and so on); it should not contain personal information; it should be easy to remember and type accurately; and it should not be reused between sites and applications. Summary So far, we've covered the basics of how to get started with Bitcoin. We've provided a tutorial for setting up an online wallet and for buying bitcoin in 15 minutes. We've covered online exchanges and marketplaces, and how to safely store and protect your bitcoin. Resources for Article: Further resources on this subject: Bitcoins – Pools and Mining [article] Going Viral [article] Introduction to the Nmap Scripting Engine [article]
Linking Section Access to multiple dimensions

Packt
25 Jun 2013
3 min read
(For more resources related to this topic, see here.)
Getting ready
Load the following script:
Product:
LOAD * INLINE [
ProductID, ProductGroup, ProductName
1, GroupA, Great As
2, GroupC, Super Cs
3, GroupC, Mega Cs
4, GroupB, Good Bs
5, GroupB, Busy Bs
];

Customer:
LOAD * INLINE [
CustomerID, CustomerName, Country
1, Gatsby Gang, USA
2, Charly Choc, USA
3, Donnie Drake, USA
4, London Lamps, UK
5, Shylock Homes, UK
];

Sales:
LOAD * INLINE [
CustomerID, ProductID, Sales
1, 2, 3536
1, 3, 4333
1, 5, 2123
2, 2, 4556
2, 4, 1223
2, 5, 6789
3, 2, 1323
3, 3, 3245
3, 4, 6789
4, 2, 2311
4, 3, 1333
5, 1, 7654
5, 2, 3455
5, 3, 6547
5, 4, 2854
5, 5, 9877
];

CountryLink:
Load Distinct Country, Upper(Country) As COUNTRY_LINK
Resident Customer;
Load Distinct Country, 'ALL' As COUNTRY_LINK
Resident Customer;

ProductLink:
Load Distinct ProductGroup, Upper(ProductGroup) As PRODUCT_LINK
Resident Product;
Load Distinct ProductGroup, 'ALL' As PRODUCT_LINK
Resident Product;

//Section Access;
Access:
LOAD * INLINE [
ACCESS, USERID, PRODUCT_LINK, COUNTRY_LINK
ADMIN, ADMIN, *, *
USER, GM, ALL, ALL
USER, CM1, ALL, USA
USER, CM2, ALL, UK
USER, PM1, GROUPA, ALL
USER, PM2, GROUPB, ALL
USER, PM3, GROUPC, ALL
USER, SM1, GROUPB, UK
USER, SM2, GROUPA, USA
];
Section Application;
Note that there is a loop error generated on reload because there is a loop in the data structure.
How to do it…
Follow these steps to link Section Access to multiple dimensions:
1. Add list boxes to the layout for ProductGroup and Country.
2. Add a statistics box for Sales.
3. Remove // to uncomment the Section Access statement.
4. From the Settings menu, open Document Properties and select the Opening tab.
5. Turn on the Initial Data Reduction Based on Section Access option.
6. Reload and save the document.
7. Close QlikView.
8. Re-open QlikView and open the document. Log in as the Country Manager, CM1, user. Note that USA is the only country. Also, the product group, GroupA, is missing; there are no sales of this product group in USA.
9. Close QlikView and then re-open it again. This time, log in as the Sales Manager, SM2. You will not be allowed access to the document.
10. Log into the document as the ADMIN user.
11. Edit the script. Add a second entry for the SM2 user in the Access table as follows:
USER, SM2, GROUPA, USA
USER, SM2, GROUPB, UK
12. Reload, save, and close the document and QlikView.
13. Re-open and log in as SM2. Note the selections.
How it works…
Section Access is really quite simple. The user is connected to the data and the data is reduced accordingly. QlikView allows Section Access tables to be connected to multiple dimensions in the main data structure without causing issues with loops. Each associated field acts in the same way as a selection in the layout. The initial setting for the SM2 user contained values that were mutually exclusive. Because of the default Strict Exclusion setting, the SM2 user cannot log in. We changed the script and included multiple rows for the SM2 user. Intuitively, we might expect that, as the first row did not connect to the data, only the second row would connect to the data. However, each field value is treated as an individual selection and all of the values are included.
There's more…
If we wanted to include solely the composite association of Country and ProductGroup, we would need to derive a composite key in the data set and connect the user to that, as sketched in the example below. In this example, we used the USERID field to test using QlikView logins. However, we would normally use NTNAME to link the user to either a Windows login or a custom login.
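For illustration only, the composite key could be derived with a script fragment along the following lines (this sketch is an addition, not part of the original recipe; it assumes the data model from the Getting ready script, and the table and field names are invented). Because values in Section Access must be in uppercase, the derived key is wrapped in Upper():
MapCustomerCountry:
Mapping Load CustomerID, Country
Resident Customer;

MapProductGroup:
Mapping Load ProductID, ProductGroup
Resident Product;

CompositeLink:
Load Distinct CustomerID, ProductID,
  Upper(ApplyMap('MapCustomerCountry', CustomerID) & '|' &
        ApplyMap('MapProductGroup', ProductID)) As COUNTRY_PRODUCT_LINK
Resident Sales;
The Section Access table would then carry a single COUNTRY_PRODUCT_LINK field with values such as USA|GROUPA or UK|GROUPB. Note that, loaded this way, CompositeLink shares both CustomerID and ProductID with the Sales table, so QlikView will associate them through a synthetic key; in a real model, you would more likely join the derived field directly onto the Sales table.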
Resources for Article:

Further resources on this subject:

- Pentaho Reporting: Building Interactive Reports in Swing [Article]
- Visual ETL Development With IBM DataStage [Article]
- A Python Multimedia Application: Thumbnail Maker [Article]

Segmenting images in OpenCV

Packt
21 Jun 2011
7 min read
OpenCV 2 Computer Vision Application Programming Cookbook

Over 50 recipes to master this library of programming functions for real-time computer vision

OpenCV (Open Source Computer Vision) is an open source library containing more than 500 optimized algorithms for image and video analysis. Since its introduction in 1999, it has been largely adopted as the primary development tool by the community of researchers and developers in computer vision. OpenCV was originally developed at Intel by a team led by Gary Bradski as an initiative to advance research in vision and promote the development of rich, vision-based CPU-intensive applications.

In the previous article by Robert Laganière, author of OpenCV 2 Computer Vision Application Programming Cookbook, we took a look at image processing using morphological filters. In this article we will see how to segment images using watersheds and the GrabCut algorithm.

(For more resources related to this subject, see here.)

Segmenting images using watersheds

The watershed transformation is a popular image processing algorithm that is used to quickly segment an image into homogenous regions. It relies on the idea that when the image is seen as a topological relief, homogeneous regions correspond to relatively flat basins delimitated by steep edges. As a result of its simplicity, the original version of this algorithm tends to over-segment the image, which produces multiple small regions. This is why OpenCV proposes a variant of this algorithm that uses a set of predefined markers which guide the definition of the image segments.

How to do it...

The watershed segmentation is obtained through the use of the cv::watershed function. The input to this function is a 32-bit signed integer marker image in which each non-zero pixel represents a label. The idea is to mark some pixels of the image that are known to certainly belong to a given region. From this initial labeling, the watershed algorithm will determine the regions to which the other pixels belong. In this recipe, we will first create the marker image as a gray-level image, and then convert it into an image of integers. We conveniently encapsulated this step into a WatershedSegmenter class:

    class WatershedSegmenter {
      private:
        cv::Mat markers;
      public:
        void setMarkers(const cv::Mat& markerImage) {
          // Convert to image of ints
          markerImage.convertTo(markers,CV_32S);
        }
        cv::Mat process(const cv::Mat &image) {
          // Apply watershed
          cv::watershed(image,markers);
          return markers;
        }

The way these markers are obtained depends on the application. For example, some preprocessing steps might have resulted in the identification of some pixels belonging to an object of interest. The watershed would then be used to delimitate the complete object from that initial detection. In this recipe, we will simply use the binary image used in the previous article (OpenCV: Image Processing using Morphological Filters) in order to identify the animals of the corresponding original image.

Therefore, from our binary image, we need to identify pixels that certainly belong to the foreground (the animals) and pixels that certainly belong to the background (mainly the grass). Here, we will mark foreground pixels with label 255 and background pixels with label 128 (this choice is totally arbitrary, any label number other than 255 would work). The other pixels, that is the ones for which the labeling is unknown, are assigned value 0.
As it is now, the binary image includes too many white pixels belonging to various parts of the image. We will then severely erode this image in order to retain only pixels belonging to the important objects:

    // Eliminate noise and smaller objects
    cv::Mat fg;
    cv::erode(binary,fg,cv::Mat(),cv::Point(-1,-1),6);

The result is the following image. Note that a few pixels belonging to the background forest are still present. Let's simply keep them. Therefore, they will be considered to correspond to an object of interest. Similarly, we also select a few pixels of the background by a large dilation of the original binary image:

    // Identify image pixels without objects
    cv::Mat bg;
    cv::dilate(binary,bg,cv::Mat(),cv::Point(-1,-1),6);
    cv::threshold(bg,bg,1,128,cv::THRESH_BINARY_INV);

The resulting black pixels correspond to background pixels. This is why the thresholding operation immediately after the dilation assigns to these pixels the value 128. These images are combined to form the marker image:

    // Create markers image
    cv::Mat markers(binary.size(),CV_8U,cv::Scalar(0));
    markers = fg + bg;

Note how we used the overloaded operator+ here in order to combine the images. This is the image that will be used as input to the watershed algorithm. The segmentation is then obtained as follows:

    // Create watershed segmentation object
    WatershedSegmenter segmenter;
    // Set markers and process
    segmenter.setMarkers(markers);
    segmenter.process(image);

The marker image is then updated such that each zero pixel is assigned one of the input labels, while the pixels belonging to the found boundaries have value -1. The result is an image of labels and an image of the boundaries.

How it works...

As we did in the preceding recipe, we will use the topological map analogy in the description of the watershed algorithm. In order to create a watershed segmentation, the idea is to progressively flood the image starting at level 0. As the level of "water" progressively increases (to levels 1, 2, 3, and so on), catchment basins are formed. The size of these basins also gradually increases and, consequently, the water of two different basins will eventually merge. When this happens, a watershed is created in order to keep the two basins separated. Once the level of water has reached its maximal level, the sets of these created basins and watersheds form the watershed segmentation.

As one can expect, the flooding process initially creates many small individual basins. When all of these are merged, many watershed lines are created which results in an over-segmented image. To overcome this problem, a modification to this algorithm has been proposed in which the flooding process starts from a predefined set of marked pixels. The basins created from these markers are labeled in accordance with the values assigned to the initial marks. When two basins having the same label merge, no watersheds are created, thus preventing the oversegmentation.

This is what happens when the cv::watershed function is called. The input marker image is updated to produce the final watershed segmentation. Users can input a marker image with any number of labels with pixels of unknown labeling left to value 0. The marker image has been chosen to be an image of a 32-bit signed integer in order to be able to define more than 255 labels. It also allows the special value -1, to be assigned to pixels associated with a watershed. This is what is returned by the cv::watershed function.
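For readers who prefer to experiment from Python, the same marker-based pipeline can be sketched with OpenCV's Python bindings. The following is a minimal sketch rather than part of the original recipe: the file names are placeholders, and the erosion and dilation counts simply mirror the C++ code above.

    import cv2
    import numpy as np

    # Placeholder file names: a colour image and its binary (thresholded) version.
    image = cv2.imread("animals.jpg")
    binary = cv2.imread("binary.png", cv2.IMREAD_GRAYSCALE)

    # Sure foreground: severe erosion keeps only pixels deep inside the objects (label 255).
    fg = cv2.erode(binary, None, iterations=6)

    # Sure background: large dilation, then an inverse threshold assigns label 128.
    bg = cv2.dilate(binary, None, iterations=6)
    _, bg = cv2.threshold(bg, 1, 128, cv2.THRESH_BINARY_INV)

    # Combine into a 32-bit signed marker image; unknown pixels stay at 0.
    markers = (fg + bg).astype(np.int32)

    # Run the watershed: boundary pixels are set to -1 in the returned markers.
    markers = cv2.watershed(image, markers)

    # Display helpers, equivalent to the getSegmentation() and getWatersheds() methods shown next.
    segmentation = np.uint8(np.clip(markers, 0, 255))
    watersheds = np.uint8(np.where(markers == -1, 0, 255))

The logic is identical to the C++ version: only certainly-foreground and certainly-background pixels are labeled, and the watershed decides the rest.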
To facilitate the displaying of the result, we have introduced two special methods. The first one returns an image of the labels (with watersheds at value 0). This is easily done through thresholding:

    // Return result in the form of an image
    cv::Mat getSegmentation() {
      cv::Mat tmp;
      // all segments with a label higher than 255
      // will be assigned value 255
      markers.convertTo(tmp,CV_8U);
      return tmp;
    }

Similarly, the second method returns an image in which the watershed lines are assigned value 0, and the rest of the image is at 255. This time, the cv::convertTo method is used to achieve this result:

    // Return watershed in the form of an image
    cv::Mat getWatersheds() {
      cv::Mat tmp;
      // Each pixel p is transformed into
      // 255p+255 before conversion
      markers.convertTo(tmp,CV_8U,255,255);
      return tmp;
    }

The linear transformation that is applied before the conversion allows -1 pixels to be converted into 0 (since -1*255+255=0). Pixels with a value greater than 255 are assigned the value 255. This is due to the saturation operation that is applied when signed integers are converted into unsigned chars.

See also

The article The viscous watershed transform by C. Vachier, F. Meyer, Journal of Mathematical Imaging and Vision, volume 22, issue 2-3, May 2005, for more information on the watershed transform.

The next recipe, which presents another image segmentation algorithm that can also segment an image into background and foreground objects.

Oracle GoldenGate 11g: Performance Tuning

Packt
01 Mar 2011
12 min read
Oracle GoldenGate 11g Implementer's guide

Design, install, and configure high-performance data replication solutions using Oracle GoldenGate

- The very first book on GoldenGate, focused on design and performance tuning in enterprise-wide environments
- Exhaustive coverage and analysis of all aspects of the GoldenGate software implementation, including design, installation, and advanced configuration
- Migrate your data replication solution from Oracle Streams to GoldenGate
- Design a GoldenGate solution that meets all the functional and non-functional requirements of your system
- Written in a simple illustrative manner, providing step-by-step guidance with discussion points
- Goes way beyond the manual, appealing to Solution Architects, System Administrators and Database Administrators

Oracle states that GoldenGate can achieve near real-time data replication. However, out of the box, GoldenGate may not meet your performance requirements. Here we focus on the main areas that lend themselves to tuning, especially parallel processing and load balancing, enabling high data throughput and very low latency. Let's start by taking a look at some of the considerations before we start tuning Oracle GoldenGate.

Before tuning GoldenGate

There are a number of considerations we need to be aware of before we start the tuning process. For one, we must consider the underlying system and its ability to perform. Let's start by looking at the source of the data that GoldenGate needs for replication to work: the online redo logs.

Online redo

Before we start tuning GoldenGate, we must look at both the source and target databases and their ability to read/write data. Data replication is I/O intensive, so fast disks are important, particularly for the online redo logs. Redo logs play an important role in GoldenGate: they are constantly being written to by the database and concurrently being read by the Extract process. Furthermore, adding supplemental logging to a database can increase their size by a factor of 4!

Firstly, ensure that only the necessary amount of supplemental logging is enabled on the database. In the case of GoldenGate, the logging of the Primary Key is all that is required.

Next, take a look at the database wait events, in particular the ones that relate to redo. For example, if you are seeing "Log File Sync" waits, this is an indicator that either your disk writes are too slow or your application is committing too frequently, or a combination of both. RAID5 is another common problem for redo log writes. Ideally, these files should be placed on their own mirrored storage such as RAID1+0 (mirrored striped sets) or Flash disks. Many argue this to be a misconception with modern high speed disk arrays, but some production systems are still known to be suffering from redo I/O contention on RAID5.

An adequate number (and size) of redo groups must be configured to prevent "checkpoint not complete" or "cannot allocate new log" warnings appearing in the database instance alert log. This occurs when Oracle attempts to reuse a log file but the checkpoint that would flush the blocks in the DB buffer cache to disk is still required for crash recovery. The database must wait until that checkpoint completes before the online redo log file can be reused, effectively stalling the database and any redo generation.

Large objects (LOBs)

Know your data. LOBs can be a problem in data replication by virtue of their size and the ability to extract, transmit, and deliver the data from source to target.
Tables containing LOB datatypes should be isolated from regular data to use a dedicated Extract, Data Pump, and Replicat process group to enhance throughput. Also ensure that the target table has a primary key to avoid Full Table Scans (FTS), an Oracle GoldenGate best practice. LOB INSERT operations can insert an empty (null) LOB into a row before updating it with the data. This is because a LOB (depending on its size) can spread its data across multiple Logical Change Records, resulting in multiple DML operations required at the target database.

Base lining

Before we can start tuning, we must record our baseline. This will provide a reference point to tune from. We can later look back at our baseline and calculate the percentage improvement made from deploying new configurations. An ideal baseline is to find the "breaking point" of your application requirements. For example, the following questions must be answered:

- What is the maximum acceptable end to end latency?
- What are the maximum application transactions per second we must accommodate?

To answer these questions we must start with a single threaded data replication configuration having just one Extract, one Data Pump, and one Replicat process. This will provide us with a worst case scenario on which to build improvements. Ideally, our data source should be the application itself, inserting, deleting, and updating "real data" in the source database. However, simulated data with the ability to provide throughput profiles will allow us to gauge performance accurately. Application vendors can normally provide SQL injector utilities that simulate the user activity on the system.

Balancing the load across parallel process groups

The GoldenGate documentation states: "The most basic thing you can do to improve GoldenGate's performance is to divide a large number of tables among parallel processes and trails. For example, you can divide the load by schema." This statement is true as the bottleneck is largely due to the serial nature of the Replicat process, having to "replay" transactions in commit order. Although this can be a constraining factor due to transaction dependency, increasing the number of Replicat processes increases performance significantly. However, it is highly recommended to group tables with referential constraints together per Replicat.

The number of parallel processes is typically greater on the target system compared to the source. The number and ratio of processes will vary across applications and environments. Each configuration should be thoroughly tested to determine the optimal balance, but be careful not to over allocate, as each parallel process will consume up to 55 MB. Increasing the number of processes to an arbitrary value will not necessarily improve performance; in fact, it may be worse and you will waste CPU and memory resources. The following data flow diagram shows a load balancing configuration including two Extract processes, three Data Pumps, and five Replicats:

Considerations for using parallel process groups

To maintain data integrity, ensure to include tables with referential constraints between one another in the same parallel process group. It's also worth considering disabling referential constraints on the target database schema to allow child records to be populated before their parents, thus increasing throughput. GoldenGate will always commit transactions in the same order as the source, so data integrity is maintained.
Oracle best practice states that no more than 3 Replicat processes should read the same remote trail file. To avoid contention on Trail files, pair each Replicat with its own Trail files and Extract process. Also, remember that it is easier to tune an Extract process than a Replicat process, so concentrate on your source before moving your focus to the target.

Splitting large tables into row ranges across process groups

What if you have some large tables with a high data change rate within a source schema and you cannot logically separate them from the remaining tables due to referential constraints? GoldenGate provides a solution to this problem by "splitting" the data within the same schema via the @RANGE function. The @RANGE function can be used in the Data Pump and Replicat configuration to "split" the transaction data across a number of parallel processes.

The Replicat process is typically the source of performance bottlenecks because, in its normal mode of operation, it is a single-threaded process that applies operations one at a time by using regular DML. Therefore, to leverage parallel operation and enhance throughput, the more Replicats the better (dependent on the number of CPUs and memory available on the target system).

The RANGE function

The way the @RANGE function works is that it computes a hash value of the columns specified in the input. If no columns are specified, it uses the table's primary key. GoldenGate adjusts the total number of ranges to optimize the even distribution across the number of ranges specified. This concept can be compared to Hash Partitioning in Oracle tables as a means of dividing data.

With any division of data during replication, the integrity is paramount and will have an effect on performance. Therefore, tables having a relationship with other tables in the source schema must be included in the configuration. If all your source schema tables are related, you must include all the tables!

Adding Replicats with @RANGE function

The @RANGE function accepts two numeric arguments, separated by a comma:

- Range: The number assigned to a process group, where the first is 1 and the second 2 and so on, up to the total number of ranges.
- Total number of ranges: The total number of process groups you wish to divide using the @RANGE function.

The following example includes three related tables in the source schema and walks through the complete configuration from start to finish. For this example, we have an existing Replicat process on the target machine (dbserver2) named ROLAP01 that includes the following three tables:

- ORDERS
- ORDER_ITEMS
- PRODUCTS

We are going to divide the rows of the tables across two Replicat groups. The source database schema name is SRC and the target schema is TGT. The following steps add a new Replicat named ROLAP02 with the relevant configuration and adjust Replicat ROLAP01 parameters to suit. Note that before conducting any changes, stop the existing Replicat processes and determine their Relative Byte Address (RBA) and Trail file log sequence number. This is important information that we will use to tell the new Replicat process from which point to start.

First check if the existing Replicat process is running:

    GGSCI (dbserver2) 1> info all

    Program     Status      Group       Lag          Time Since Chkpt
    MANAGER     RUNNING
    REPLICAT    RUNNING     ROLAP01     00:00:00     00:00:02

Stop the existing Replicat process:

    GGSCI (dbserver2) 2> stop REPLICAT ROLAP01

    Sending STOP request to REPLICAT ROLAP01...
    Request processed.

Add the new Replicat process, using the existing trail file.
    GGSCI (dbserver2) 3> add REPLICAT ROLAP02, exttrail ./dirdat/tb
    REPLICAT added.

Now add the configuration by creating a new parameter file for ROLAP02.

    GGSCI (dbserver2) 4> edit params ROLAP02

    --
    -- Example Replicator parameter file to apply changes
    -- to target tables
    --
    REPLICAT ROLAP02
    SOURCEDEFS ./dirdef/mydefs.def
    SETENV (ORACLE_SID=OLAP)
    USERID ggs_admin, PASSWORD ggs_admin
    DISCARDFILE ./dirrpt/rolap02.dsc, PURGE
    ALLOWDUPTARGETMAP
    CHECKPOINTSECS 30
    GROUPTRANSOPS 2000
    MAP SRC.ORDERS, TARGET TGT.ORDERS, FILTER (@RANGE (1,2));
    MAP SRC.ORDER_ITEMS, TARGET TGT.ORDER_ITEMS, FILTER (@RANGE (1,2));
    MAP SRC.PRODUCTS, TARGET TGT.PRODUCTS, FILTER (@RANGE (1,2));

Now edit the configuration of the existing Replicat process, and add the @RANGE function to the FILTER clause of the MAP statement. Note the inclusion of the GROUPTRANSOPS parameter to enhance performance by increasing the number of operations allowed in a Replicat transaction.

    GGSCI (dbserver2) 5> edit params ROLAP01

    --
    -- Example Replicator parameter file to apply changes
    -- to target tables
    --
    REPLICAT ROLAP01
    SOURCEDEFS ./dirdef/mydefs.def
    SETENV (ORACLE_SID=OLAP)
    USERID ggs_admin, PASSWORD ggs_admin
    DISCARDFILE ./dirrpt/rolap01.dsc, PURGE
    ALLOWDUPTARGETMAP
    CHECKPOINTSECS 30
    GROUPTRANSOPS 2000
    MAP SRC.ORDERS, TARGET TGT.ORDERS, FILTER (@RANGE (2,2));
    MAP SRC.ORDER_ITEMS, TARGET TGT.ORDER_ITEMS, FILTER (@RANGE (2,2));
    MAP SRC.PRODUCTS, TARGET TGT.PRODUCTS, FILTER (@RANGE (2,2));

Check that both the Replicat processes exist.

    GGSCI (dbserver2) 6> info all

    Program     Status      Group       Lag          Time Since Chkpt
    MANAGER     RUNNING
    REPLICAT    STOPPED     ROLAP01     00:00:00     00:10:35
    REPLICAT    STOPPED     ROLAP02     00:00:00     00:12:25

Before starting both Replicat processes, obtain the log Sequence Number (SEQNO) and Relative Byte Address (RBA) from the original trail file.

    GGSCI (dbserver2) 7> info REPLICAT ROLAP01, detail

    REPLICAT   ROLAP01   Last Started 2010-04-01 15:35   Status STOPPED
    Checkpoint Lag       00:00:00 (updated 00:12:43 ago)
    Log Read Checkpoint  File ./dirdat/tb000279          <- SEQNO
                         2010-04-08 12:27:00.001016  RBA 43750979   <- RBA

    Extract Source            Begin              End
    ./dirdat/tb000279         2010-04-01 12:47   2010-04-08 12:27
    ./dirdat/tb000257         2010-04-01 04:30   2010-04-01 12:47
    ./dirdat/tb000255         2010-03-30 13:50   2010-04-01 04:30
    ./dirdat/tb000206         2010-03-30 13:50   First Record
    ./dirdat/tb000206         2010-03-30 04:30   2010-03-30 13:50
    ./dirdat/tb000184         2010-03-30 04:30   First Record
    ./dirdat/tb000184         2010-03-30 00:00   2010-03-30 04:30
    ./dirdat/tb000000         *Initialized*      2010-03-30 00:00
    ./dirdat/tb000000         *Initialized*      First Record

Adjust the new Replicat process ROLAP02 to adopt these values, so that the process knows where to start from on startup.

    GGSCI (dbserver2) 8> alter replicat ROLAP02, extseqno 279
    REPLICAT altered.

    GGSCI (dbserver2) 9> alter replicat ROLAP02, extrba 43750979
    REPLICAT altered.

Failure to complete this step will result in either duplicate data or ORA-00001 against the target schema, because GoldenGate will attempt to replicate the data from the beginning of the initial trail file (./dirdat/tb000000) if it exists, else the process will abend.

Start both Replicat processes. Note the use of the wildcard (*).

    GGSCI (dbserver2) 10> start replicat ROLAP*

    Sending START request to MANAGER ...
    REPLICAT ROLAP01 starting

    Sending START request to MANAGER ...
    REPLICAT ROLAP02 starting

Check if both Replicat processes are running.
    GGSCI (dbserver2) 11> info all

    Program     Status      Group       Lag          Time Since Chkpt
    MANAGER     RUNNING
    REPLICAT    RUNNING     ROLAP01     00:00:00     00:00:22
    REPLICAT    RUNNING     ROLAP02     00:00:00     00:00:14

Check the detail of the new Replicat processes.

    GGSCI (dbserver2) 12> info REPLICAT ROLAP02, detail

    REPLICAT   ROLAP02   Last Started 2010-04-08 14:18   Status RUNNING
    Checkpoint Lag       00:00:00 (updated 00:00:06 ago)
    Log Read Checkpoint  File ./dirdat/tb000279
                         First Record  RBA 43750979

    Extract Source            Begin              End
    ./dirdat/tb000279         * Initialized *    First Record
    ./dirdat/tb000279         * Initialized *    First Record
    ./dirdat/tb000279         * Initialized *    2010-04-08 12:26
    ./dirdat/tb000279         * Initialized *    First Record

Generate a report for the new Replicat process ROLAP02.

    GGSCI (dbserver2) 13> send REPLICAT ROLAP02, report

    Sending REPORT request to REPLICAT ROLAP02 ...
    Request processed.

Now view the report to confirm the new Replicat process has started from the specified start point (RBA 43750979 and SEQNO 279). The following is an extract from the report:

    GGSCI (dbserver2) 14> view report ROLAP02

    2010-04-08 14:20:18  GGS INFO  379  Positioning with begin time: Apr 08, 2010 14:18:19 PM,
    starting record time: Apr 08, 2010 14:17:25 PM at extseqno 279, extrba 43750979.
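As an aside that is not part of the original GoldenGate material, the effect of splitting rows across Replicats by hashing a key can be illustrated with a few lines of Python. GoldenGate's internal hash is not public, so the crc32-based helper below is purely a conceptual stand-in for @RANGE(n, total):

    # Conceptual illustration only: this is not GoldenGate's actual hash function.
    import zlib

    def assigned_range(key, total_ranges):
        # Hash the key and map it onto 1..total_ranges, the way @RANGE(n, total)
        # conceptually buckets each row by its primary key.
        return (zlib.crc32(str(key).encode()) % total_ranges) + 1

    # Ten sample primary key values split across two Replicat groups.
    for pk in range(1, 11):
        print("PK", pk, "-> Replicat group", assigned_range(pk, 2))

Rows with the same key always hash to the same group, which is why related tables must be included in the same configuration for integrity to be preserved.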

Using Canvas and D3

Packt
21 Aug 2014
9 min read
In this article by Pablo Navarro Castillo, author of Mastering D3.js, we will see that most of the time we use D3 to create SVG-based charts and visualizations. If the number of elements to render is huge, or if we need to render raster images, it can be more convenient to render our visualizations using the HTML5 canvas element. In this article, we will learn how to use the force layout of D3 and the canvas element to create an animated network visualization with random data.

(For more resources related to this topic, see here.)

Creating figures with canvas

The HTML canvas element allows you to create raster graphics using JavaScript. It was first introduced in HTML5. It enjoys more widespread support than SVG, and can be used as a fallback option. Before diving deeper into integrating canvas and D3, we will construct a small example with canvas.

The canvas element should have the width and height attributes. This alone will create an invisible figure of the specified size:

    <!-- Canvas Element -->
    <canvas id="canvas-demo" width="650px" height="60px"></canvas>

If the browser supports the canvas element, it will ignore any element inside the canvas tags. On the other hand, if the browser doesn't support the canvas, it will ignore the canvas tags, but it will interpret the content of the element. This behavior provides a quick way to handle the lack of canvas support:

    <!-- Canvas Element -->
    <canvas id="canvas-demo" width="650px" height="60px">
        <!-- Fallback image -->
        <img src="img/fallback-img.png" width="650" height="60"></img>
    </canvas>

If the browser doesn't support canvas, the fallback image will be displayed. Note that unlike the <img> element, the canvas closing tag (</canvas>) is mandatory. To create figures with canvas, we don't need special libraries; we can create the shapes using the canvas API:

    <script>
        // Graphic Variables
        var barw = 65,
            barh = 60;

        // Append a canvas element, set its size and get the node.
        var canvas = document.getElementById('canvas-demo');

        // Get the rendering context.
        var context = canvas.getContext('2d');

        // Array with colors, to have one rectangle of each color.
        var color = ['#5c3566', '#6c475b', '#7c584f', '#8c6a44', '#9c7c39',
                     '#ad8d2d', '#bd9f22', '#cdb117', '#ddc20b', '#edd400'];

        // Set the fill color and render ten rectangles.
        for (var k = 0; k < 10; k += 1) {
            // Set the fill color.
            context.fillStyle = color[k];
            // Create a rectangle in incremental positions.
            context.fillRect(k * barw, 0, barw, barh);
        }
    </script>

We use the DOM API to access the canvas element with the canvas-demo ID and to get the rendering context. Then we set the color using the fillStyle method, and use the fillRect canvas method to create a small rectangle. Note that we need to change fillStyle every time or all the following shapes will have the same color. The script will render a series of rectangles, each one filled with a different color: a graphic created with canvas.

Canvas uses the same coordinate system as SVG, with the origin in the top-left corner, the horizontal axis augmenting to the right, and the vertical axis augmenting to the bottom. Instead of using the DOM API to get the canvas node, we could have used D3 to create the node, set its attributes, and created scales for the color and position of the shapes. Note that the shapes drawn with canvas don't exist in the DOM tree; so, we can't use the usual D3 pattern of creating a selection, binding the data items, and appending the elements if we are using canvas.
Creating shapes

Canvas has fewer primitives than SVG. In fact, almost all the shapes must be drawn with paths, and more steps are needed to create a path. To create a shape, we need to open the path, move the cursor to the desired location, create the shape, and close the path. Then, we can draw the path by filling the shape or rendering the outline. For instance, to draw a red semicircle centered in (325, 30) and with a radius of 20, write the following code:

    // Create a red semicircle.
    context.beginPath();
    context.fillStyle = '#ff0000';
    context.moveTo(325, 30);
    context.arc(325, 30, 20, Math.PI / 2, 3 * Math.PI / 2);
    context.fill();

The moveTo method is a bit redundant here, because the arc method moves the cursor implicitly. The arguments of the arc method are the x and y coordinates of the arc center, the radius, and the starting and ending angle of the arc. There is also an optional Boolean argument to indicate whether the arc should be drawn counterclockwise. This draws a basic shape with the canvas API.

Integrating canvas and D3

We will create a small network chart using the force layout of D3 and canvas instead of SVG. To make the graph look more interesting, we will randomly generate the data. We will generate 250 nodes sparsely connected. The nodes and links will be stored as the attributes of the data object:

    // Number of Nodes
    var nNodes = 250,
        createLink = false;

    // Dataset Structure
    var data = {nodes: [], links: []};

We will append nodes and links to our dataset. We will create nodes with a radius attribute, randomly assigning it a value of either 2 or 4 as follows:

    // Iterate in the nodes
    for (var k = 0; k < nNodes; k += 1) {
        // Create a node with a random radius.
        data.nodes.push({radius: (Math.random() > 0.3) ? 2 : 4});
        // Create random links between the nodes.
    }

We will create a link with probability of 0.1 only if the difference between the source and target indexes is less than 8. The idea behind this way to create links is to have only a few connections between the nodes:

    // Create random links between the nodes.
    for (var j = k + 1; j < nNodes; j += 1) {
        // Create a link with probability 0.1
        createLink = (Math.random() < 0.1) && (Math.abs(k - j) < 8);
        if (createLink) {
            // Append a link with variable distance between the nodes
            data.links.push({
                source: k,
                target: j,
                dist: 2 * Math.abs(k - j) + 10
            });
        }
    }

We will use the radius attribute to set the size of the nodes. The links will contain the distance between the nodes and the indexes of the source and target nodes. We will create variables to set the width and height of the figure:

    // Figure width and height
    var width = 650,
        height = 300;

We can now create and configure the force layout. As we did in the previous section, we will set the charge strength to be proportional to the area of each node. This time, we will also set the distance between the links, using the linkDistance method of the layout:

    // Create and configure the force layout
    var force = d3.layout.force()
        .size([width, height])
        .nodes(data.nodes)
        .links(data.links)
        .charge(function(d) { return -1.2 * d.radius * d.radius; })
        .linkDistance(function(d) { return d.dist; })
        .start();

We can create a canvas element now. Note that we should use the node method to get the canvas element, because the append and attr methods will both return a selection, which doesn't have the canvas API methods:

    // Create a canvas element and set its size.
    var canvas = d3.select('div#canvas-force').append('canvas')
        .attr('width', width + 'px')
        .attr('height', height + 'px')
        .node();

We get the rendering context. Each canvas element has its own rendering context. We will use the '2d' context to draw two-dimensional figures. At the time of writing this, there are some browsers that support the webgl context; more details are available at https://developer.mozilla.org/en-US/docs/Web/WebGL/Getting_started_with_WebGL. Refer to the following '2d' context:

    // Get the canvas context.
    var context = canvas.getContext('2d');

We register an event listener for the force layout's tick event. As canvas doesn't remember previously created shapes, we need to clear the figure and redraw all the elements on each tick:

    force.on('tick', function() {
        // Clear the complete figure.
        context.clearRect(0, 0, width, height);

        // Draw the links ...
        // Draw the nodes ...
    });

The clearRect method cleans the figure under the specified rectangle. In this case, we clean the entire canvas. We can draw the links using the lineTo method. We iterate through the links, beginning a new path for each link, moving the cursor to the position of the source node, and creating a line towards the target node. We draw the line with the stroke method:

    // Draw the links
    data.links.forEach(function(d) {
        // Draw a line from source to target.
        context.beginPath();
        context.moveTo(d.source.x, d.source.y);
        context.lineTo(d.target.x, d.target.y);
        context.stroke();
    });

We iterate through the nodes and draw each one. We use the arc method to represent each node with a black circle:

    // Draw the nodes
    data.nodes.forEach(function(d, i) {
        // Draws a complete arc for each node.
        context.beginPath();
        context.arc(d.x, d.y, d.radius, 0, 2 * Math.PI, true);
        context.fill();
    });

We obtain a constellation of disconnected network graphs, slowly gravitating towards the center of the figure, using the force layout and canvas to create a network chart.

We might think that erasing all the shapes and redrawing each shape again and again could have a negative impact on performance. In fact, sometimes it's faster to draw the figures using canvas, because this way the browser doesn't have to manage the DOM tree of the SVG elements (but it still has to redraw them if the SVG elements are changed).

Summary

In this article, we learned how to use the force layout of D3 and the canvas element to create an animated network visualization with random data.

Resources for Article:

Further resources on this subject:

- Interacting with Data for Dashboards [Article]
- Kendo UI DataViz – Advance Charting [Article]
- Visualizing my Social Graph with d3.js [Article]

Putting the Fun in Functional Python

Packt
17 Feb 2016
21 min read
Functional programming defines a computation using expressions and evaluation, often encapsulated in function definitions. It de-emphasizes or avoids the complexity of state change and mutable objects. This tends to create programs that are more succinct and expressive. In this article, we'll introduce some of the techniques that characterize functional programming. We'll identify some of the ways to map these features to Python. Finally, we'll also address some ways in which the benefits of functional programming accrue when we use these design patterns to build Python applications.

Python has numerous functional programming features. It is not a purely functional programming language. It offers enough of the right kinds of features that it confers the benefits of functional programming. It also retains all the optimization power available from an imperative programming language.

We'll also look at a problem domain that we'll use for many of the examples in this book. We'll try to stick closely to Exploratory Data Analysis (EDA) because its algorithms are often good examples of functional programming. Furthermore, the benefits of functional programming accrue rapidly in this problem domain. Our goal is to establish some essential principles of functional programming. We'll focus on Python 3 features in this book. However, some of the examples might also work in Python 2.

(For more resources related to this topic, see here.)

Identifying a paradigm

It's difficult to be definitive on what fills the universe of programming paradigms. For our purposes, we will distinguish between just two of the many programming paradigms: functional programming and imperative programming. One important distinguishing feature between these two is the concept of state.

In an imperative language, like Python, the state of the computation is reflected by the values of the variables in the various namespaces. The values of the variables establish the state of a computation; each kind of statement makes a well-defined change to the state by adding or changing (or even removing) a variable. A language is imperative because each statement is a command, which changes the state in some way. Our general focus is on the assignment statement and how it changes state. Python has other statements, such as global or nonlocal, which modify the rules for variables in a particular namespace. Statements like def, class, and import change the processing context. Other statements like try, except, if, elif, and else act as guards to modify how a collection of statements will change the computation's state. Statements like for and while, similarly, wrap a block of statements so that the statements can make repeated changes to the state of the computation. The focus of all these various statement types, however, is on changing the state of the variables.

Ideally, each statement advances the state of the computation from an initial condition toward the desired final outcome. This "advances the computation" assertion can be challenging to prove. One approach is to define the final state, identify a statement that will establish this final state, and then deduce the precondition required for this final statement to work. This design process can be iterated until an acceptable initial state is derived.

In a functional language, we replace state (the changing values of variables) with a simpler notion of evaluating functions. Each function evaluation creates a new object or objects from existing objects.
Since a functional program is a composition of functions, we can design lower-level functions that are easy to understand, and we can design higher-level compositions that are easier to visualize than a complex sequence of statements.

Function evaluation more closely parallels mathematical formalisms. Because of this, we can often use simple algebra to design an algorithm, which clearly handles the edge cases and boundary conditions. This makes us more confident that the functions work. It also makes it easy to locate test cases for formal unit testing.

It's important to note that functional programs tend to be relatively succinct, expressive, and efficient when compared to imperative (object-oriented or procedural) programs. The benefit isn't automatic; it requires a careful design. This design effort is often easier than functionally similar procedural programming.

Subdividing the procedural paradigm

We can subdivide imperative languages into a number of discrete categories. In this section, we'll glance quickly at the procedural versus object-oriented distinction. What's important here is to see how object-oriented programming is a subset of imperative programming. The distinction between procedural and object-orientation doesn't reflect the kind of fundamental difference that functional programming represents. We'll use code examples to illustrate the concepts. For some, this will feel like reinventing a wheel. For others, it provides a concrete expression of abstract concepts.

For some kinds of computations, we can ignore Python's object-oriented features and write simple numeric algorithms. For example, we might write something like the following to sum selected values over a range of numbers:

    s = 0
    for n in range(1, 10):
        if n % 3 == 0 or n % 5 == 0:
            s += n
    print(s)

We've made this program strictly procedural, avoiding any explicit use of Python's object features. The program's state is defined by the values of the variables s and n. The variable, n, takes on values such that 1 ≤ n < 10. As the loop involves an ordered exploration of values of n, we can prove that it will terminate when n == 10. Similar code would work in C or Java using their primitive (non-object) data types.

We can exploit Python's Object-Oriented Programming (OOP) features and create a similar program:

    m = list()
    for n in range(1, 10):
        if n % 3 == 0 or n % 5 == 0:
            m.append(n)
    print(sum(m))

This program produces the same result but it accumulates a stateful collection object, m, as it proceeds. The state of the computation is defined by the values of the variables m and n. The syntax of m.append(n) and sum(m) can be confusing. It causes some programmers to insist (wrongly) that Python is somehow not purely object-oriented because it has a mixture of the function() and object.method() syntax. Rest assured, Python is purely object-oriented. Some languages, like C++, allow the use of primitive data types such as int, float, and long, which are not objects. Python doesn't have these primitive types. The presence of prefix syntax doesn't change the nature of the language.

To be pedantic, we could fully embrace the object model by subclassing the list class and adding a sum method:

    class SummableList(list):
        def sum(self):
            s = 0
            for v in self.__iter__():
                s += v
            return s

If we initialize the variable, m, with the SummableList() class instead of the list() method, we can use the m.sum() method instead of the sum(m) method. This kind of change can help to clarify the idea that Python is truly and completely object-oriented.
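As a quick illustration that is not in the original text, the earlier accumulation loop can be rewritten around the SummableList subclass defined above:

    # Uses the SummableList class defined above.
    m = SummableList()
    for n in range(1, 10):
        if n % 3 == 0 or n % 5 == 0:
            m.append(n)
    print(m.sum())  # prints 23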
The use of prefix function notation is purely syntactic sugar. All three of these examples rely on variables to explicitly show the state of the program. They rely on the assignment statements to change the values of the variables and advance the computation toward completion. We can insert assert statements throughout these examples to demonstrate that the expected state changes are implemented properly.

The point is not that imperative programming is broken in some way. The point is that functional programming leads to a change in viewpoint, which can, in many cases, be very helpful. We'll show a functional view of the same algorithm. Functional programming doesn't make this example dramatically shorter or faster.

Using the functional paradigm

In a functional sense, the sum of the multiples of 3 and 5 can be defined in two parts:

- The sum of a sequence of numbers
- A sequence of values that pass a simple test condition, for example, being multiples of three and five

The sum of a sequence has a simple, recursive definition:

    def sum(seq):
        if len(seq) == 0:
            return 0
        return seq[0] + sum(seq[1:])

We've defined the sum of a sequence in two cases: the base case states that the sum of a zero length sequence is 0, while the recursive case states that the sum of a sequence is the first value plus the sum of the rest of the sequence. Since the recursive definition depends on a shorter sequence, we can be sure that it will (eventually) devolve to the base case.

The + operator on the last line of the preceding example and the initial value of 0 in the base case characterize the equation as a sum. If we change the operator to * and the initial value to 1, it would just as easily compute a product. Similarly, a sequence of values can have a simple, recursive definition, as follows:

    def until(n, filter_func, v):
        if v == n:
            return []
        if filter_func(v):
            return [v] + until(n, filter_func, v + 1)
        else:
            return until(n, filter_func, v + 1)

In this function, we've compared a given value, v, against the upper bound, n. If v reaches the upper bound, the resulting list must be empty. This is the base case for the given recursion. There are two more cases defined by the given filter_func() function. If the value of v is passed by the filter_func() function, we'll create a very small list, containing one element, and append the remaining values of the until() function to this list. If the value of v is rejected by the filter_func() function, this value is ignored and the result is simply defined by the remaining values of the until() function. We can see that the value of v will increase from an initial value until it reaches n, assuring us that we'll reach the base case soon.

Here's how we can use the until() function to generate the multiples of 3 or 5. First, we'll define a handy lambda object to filter values:

    mult_3_5 = lambda x: x % 3 == 0 or x % 5 == 0

(We will use lambdas to emphasize succinct definitions of simple functions. Anything more complex than a one-line expression requires the def statement.) We can see how this lambda works from the command prompt in the following example:

    >>> mult_3_5(3)
    True
    >>> mult_3_5(4)
    False
    >>> mult_3_5(5)
    True

This function can be used with the until() function to generate a sequence of values, which are multiples of 3 or 5. The until() function for generating a sequence of values works as follows:

    >>> until(10, lambda x: x%3==0 or x%5==0, 0)
    [0, 3, 5, 6, 9]

We can use our recursive sum() function to compute the sum of this sequence of values.
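Putting the two definitions together (this line is not in the original text, but it follows directly from the functions defined above) reproduces the result of the imperative versions:

    >>> sum(until(10, mult_3_5, 0))
    23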
The various functions, such as sum(), until(), and mult_3_5(), are defined as simple recursive functions. The values are computed without resorting to intermediate variables to store state. We'll return to the ideas behind this purely functional recursive function definition in several places. It's important to note here that many functional programming language compilers can optimize these kinds of simple recursive functions. Python can't do the same optimizations.

Using a functional hybrid

We'll continue this example with a mostly functional version of the previous example to compute the sum of the multiples of 3 and 5. Our hybrid functional version might look like the following:

    print(sum(n for n in range(1, 10) if n % 3 == 0 or n % 5 == 0))

We've used nested generator expressions to iterate through a collection of values and compute the sum of these values. The range(1, 10) method is an iterable and, consequently, a kind of generator expression; it generates a sequence of values. The more complex expression, n for n in range(1, 10) if n%3==0 or n%5==0, is also an iterable expression. It produces a set of values. A variable, n, is bound to each value, more as a way of expressing the contents of the set than as an indicator of the state of the computation. The sum() function consumes the iterable expression, creating a final object, 23.

The bound variable doesn't change once a value is bound to it. The variable, n, in the loop is essentially a shorthand for the values available from the range() function. The if clause of the expression can be extracted into a separate function, allowing us to easily repurpose this with other rules. We could also use a higher-order function named filter() instead of the if clause of the generator expression.

As we work with generator expressions, we'll see that the bound variable is at the blurry edge of defining the state of the computation. The variable, n, in this example isn't directly comparable to the variable, n, in the first two imperative examples. The for statement creates a proper variable in the local namespace. The generator expression does not create a variable in the same way as a for statement does:

    >>> sum(n for n in range(1, 10) if n%3==0 or n%5==0)
    23
    >>> n
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'n' is not defined

Because of the way Python uses namespaces, it might be possible to write a function that can observe the n variable in a generator expression. However, we won't. Our objective is to exploit the functional features of Python, not to detect how those features have an object-oriented implementation under the hood.

Looking at object creation

In some cases, it might help to look at intermediate objects as a history of the computation. What's important is that the history of a computation is not fixed. When functions are commutative or associative, then changes to the order of evaluation might lead to different objects being created. This might have performance improvements with no changes to the correctness of the results. Consider this expression:

    >>> 1+2+3+4
    10

We are looking at a variety of potential computation histories with the same result. Because the + operator is commutative and associative, there are a large number of candidate histories that lead to the same result. Of the candidate sequences, there are two important alternatives, which are as follows:

    >>> ((1+2)+3)+4
    10
    >>> 1+(2+(3+4))
    10

In the first case, we fold in values working from left to right.
This is the way Python works implicitly. Intermediate objects 3 and 6 are created as part of this evaluation. In the second case, we fold from right to left. In this case, intermediate objects 7 and 9 are created. In the case of simple integer arithmetic, the two results have identical performance; there's no optimization benefit.

When we work with something like the list append, we might see some optimization improvements when we change the association rules. Here's a simple example:

    >>> import timeit
    >>> timeit.timeit("((([]+[1])+[2])+[3])+[4]")
    0.8846941249794327
    >>> timeit.timeit("[]+([1]+([2]+([3]+[4])))")
    1.0207440659869462

In this case, there's some benefit in working from left to right. What's important for functional design is the idea that the + operator (or add() function) can be used in any order to produce the same results. The + operator has no hidden side effects that restrict the way this operator can be used.

The stack of turtles

When we use Python for functional programming, we embark down a path that will involve a hybrid that's not strictly functional. Python is not Haskell, OCaml, or Erlang. For that matter, our underlying processor hardware is not functional; it's not even strictly object-oriented; CPUs are generally procedural.

All programming languages rest on abstractions, libraries, frameworks and virtual machines. These abstractions, in turn, may rely on other abstractions, libraries, frameworks and virtual machines. The most apt metaphor is this: the world is carried on the back of a giant turtle. The turtle stands on the back of another giant turtle. And that turtle, in turn, is standing on the back of yet another turtle. It's turtles all the way down.

                                                                  – Anonymous
There's no practical end to the layers of abstractions. More importantly, the presence of abstractions and virtual machines doesn't materially change our approach to designing software to exploit the functional programming features of Python. Even within the functional programming community, there are more pure and less pure functional programming languages. Some languages make extensive use of monads to handle stateful things like filesystem input and output. Other languages rely on a hybridized environment that's similar to the way we use Python. We write software that's generally functional with carefully chosen procedural exceptions.

Our functional Python programs will rely on the following three stacks of abstractions:

- Our applications will be functions, all the way down, until we hit the objects
- The underlying Python runtime environment that supports our functional programming is objects, all the way down, until we hit the turtles
- The libraries that support Python are a turtle on which Python stands

The operating system and hardware form their own stack of turtles. These details aren't relevant to the problems we're going to solve.

A classic example of functional programming

As part of our introduction, we'll look at a classic example of functional programming. This is based on the classic paper Why Functional Programming Matters by John Hughes. The article appeared in a paper called Research Topics in Functional Programming, edited by D. Turner, published by Addison-Wesley in 1990. Here's a link to the paper Research Topics in Functional Programming: http://www.cs.kent.ac.uk/people/staff/dat/miranda/whyfp90.pdf

This discussion of functional programming in general is profound. There are several examples given in the paper. We'll look at just one: the Newton-Raphson algorithm for locating the roots of a function. In this case, the function is the square root. It's important because many versions of this algorithm rely on the explicit state managed via loops. Indeed, the Hughes paper provides a snippet of the Fortran code that emphasizes stateful, imperative processing.

The backbone of this approximation is the calculation of the next approximation from the current approximation. The next_() function takes x, an approximation to the sqrt(n) value, and calculates a next value that brackets the proper root. Take a look at the following example:

    def next_(n, x):
        return (x + n/x) / 2

This function computes a series of approximations, where each new value is (x + n/x)/2 computed from the previous one. The distance between the values is halved each time, so they'll quickly converge on the value x such that x = (x + n/x)/2, which means x = sqrt(n). We don't want to call the method next() because this name would collide with a built-in function. We call it the next_() method so that we can follow the original presentation as closely as possible.

Here's how the function looks when used in the command prompt:

    >>> n = 2
    >>> f = lambda x: next_(n, x)
    >>> a0 = 1.0
    >>> [round(x, 4) for x in (a0, f(a0), f(f(a0)), f(f(f(a0))),)]
    [1.0, 1.5, 1.4167, 1.4142]

We've defined the f() method as a lambda that will converge on sqrt(n). We started with 1.0 as the initial value for a0. Then we evaluated a sequence of recursive evaluations: a0, f(a0), f(f(a0)), and so on. We evaluated these functions using a generator expression so that we could round off each value. This makes the output easier to read and easier to use with doctest. The sequence appears to converge rapidly on sqrt(2), approximately 1.4142.
We can write a function which will (in principle) generate an infinite sequence of values converging on the proper square root:

    def repeat(f, a):
        yield a
        for v in repeat(f, f(a)):
            yield v

This function will generate approximations using a function, f(), and an initial value, a. If we provide the next_() function defined earlier, we'll get a sequence of approximations to the square root of the n argument.

The repeat() function expects the f() function to have a single argument; however, our next_() function has two arguments. We can use a lambda object, lambda x: next_(n, x), to create a partial version of the next_() function with one of the two variables bound.

The Python generator functions can't be trivially recursive; they must explicitly iterate over the recursive results, yielding them individually. Attempting to use a simple return repeat(f, f(a)) will end the iteration, returning a generator expression instead of yielding the sequence of values. We have two ways to return all the values instead of returning a generator expression, which are as follows:

- We can write an explicit for loop as follows: for x in some_iter: yield x.
- We can use the yield from statement as follows: yield from some_iter.

Both techniques of yielding the values of a recursive generator function are equivalent. We'll try to emphasize yield from. In some cases, however, the yield with a complex expression will be more clear than the equivalent mapping or generator expression.

Of course, we don't want the entire infinite sequence. We will stop generating values when two values are so close to each other that we can call either one the square root we're looking for. The common symbol for the value which is close enough is the Greek letter Epsilon, ε, which can be thought of as the largest error we will tolerate.

In Python, we'll have to be a little clever about taking items from an infinite sequence one at a time. It works out well to use a simple interface function that wraps a slightly more complex recursion. Take a look at the following code snippet:

    def within(ε, iterable):
        def head_tail(ε, a, iterable):
            b = next(iterable)
            if abs(a - b) <= ε:
                return b
            return head_tail(ε, b, iterable)
        return head_tail(ε, next(iterable), iterable)

We've defined an internal function, head_tail(), which accepts the tolerance, ε, an item from the iterable sequence, a, and the rest of the iterable sequence, iterable. The next item from the iterable is bound to a name, b. If abs(a-b) <= ε, then the two values are close enough together that we've found the square root. Otherwise, we use the b value in a recursive invocation of the head_tail() function to examine the next pair of values.

Our within() function merely seeks to properly initialize the internal head_tail() function with the first value from the iterable parameter. Some functional programming languages offer a technique that will put a value back into an iterable sequence. In Python, this might be a kind of unget() or previous() method that pushes a value back into the iterator. Python iterables don't offer this kind of rich functionality.

We can use the three functions next_(), repeat(), and within() to create a square root function, as follows:

    def sqrt(a0, ε, n):
        return within(ε, repeat(lambda x: next_(n, x), a0))

We've used the repeat() function to generate a (potentially) infinite sequence of values based on the next_(n,x) function. Our within() function will stop generating values in the sequence when it locates two values with a difference less than ε.
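A quick check at the interactive prompt (again, not in the original text; it simply exercises the three functions defined above) shows the composition converging on the square root of 3:

    >>> round(sqrt(1.0, .0001, 3), 4)
    1.7321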
When we use this version of the sqrt() method, we need to provide an initial seed value, a0, and an ε value. An expression like sqrt(1.0, .0001, 3) will start with an approximation of 1.0 and compute the value of sqrt(3) to within 0.0001. For most applications, the initial a0 value can be 1.0. However, the closer it is to the actual square root, the more rapidly this method converges.

The original example of this approximation algorithm was shown in the Miranda language. It's easy to see that there are few profound differences between Miranda and Python. The biggest difference is Miranda's ability to cons a value back into an iterable, doing a kind of unget. This parallelism between Miranda and Python gives us confidence that many kinds of functional programming can be easily done in Python.

Summary

We've looked at programming paradigms with an eye toward distinguishing the functional paradigm from two common imperative paradigms in detail. For more information, take a look at the following books, also by Packt Publishing:

Learning Python (https://www.packtpub.com/application-development/learning-python)
Mastering Python (https://www.packtpub.com/application-development/mastering-python)
Mastering Object-oriented Python (https://www.packtpub.com/application-development/mastering-object-oriented-python)

Resources for Article:

Further resources on this subject:

Saying Hello to Unity and Android [article]
Using Specular in Unity [article]
Unity 3.x Scripting-Character Controller versus Rigidbody [article]
Why choose R for your data mining project

Fatema Patrawala
15 Feb 2018
9 min read
[box type="note" align="" class="" width=""]Our article is an excerpt taken from the book R Data Mining, written by Andrea Cirillo. If you are a budding data scientist or a data analyst with basic knowledge of R, and you want to get into the intricacies of data mining in a practical manner, be sure to check out this book.[/box] In today’s post, we will analyze R's strengths, and understand why it is a savvy idea to learn this programming language for data mining. R's strengths You know that R is really popular, but why? R is not the only data analysis language out there, and neither is it the oldest one; so why is it so popular? If looking at the root causes of R's popularity, we definitely have to mention these three: Open source inside Plugin ready Data visualization friendly Open source inside One of the main reasons the adoption of R is spreading is its open source nature. R binary code is available for everyone to download, modify, and share back again (only in an open source way). Technically, R is released with a GNU general public license, meaning that you can take it and use it for whatever purpose; but you have to share every derivative with a GNU general public license as well. These attributes fit well for almost every target user of a statistical analysis language: Academic user: Knowledge sharing is a must for an academic environment, and having the ability to share work without the worry of copyright and license questions makes R very practical for academic research purposes Business user: Companies are always worried about budget constraints; having professional statistical analysis software at their disposal for free sounds like a dream come true Private user: This user merges together both of the benefits already mentioned, because they will find it great to have a free instrument with which to learn and share their own statistical analyses Plugin ready You could imagine the R language as an expandable board game. You know, games like 7 Wonders or Carcassonne, with a base set of characters and places and further optional places and characters, increasing the choices at your disposal and maximizing the fun. The R language can be compared to this kind of game. There is a base version of R, containing a group of default packages that are delivered along with the standard version of the software (you can skip to the Installing R and writing R code section for more on how to obtain and install it). The functionalities available through the base version are mainly related to file system manipulation, statistical analysis, and data Visualization. While this base version is regularly maintained and updated by the R core team, virtually every R user can add further new functionalities to those available within the package, developing and sharing custom packages. This is basically how the package development and sharing flow works: The R user develops a new package, for example a package introducing a new machine learning algorithm exposed within a freshly published academic paper. The user submits the package to the CRAN repository or a similar repository. The Comprehensive R Archive Network (CRAN) is the official repository for R related documents and packages. Every R user can gain access to the additional features introduced with any given package, installing and loading them into their R environment. 
If the package has been submitted to CRAN, installing and loading the package will result in running just the following two lines of R code (similar commands are available for alternative repositories such as Bioconductor):

install.packages("ggplot2")
library(ggplot2)

As you can see, this is a really convenient and effective way to expand R's functionality, and you will soon see how wide the range of functionality added through additional packages developed by R users is. More than 9,000 packages are available on CRAN, and this number is sure to increase further, making more and more additional features available to the R community.

Data visualization friendly

As a discipline, data visualization encompasses all of the principles and techniques employable to effectively display the information and messages contained within a set of data. Since we are living in an information-heavy age, the ability to effectively and concisely communicate articulated and complex messages through data visualization is a core asset for any professional. This is exactly why R is experiencing a great response in academic and professional fields: the data visualization capabilities of R place it at the cutting edge of these fields.

R has been noticed for its amazing data visualization features right from its beginning; when some of its peers were still drawing x axes built by aggregating + signs, R was already able to produce astonishing 3D plots. Nevertheless, a major improvement of R as a data visualization tool came when Auckland's Hadley Wickham developed the highly famous ggplot2 package based on The Grammar of Graphics, introducing into the R world an organic framework for data visualization tasks.

This package alone introduced the R community to a highly flexible way of producing almost every kind of data visualization, having also been designed as an expandable tool, so that new data visualization techniques can be incorporated as soon as they emerge. Finally, ggplot2 gives you the ability to highly customize your plot, adding every kind of graphical or textual annotation to it.

Nowadays, R is being used by the biggest tech companies, such as Facebook and Google, and by widely circulated publications such as the Economist and the New York Times, to visualize their data and convey their information to their stakeholders and readers.

To sum all this up: should you invest your precious time learning R? If you are a professional or a student who could gain advantages from knowing effective and cutting-edge techniques to manipulate, model, and present data, I can only give you a positive opinion: yes. You should definitely learn R, and consider it a long-term investment, since the points of strength we have seen place it in a great position to further expand its influence in the coming years in every industry and academic field.

Engaging with the community to learn R

Now that we are aware of R's popularity, we need to engage with the community to take advantage of it.
We will look at alternative and non-exclusive ways of engaging with the community: Employing community-driven learning material Asking for help from the community Staying ahead of language developments Employing community-driven learning material: There are two main kinds of R learning materials developed by the community: Papers, manuals, and books Online interactive courses Papers, manuals, and books: The first one is for sure the more traditional one, but you shouldn't neglect it, since those kinds of learning materials are always able to give you a more organic and systematic understanding of the topics they treat. You can find a lot of free material online in the form of papers, manuals, and books. Let me point out to you the more useful ones: Advanced R R for Data Science Introduction to Statistical Learning OpenIntro Statistics The R Journal Online interactive courses: This is probably the most common learning material nowadays. You can find different platforms delivering good content on the R language, the most famous of which are probably DataCamp, Udemy, and Packt itself. What all of them share is a practical and interactive approach that lets you learn the topic directly, applying it through exercises rather than passively looking at someone explaining theoretical stuff. Asking for help from the community: As soon as you start writing your first lines of R code, and perhaps before you even actually start writing it, you will come up with some questions related to your work. The best thing you can do when this happens is to resort to the community to solve those questions. You will probably not be the first one to come up with that question, and you should therefore first of all look online for previous answers to your question. Where should you look for answers? You can look everywhere, but most of the time you will find the answer you are looking for on one of the following (listed by the probability of finding the answer there): Stack Overflow R-help mailing list R packages documentation I wouldn't suggest you look for answers on Twitter, G+, and similar networks, since they were not conceived to handle these kinds of processes and you will expose yourself to the peril of reading answers that are out of date, or simply incorrect, because no review system is considered. If it is the case that you are asking an innovative question never previously asked by anyone, first of all, congratulations! That said, in that happy circumstance, you can ask your question in the same places that you previously looked for answers. Staying ahead of language developments: The R language landscape is constantly changing, thanks to the contributions of many enthusiastic users who take it a step further every day. How can you stay ahead of those changes? This is where social networks come in handy. Following the #rstats hashtag on Twitter, Google+ groups, and similar places, will give you the pulse of the language. Moreover, you will find the R-bloggers aggregator, which delivers a daily newsletter comprised of the R-related blog posts that were published the previous day really useful. Finally, annual R conferences and similar occasions constitute a great opportunity to get in touch with the most notorious R experts, gaining from them useful insights and inspiring speeches about the future of the language. To summarize, we looked why to choose R as your programming language for data mining and how we can engage with the R community. 
If you think this post is useful, you may further check out this book R Data Mining, to leverage data mining techniques across many different industries, including finance, medicine, scientific research, and more.    
Hands-on Tutorial for Getting Started with Amazon SimpleDB

Packt
28 May 2010
5 min read
(For more resources on SimpleDB, see here.) Creating an AWS account In order to start using SimpleDB, you will first need to sign up for an account with AWS. Visit http://aws.amazon.com/ and click on Create an AWS Account. You can sign up either by using your e-mail address for an existing Amazon account, or by creating a completely new account. You may wish to have multiple accounts to separate billing for projects. This could make it easier for you to track billing for separate accounts. After a successful signup, navigate to the main AWS page— http://aws.amazon.com/, and click on the Your Account link at any time to view your account information and make any changes to it if needed. Enabling SimpleDB service for AWS account Once you have successfully set up an AWS account, you must follow these steps to enable the SimpleDB service for your account: Log in to your AWS account. Navigate to the SimpleDB home page—http://aws.amazon.com/simpledb/. Click on the Sign Up For Amazon SimpleDB button on the right side of the page. Provide the requested credit card information and complete the signup process. You have now successfully set up your AWS account and enabled it for SimpleDB. All communication with SimpleDB or any of the Amazon web services must be through either the SOAP interface or the Query/ReST interface. The request messages sent through either of these interfaces is digitally signed by the sending user in order to ensure that the messages have not been tampered within transit, and that they really originate from the sending user. Requests that use the Query/ReST interface will use the access keys for signing the request, whereas requests to the SOAP interface will use the x.509 certificates. Your new AWS account is associated with the following items: A unique 12-digit AWS account number for identifying your account. AWS Access Credentials are used for the purpose of authenticating requests made by you through the ReST Request API to any of the web services provided by AWS. An initial set of keys is automatically generated for you by default. You can regenerate the Secret Access Key at any time if you like. Keep in mind that when you generate a new access key, all requests made using the old key will be rejected. An Access Key ID identifies you as the person making requests to a web service. A Secret Access Key is used to calculate the digital signature when you make requests to the web service. Be careful with your Secret Access Key, as it provides full access to the account, including the ability to delete all of your data. All requests made to any of the web services provided by AWS using the SOAP protocol use the X.509 security certificate for authentication. There are no default certificates generated automatically for you by AWS. You must generate the certificate by clicking on the Create a new Certificate link, then download them to your computer and make them available to the machine that will be making requests to AWS. Public and private key for the x.509 certificate. You can either upload your own x.509 certificate if you already have one, or you can just generate a new certificate and then download it to your computer. Query API and authentication There are two interfaces to SimpleDB. The SOAP interface uses the SOAP protocol for the messages, while the ReST Requests uses HTTP requests with request parameters to describe the various SimpleDB methods and operations. 
In this book, we will be focusing on using the ReST Requests for talking to SimpleDB, as it is a much simpler protocol and utilizes straightforward HTTP-based requests and responses for communication, and the requests are sent to SimpleDB using either a HTTP GET or POST method. The ReST Requests need to be authenticated in order to establish that they are originating from a valid SimpleDB user, and also for accounting and billing purposes. This authentication is performed using your access key identifiers. Every request to SimpleDB must contain a request signature calculated by constructing a string based on the Query API and then calculating an RFC 2104-compliant HMAC-SHA1 hash, using the Secret Access Key. The basic steps in the authentication of a request by SimpleDB are: You construct a request to SimpleDB. You use your Secret Access Key to calculate the request signature, a Keyed-Hashing for Message Authentication code (HMAC) with an SHA1 hash function. You send the request data, the request signature, timestamp, and your Access Key ID to AWS. AWS uses the Access Key ID in the request to look up the associated Secret Access Key. AWS generates a request signature from the request data using the retrieved Secret Access Key and the same algorithm you used to calculate the signature in the request. If the signature generated by AWS matches the one you sent in the request, the request is considered to be authentic. If the signatures are different, the request is discarded, and AWS returns an error response. If the timestamp is older than 15 minutes, the request is rejected. The procedure for constructing your requests is simple, but tedious and time consuming. This overview was intended to make you familiar with the entire process, but don't worry—you will not need to go through this laborious process every single time that you interact with SimpleDB. Instead, we will be leveraging one of the available libraries for communicating with SimpleDB, which encapsulates a lot of the repetitive stuff for us and makes it simple to dive straight into playing with and exploring SimpleDB!
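Although a library will do the signing for you, it can help to see the shape of that step. The following is a minimal Python sketch of the HMAC-SHA1 calculation described above; the way the canonical string is assembled here is a simplified assumption, not the exact SimpleDB wire format, so treat it as an illustration of what the library encapsulates rather than a drop-in implementation:

import base64
import hashlib
import hmac
import urllib.parse

def sign_request(params, secret_key, host="sdb.amazonaws.com", verb="GET", path="/"):
    # Build a canonical query string from the sorted, URL-encoded parameters.
    canonical = "&".join(
        "{0}={1}".format(urllib.parse.quote(str(k), safe="-_.~"),
                         urllib.parse.quote(str(v), safe="-_.~"))
        for k, v in sorted(params.items()))
    # The string to sign combines the HTTP verb, host, path, and query string.
    string_to_sign = "\n".join([verb, host, path, canonical])
    # RFC 2104 HMAC-SHA1 keyed with the Secret Access Key, then Base64-encoded.
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")

The resulting signature would then be attached to the request as an additional parameter, alongside your Access Key ID and a timestamp, before the request is sent.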
What is Hazelcast?

Packt
22 Aug 2013
10 min read
(For more resources related to this topic, see here.) Starting out as usual In most modern software systems, data is the key. For more traditional architectures, the role of persisting and providing access to your system's data tends to fall to a relational database. Typically this is a monolithic beast, perhaps with a degree of replication, although this tends to be more for resilience rather than performance. For example, here is what a traditional architecture might look like (which hopefully looks rather familiar) This presents us with an issue in terms of application scalability, in that it is relatively easy to scale our application layer by throwing more hardware at it to increase the processing capacity. But the monolithic constraints of our data layer would only allow us to do this so far before diminishing returns or resource saturation stunted further performance increases; so what can we do to address this? In the past and in legacy architectures, the only solution would be to increase the performance capability of our database infrastructure, potentially by buying a bigger, faster server or by further tweaking and fettling the utilization of currently available resources. Both options are dramatic, either in terms of financial cost and/or manpower; so what else could we do? Data deciding to hang around In order for us to gain a bit more performance out of our existing setup, we can hold copies of our data away from the primary database and use these in preference wherever possible. There are a number of different strategies we could adopt, from transparent second-level caching layers to external key-value object storage. The detail and exact use of each varies significantly depending on the technology or its place in the architecture, but the main desire of these systems is to sit alongside the primary database infrastructure and attempt to protect it from an excessive load. This would then tend to lead to an increased performance of the primary database by reducing the overall dependency on it. However, this strategy tends to be only particularly valuable as a short-term solution, effectively buying us a little more time before the database once again starts to reach saturation. The other downside is that it only protects our database from read-based load; if our application is predominately write-heavy, this strategy has very little to offer. So our expanded architecture could look a bit like the following figure: Therein lies the problem However, in insulating the database from the read load, we have introduced a problem in the form of a cache consistency issue, in that, how does our local data cache deal with changing data underneath it within the primary database? The answer is rather depressing: it can't! The exact manifestation of any issues will largely depend on the data needs of the application and how frequently the data changes; but typically, caching systems will operate in one of the two following modes to combat the problem: Time bound cache: Holds entries for a defined period (time-to-live or TTL) Write through cache: Holds entries until they are invalidated by subsequent updates Time bound caches almost always have consistency issues, but at least the amount of time that the issue would be present is limited to the expiry time of each entry. However, we must consider the application's access to this data, because if the frequency of accessing a particular entry is less than the cache expiry time of it, the cache is providing no real benefit. 
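To make the time-bound idea concrete, here's a minimal Python sketch of a TTL cache; it illustrates the concept only and is not any particular product's API:

import time

class TTLCache:
    # A minimal time-bound cache: entries silently expire after ttl_seconds.
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, stored_at)

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            # The entry is stale; drop it and let the caller go back to the database.
            del self._store[key]
            return None
        return value

Until an entry expires, it can of course disagree with the primary database, which is exactly the consistency window described above.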
Write through caches are consistent in isolation and can be configured to offer strict consistency, but if multiple write through caches exist within the overall architecture, then there will be consistency issues between them. We can avoid this by having a more intelligent cache, which features a communication mechanism between nodes so that entry invalidations can be propagated to each other.

In practice, an ideal cache would feature a combination of both approaches, so that entries are held for a known maximum time, but invalidations are also passed around as changes are made. So our evolved architecture would look a bit like the following figure:

So far we've had a look through the general issues in scaling our data layer, and introduced strategies to help combat the trade-offs we will encounter along the way; however, the real world isn't quite as simple. There are various cache servers and in-memory database products in this area; however, most of these are stand-alone single instances, perhaps with some degree of distribution bolted on or provided by other supporting technologies. This tends to bring about the same issues we experienced with just our primary database, in that we could encounter resource saturation or capacity issues if the product is a single instance, or, if the distribution doesn't provide consistency control, perhaps inconsistent data that might harm our application.

Breaking the mould

Hazelcast is a radical new approach to data, designed from the ground up around distribution. It embraces a new scalable way of thinking, in that data should be shared around for both resilience and performance, while allowing us to configure the trade-offs surrounding consistency as the data requirements dictate.

The first major feature to understand about Hazelcast is its masterless nature; each node is configured to be functionally the same. The oldest node in the cluster is the de facto leader and manages the membership, automatically delegating which node is responsible for which data. In this way, as new nodes join or drop out, the process is repeated and the cluster rebalances accordingly. This makes Hazelcast incredibly simple to get up and running, as the system is self-discovering, self-clustering, and works straight out of the box.

However, the second feature to remember is that we are persisting data entirely in-memory; this makes it incredibly fast, but this speed comes at a price. When a node is shut down, all the data that was held on it is lost. We combat this risk to resilience through replication, by holding enough copies of a piece of data across multiple nodes. In the event of failure, the overall cluster will not suffer any data loss. By default, the standard backup count is 1, so we can immediately enjoy basic resilience. But don't pull the plug on more than one node at a time, until the cluster has reacted to the change in membership and reestablished the appropriate number of backup copies of data. So when we introduce our new masterless distributed cluster, we get something like the following figure:

We previously identified that multi-node caches tend to suffer from either saturation or consistency issues. In the case of Hazelcast, each node is the owner of a number of partitions of the overall data, so the load will be fairly spread across the cluster. Hence, any saturation would be at the cluster level rather than any individual node. We can address this issue simply by adding more nodes.
In terms of consistency, by default the backup copies of the data are internal to Hazelcast and not directly used, as such we enjoy strict consistency. This does mean that we have to interact with a specific node to retrieve or update a particular piece of data; however, exactly which node that is an internal operational detail and can vary over time — we as developers never actually need to know. If we imagine that our data is split into a number of partitions, that each partition slice is owned by one node and backed up on another, we could then visualize the interactions like the following figure: This means that for data belonging to Partition 1, our application will have to communicate to Node 1, Node 2 for data belonging to Partition 2, and so on. The slicing of the data into each partition is dynamic; so in practice, where there are more partitions than nodes, each node will own a number of different partitions and hold backups for others. As we have mentioned before, all of this is an internal operational detail, and our application does not need to know it, but it is important that we understand what is going on behind the scenes. Moving to new ground So far we have been talking mostly about simple persisted data and caches, but in reality, we should not think of Hazelcast as purely a cache, as it is much more powerful than just that. It is an in-memory data grid that supports a number of distributed collections and features. We can load in data from various sources into differing structures, send messages across the cluster; take out locks to guard against concurrent activity, and listen to the goings on inside the workings of the cluster. Most of these implementations correspond to a standard Java collection, or function in a manner comparable to other similar technologies, but all with the distribution and resilience capabilities already built in. Standard utility collections Map: Key-value pairs List: Collection of objects? Set: Non-duplicated collection Queue: Offer/poll FIFO collection Specialized collection Multi-Map: Key-list of values collection Lock: Cluster wide mutex Topic: Publish/subscribe messaging Concurrency utilities AtomicNumber: Cluster-wide atomic counter IdGenerator: Cluster-wide unique identifier generation Semaphore: Concurrency limitation CountdownLatch: Concurrent activity gate-keeping Listeners: Application notifications as things happen In addition to data storage collections, Hazelcast also features a distributed executor service allowing runnable tasks to be created that can be run anywhere on the cluster to obtain, manipulate, and store results. We could have a number of collections containing source data, then spin up a number of tasks to process the disparate data (for example, averaging or aggregating) and outputting the results into another collection for consumption. Again, just as we could scale up our data capacities by adding more nodes, we can also increase the execution capacity in exactly the same way. This essentially means that by building our data layer around Hazelcast, if our application needs rapidly increase, we can continuously increase the number of nodes to satisfy seemingly extensive demands, all without having to redesign or re-architect the actual application. With Hazelcast, we are dealing more with a technology than a server product, a library to build a system around rather than retrospectively bolting it on, or blindly connecting to an off-the-shelf commercial system. 
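To picture how a key finds its way to a particular node, here's a deliberately simplified Python sketch of hash-based partition ownership. It is not Hazelcast's internal implementation; it only illustrates the idea that every key maps to one partition, and each partition has a current owner that the client library resolves for us:

PARTITION_COUNT = 271  # Hazelcast's default number of partitions

def partition_id(key):
    # Every key hashes to exactly one partition.
    return hash(key) % PARTITION_COUNT

def owner_node(key, nodes):
    # Pretend partitions are spread evenly across the current members;
    # the real cluster rebalances ownership as members join and leave.
    return nodes[partition_id(key) % len(nodes)]

nodes = ["node1", "node2", "node3"]
print(owner_node("some-key", nodes))  # the member our request would be routed to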
While it is possible (and in some simple cases quite practical) to run Hazelcast as a separate server-like cluster and connect to it remotely from our application; some of the greatest benefits come when we develop our own classes and tasks run within it and alongside it. With such a large range of generic capabilities, there is an entire world of problems that Hazelcast can help solve. We can use the technology in many ways; in isolation to hold data such as user sessions, run it alongside a more long-term persistent data store to increase capacity, or shift towards performing high performance and scalable operations on our data. By moving more and more responsibility away from monolithic systems to such a generic scalable one, there is no limit to the performance we can unlock. This will allow us to keep our application and data layers separate, but enabling the ability to scale them up independently as our application grows. This will avoid our application becoming a victim of its own success, while hopefully taking the world by storm. Summary In this article, we learned about Hazelcast. With such a large range of generic capabilities, Hazelcast can solve a world of problems. Resources for Article: Further resources on this subject: JBoss AS Perspective [Article] Drools Integration Modules: Spring Framework and Apache Camel [Article] JBoss RichFaces 3.3 Supplemental Installation [Article]
Installing PostgreSQL

Packt
01 Apr 2015
16 min read
In this article by Hans-Jürgen Schönig, author of the book Troubleshooting PostgreSQL, we will cover what can go wrong during the installation process and what can be done to avoid those things from happening. At the end of the article, you should be able to avoid all of the pitfalls, traps, and dangers you might face during the setup process. (For more resources related to this topic, see here.) For this article, I have compiled some of the core problems that I have seen over the years, as follows: Deciding on a version during installation Memory and kernel issues Preventing problems by adding checksums to your database instance Wrong encodings and subsequent import errors Polluted template databases Killing the postmaster badly At the end of the article, you should be able to install PostgreSQL and protect yourself against the most common issues popping up immediately after installation. Deciding on a version number The first thing to work on when installing PostgreSQL is to decide on the version number. In general, a PostgreSQL version number consists of three digits. Here are some examples: 9.4.0, 9.4.1, or 9.4.2 9.3.4, 9.3.5, or 9.3.6 The last digit is the so-called minor release. When a new minor release is issued, it generally means that some bugs have been fixed (for example, some time zone changes, crashes, and so on). There will never be new features, missing functions, or changes of that sort in a minor release. The same applies to something truly important—the storage format. It won't change with a new minor release. These little facts have a wide range of consequences. As the binary format and the functionality are unchanged, you can simply upgrade your binaries, restart PostgreSQL, and enjoy your improved minor release. When the digit in the middle changes, things get a bit more complex. A changing middle digit is called a major release. It usually happens around once a year and provides you with significant new functionality. If this happens, we cannot just stop or start the database anymore to replace the binaries. If the first digit changes, something really important has happened. Examples of such important events were introductions of SQL (6.0), the Windows port (8.0), streaming replication (9.0), and so on. Technically, there is no difference between the first and the second digit—they mean the same thing to the end user. However, a migration process is needed. The question that now arises is this: if you have a choice, which version of PostgreSQL should you use? Well, in general, it is a good idea to take the latest stable release. In PostgreSQL, every version number following the design patterns I just outlined is a stable release. As of PostgreSQL 9.4, the PostgreSQL community provides fixes for versions as old as PostgreSQL 9.0. So, if you are running an older version of PostgreSQL, you can still enjoy bug fixes and so on. Methods of installing PostgreSQL Before digging into troubleshooting itself, the installation process will be outlined. The following choices are available: Installing binary packages Installing from source Installing from source is not too hard to do. However, this article will focus on installing binary packages only. Nowadays, most people (not including me) like to install PostgreSQL from binary packages because it is easier and faster. Basically, two types of binary packages are common these days: RPM (Red Hat-based) and DEB (Debian-based). Installing RPM packages Most Linux distributions include PostgreSQL. 
However, the shipped PostgreSQL version is somewhat ancient in many cases. Recently, I saw a Linux distribution that still featured PostgreSQL 8.4, a version already abandoned by the PostgreSQL community. Distributors tend to ship older versions to ensure that new bugs are not introduced into their systems. For high-performance production servers, outdated versions might not be the best idea, however. Clearly, for many people, it is not feasible to run long-outdated versions of PostgreSQL. Therefore, it makes sense to make use of repositories provided by the community. The Yum repository shows which distributions we can use RPMs for, at http://yum.postgresql.org/repopackages.php. Once you have found your distribution, the first thing is to install this repository information for Fedora 20 as it is shown in the next listing: yum install http://yum.postgresql.org/9.4/fedora/fedora-20-x86_64/pgdg-fedora94-9.4-1.noarch.rpm Once the repository has been added, we can install PostgreSQL: yum install postgresql94-server postgresql94-contrib /usr/pgsql-9.4/bin/postgresql94-setup initdb systemctl enable postgresql-9.4.service systemctl start postgresql-9.4.service First of all, PostgreSQL 9.4 is installed. Then a so-called database instance is created (initdb). Next, the service is enabled to make sure that it is always there after a reboot, and finally, the postgresql-9.4 service is started. The term database instance is an important concept. It basically describes an entire PostgreSQL environment (setup). A database instance is fired up when PostgreSQL is started. Databases are part of a database instance. Installing Debian packages Installing Debian packages is also not too hard. By the way, the process on Ubuntu as well as on some other similar distributions is the same as that on Debian, so you can directly use the knowledge gained from this article for other distributions. A simple file called /etc/apt/sources.list.d/pgdg.list can be created, and a line for the PostgreSQL repository (all the following steps can be done as root user or using sudo) can be added: deb http://apt.postgresql.org/pub/repos/apt/ YOUR_DEBIAN_VERSION_HERE-pgdg main So, in the case of Debian Wheezy, the following line would be useful: deb http://apt.postgresql.org/pub/repos/apt/ wheezy-pgdg main Once we have added the repository, we can import the signing key: $# wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - OK Voilà! Things are mostly done. In the next step, the repository information can be updated: apt-get update Once this has been done successfully, it is time to install PostgreSQL: apt-get install "postgresql-9.4" If no error is issued by the operating system, it means you have successfully installed PostgreSQL. The beauty here is that PostgreSQL will fire up automatically after a restart. A simple database instance has also been created for you. If everything has worked as expected, you can give it a try and log in to the database: root@chantal:~# su - postgres $ psql postgres psql (9.4.1) Type "help" for help. postgres=# Memory and kernel issues After this brief introduction to installing PostgreSQL, it is time to focus on some of the most common problems. Fixing memory issues Some of the most important issues are related to the kernel and memory. Up to version 9.2, PostgreSQL was using the classical system V shared memory to cache data, store locks, and so on. Since PostgreSQL 9.3, things have changed, solving most issues people had been facing during installation. 
However, in PostgreSQL 9.2 or before, you might have faced the following error message: FATAL: Could not create shared memory segment DETAIL: Failed system call was shmget (key=5432001, size=1122263040, 03600) HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 1122263040 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections. If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for. The PostgreSQL documentation contains more information about shared memory configuration. If you are facing a message like this, it means that the kernel does not provide you with enough shared memory to satisfy your needs. Where does this need for shared memory come from? Back in the old days, PostgreSQL stored a lot of stuff, such as the I/O cache (shared_buffers, locks, autovacuum-related information and a lot more), in the shared memory. Traditionally, most Linux distributions have had a tight grip on the memory, and they don't issue large shared memory segments; for example, Red Hat has long limited the maximum amount of shared memory available to applications to 32 MB. For most applications, this is not enough to run PostgreSQL in a useful way—especially not if performance does matter (and it usually does). To fix this problem, you have to adjust kernel parameters. Managing Kernel Resources of the PostgreSQL Administrator's Guide will tell you exactly why we have to adjust kernel parameters. For more information, check out the PostgreSQL documentation at http://www.postgresql.org/docs/9.4/static/kernel-resources.htm. This article describes all the kernel parameters that are relevant to PostgreSQL. Note that every operating system needs slightly different values here (for open files, semaphores, and so on). Adjusting kernel parameters for Linux In this article, parameters relevant to Linux will be covered. If shmget (previously mentioned) fails, two parameters must be changed: $ sysctl -w kernel.shmmax=17179869184 $ sysctl -w kernel.shmall=4194304 In this example, shmmax and shmall have been adjusted to 16 GB. Note that shmmax is in bytes while shmall is in 4k blocks. The kernel will now provide you with a great deal of shared memory. Also, there is more; to handle concurrency, PostgreSQL needs something called semaphores. These semaphores are also provided by the operating system. The following kernel variables are available: SEMMNI: This is the maximum number of semaphore identifiers. It should be at least ceil((max_connections + autovacuum_max_workers + 4) / 16). SEMMNS: This is the maximum number of system-wide semaphores. It should be at least ceil((max_connections + autovacuum_max_workers + 4) / 16) * 17, and it should have room for other applications in addition to this. SEMMSL: This is the maximum number of semaphores per set. It should be at least 17. SEMMAP: This is the number of entries in the semaphore map. SEMVMX: This is the maximum value of the semaphore. It should be at least 1000. Don't change these variables unless you really have to. Changes can be made with sysctl, as was shown for the shared memory. 
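As a quick sanity check, the semaphore minimums above can be derived from your connection settings. Here is a small helper based on the formulas just listed (the function name is ours, purely for illustration):

import math

def semaphore_minimums(max_connections, autovacuum_max_workers):
    # ceil((max_connections + autovacuum_max_workers + 4) / 16), as stated above.
    sets = math.ceil((max_connections + autovacuum_max_workers + 4) / 16)
    # SEMMNS should also leave room for other applications on the host.
    return {"SEMMNI": sets, "SEMMNS": sets * 17, "SEMMSL": 17}

print(semaphore_minimums(max_connections=100, autovacuum_max_workers=3))
# {'SEMMNI': 7, 'SEMMNS': 119, 'SEMMSL': 17}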
Adjusting kernel parameters for Mac OS X

If you happen to run Mac OS X and plan to run a large system, there are also some kernel parameters that need changes. Again, /etc/sysctl.conf has to be changed. Here is an example:

kern.sysv.shmmax=4194304
kern.sysv.shmmin=1
kern.sysv.shmmni=32
kern.sysv.shmseg=8
kern.sysv.shmall=1024

Mac OS X is somewhat nasty to configure. The reason is that you have to set all five parameters to make this work. Otherwise, your changes will be silently ignored, and this can be really painful. In addition to that, it has to be assured that SHMMAX is an exact multiple of 4096. If it is not, trouble is near.

If you want to change these parameters on the fly, recent versions of OS X provide a sysctl command just like Linux. Here is how it works:

sysctl -w kern.sysv.shmmax
sysctl -w kern.sysv.shmmin
sysctl -w kern.sysv.shmmni
sysctl -w kern.sysv.shmseg
sysctl -w kern.sysv.shmall

Fixing other kernel-related limitations

If you are planning to run a large-scale system, it can also be beneficial to raise the maximum number of open files allowed. To do that, /etc/security/limits.conf can be adapted, as shown in the next example:

postgres   hard   nofile   1024
postgres   soft   nofile   1024

This example says that the postgres user can have up to 1,024 open files per session. Note that this is only important for large systems; open files won't hurt an average setup.

Adding checksums to a database instance

When PostgreSQL is installed, a so-called database instance is created. This step is performed by a program called initdb, which is a part of every PostgreSQL setup. Most binary packages will do this for you and you don't have to do this by hand. Why should you care then? If you happen to run a highly critical system, it could be worthwhile to add checksums to the database instance.

What is the purpose of checksums? In many cases, it is assumed that crashes happen instantly—something blows up and a system fails. This is not always the case. In many scenarios, the problem starts silently. RAM may start to break, or the filesystem may start to develop slight corruption. When the problem surfaces, it may be too late. Checksums have been implemented to fight this very problem. Whenever a piece of data is written or read, the checksum is checked. If this is done, a problem can be detected as it develops.

How can those checksums be enabled? All you have to do is to add -k to initdb (just change your init scripts to enable this during instance creation). Don't worry! The performance penalty of this feature can hardly be measured, so it is safe and fast to enable its functionality. Keep in mind that this feature can really help prevent problems at fairly low costs (especially when your I/O system is lousy).

Preventing encoding-related issues

Encoding-related problems are some of the most frequent problems that occur when people start with a fresh PostgreSQL setup. In PostgreSQL, every database in your instance has a specific encoding. One database might be en_US@UTF-8, while some other database might have been created as de_AT@UTF-8 (which denotes German as it is used in Austria). To figure out which encodings your database system is using, try to run psql -l from your Unix shell. What you will get is a list of all databases in the instance, including their encodings.

So where can we actually expect trouble? Once a database has been created, many people would want to load data into the system. Let's assume that you are loading data into a UTF-8 database.
However, the data you are loading contains some special characters such as ä, ö, and so on. In an old single-byte encoding, the code for ö is 148, but the single byte 148 is not valid UTF-8; in Unicode, U+00F6 is needed. Boom! Your import will fail and PostgreSQL will error out.

If you are planning to load data into a new database, ensure that the encoding or character set of the data is the same as that of the database. Otherwise, you may face ugly surprises. To create a database using the correct locale, check out the syntax of CREATE DATABASE:

test=# \h CREATE DATABASE
Command:     CREATE DATABASE
Description: create a new database
Syntax:
CREATE DATABASE name
    [ [ WITH ] [ OWNER [=] user_name ]
           [ TEMPLATE [=] template ]
           [ ENCODING [=] encoding ]
           [ LC_COLLATE [=] lc_collate ]
           [ LC_CTYPE [=] lc_ctype ]
           [ TABLESPACE [=] tablespace_name ]
           [ CONNECTION LIMIT [=] connlimit ] ]

ENCODING and the LC* settings are used here to define the proper encoding for your new database.

Avoiding template pollution

It is somewhat important to understand what happens during the creation of a new database in your system. The most important point is that CREATE DATABASE (unless told otherwise) clones the template1 database, which is available in all PostgreSQL setups. This cloning has some important implications. If you have loaded a very large amount of data into template1, all of that will be copied every time you create a new database. In many cases, this is not really desirable but happens by mistake. People new to PostgreSQL sometimes put data into template1 because they don't know where else to place new tables and so on. The consequences can be disastrous.

However, you can also use this common pitfall to your advantage. You can place the functions you want in all your databases in template1 (maybe for monitoring or whatever benefits).

Killing the postmaster

After PostgreSQL has been installed and started, many people wonder how to stop it. The most simplistic way is, of course, to use your service postgresql stop or /etc/init.d/postgresql stop init scripts. However, some administrators tend to be a bit crueler and use kill -9 to terminate PostgreSQL. In general, this is not really beneficial because it will cause some nasty side effects. Why is this so?

The PostgreSQL architecture works like this: when you start PostgreSQL, you are starting a process called postmaster. Whenever a new connection comes in, this postmaster forks and creates a so-called backend process (BE). This process is in charge of handling exactly one connection. In a working system, you might see hundreds of processes serving hundreds of users. The important thing here is that all of those processes are synchronized through some common chunk of memory (traditionally, shared memory, and in the more recent versions, mapped memory), and all of them have access to this chunk.

What might happen if a database connection or any other process in the PostgreSQL infrastructure is killed with kill -9? A process modifying this common chunk of memory might die while making a change. The killed process cannot defend itself against the onslaught, so who can guarantee that the shared memory is not corrupted due to the interruption? This is exactly when the postmaster steps in. It notices that one of these backend processes has died unexpectedly. To prevent the potential corruption from spreading, it kills every other database connection, goes into recovery mode, and fixes the database instance.
Then new database connections are allowed again. While this makes a lot of sense, it can be quite disturbing to those users who are connected to the database system. Therefore, it is highly recommended not to use kill -9. A normal kill will be fine. Keep in mind that a kill -9 cannot corrupt your database instance, which will always start up again. However, it is pretty nasty to kick everybody out of the system just because of one process! Summary In this article we have learned how to install PostgreSQL using binary packages. Some of the most common problems and pitfalls, including encoding-related issues, checksums, and versioning were discussed. Resources for Article: Further resources on this subject: Getting Started with PostgreSQL [article] PostgreSQL Cookbook - High Availability and Replication [article] PostgreSQL – New Features [article]
Ease the Chaos with Automated Patching

Packt
02 Apr 2013
19 min read
(For more resources related to this topic, see here.) We have seen how the provisioning capabilities of the Oracle Enterprise Manager's Database Lifecycle Management (DBLM) Pack enable you to deploy fully patched Oracle Database homes and databases, as replicas of the gold copy in the Software Library of Enterprise Manager. However, nothing placed in production should be treated as static. Software changes in development cycles, enhancements take place, or security/functional issues are found. For almost anything in the IT world, new patches are bound to be released. These will also need to be applied to production, testing, reporting, staging, and development environments in the data center on an ongoing basis. For the database side of things, Oracle releases quarterly a combination of security fixes known as the Critical Patch Update (CPU). Other patches are bundled together and released every quarter in the form of a Patch Set Update (PSU), and this also includes the CPU for that quarter. Oracle strongly recommends applying either the PSU or the CPU every calendar quarter. If you prefer to apply the CPU, continue doing so. If you wish to move to the PSU, you can do so, but in that case continue only with the PSU. The quarterly patching requirement, as a direct recommendation from Oracle, is followed by many companies that prefer to have their databases secured with the latest security fixes. This underscores the importance of patching. However, if there are hundreds of development, testing, staging, and production databases in the data center to be patched, the situation quickly turns into a major manual exercise every three months. DBAs and their managers start planning for the patch exercise in advance, and a lot of resources are allocated to make it happen—with the administrators working on each database serially, at times overnight and at times over the weekend. There are a number of steps involved in patching each database, such as locating the appropriate patch in My Oracle Support (MOS), downloading the patch, transferring it to each of the target servers, upgrading the OPATCH facility in each Oracle home, shutting down the databases and listeners running from that home, applying the patch, starting each of the databases in restricted mode, applying any supplied SQL scripts, restarting the databases in normal mode, and checking the patch inventory. These steps have to be manually repeated on every database home on every server, and on every database in that home. Dull repetition of these steps in patching the hundreds of servers in a data center is a very monotonous task, and it can lead to an increase in human errors. To avoid these issues inherent in manual patching, some companies decide not to apply the quarterly patches on their databases. They wait for a year, or a couple of years before they consider patching, and some even prefer to apply year-old patches instead of the latest patches. This is counter-productive and leads to their databases being insecure and vulnerable to attacks, since the latest recommended CPUs from Oracle have not been applied. What then is the solution, to convince these companies to apply patches regularly? If the patching process can be mostly automated (but still under the control of the DBAs), it would reduce the quarterly patching effort to a great extent. 
Companies would then have the confidence that their existing team of DBAs would be able to manage the patching of hundreds of databases in a controlled and automated manner, keeping human error to a minimum. The Database Lifecycle Management Pack of Enterprise Manager Cloud Control 12c is able to achieve this by using its Patch Automation capability. We will now look into Patch Automation and the close integration of Enterprise Manager with My Oracle Support. Recommended patches By navigating to Enterprise | Summary, a Patch Recommendations section will be visible in the lower left-hand corner, as shown in the following screenshot: The graph displays either the Classification output of the recommended patches, or the Target Type output. Currently for this system, more than five security patches are recommended as can be seen in this graph. This recommendation has been derived via a connection to My Oracle Support (the OMS can be connected either directly to the Internet, or by using a proxy server). Target configuration information is collected by the Enterprise Manager Agent and is stored in the Configuration Management Database (CMDB) within the repository. This configuration information is collated regularly by the Enterprise Manager's Harvester process and pushed to My Oracle Support. Thus, configuration information about your targets is known to My Oracle Support, and it is able to recommend appropriate patches as and when they are released. However, the recommended patch engine also runs within Enterprise Manager 12c at your site, working off the configuration data in the CMDB in Enterprise Manager, so recommendations can in fact be achieved without the configuration having been uploaded on MOS by the Harvester (this upload is more useful now for other purposes, such as attaching configuration details during SR creation). It is also possible to get metadata about the latest available patches from My Oracle Support in offline mode, but more manual steps are involved in this case, so Internet connectivity is recommended to get the full benefits of Enterprise Manager's integration with My Oracle Support. To view the details about the patches, click on the All Recommendations link or on the graph itself. This connects to My Oracle Support (you may be asked to log in to your company-specific MOS account) and brings up the list of the patches in the Patch Recommendations section. The database (and other types of) targets managed by the Enterprise Manager system are displayed on the screen, along with the recommended CPU (or other) patches. We select the CPU July patch for our saiprod database. This displays the details about the patch in the section in the lower part of the screen. We can see the list Bugs Resolved by This Patch, the Last Updated date and Size of the patch and also Read Me—which has important information about the patch. The number of total downloads for this patch is visible, as is the Community Discussion on this patch in the Oracle forums. You can add your own comment for this patch, if required, by selecting Reply to the Discussion. Thus, at a glance, you can find out how popular the patch is (number of downloads) and any experience of other Oracle DBAs regarding this patch—whether positive or negative. Patch plan You can view the information about the patch by clicking on the Full Screen button. You can download the patch either to the Software Library in Enterprise Manager or to your desktop. 
Finally, you can directly add this patch to a new or existing patch plan, which we will do next. Go to Add to Plan | Add to New, and enter Plan Name as Sainath_patchplan. Then click on Create Plan. If you would like to add multiple patches to the plan, select both the patches first and then add to the plan. (You can also add patches later to the plan). After the plan is created, click on View Plan. This brings up the following screen: A patch plan is nothing but a collection of patches that can be applied as a group to one or more targets. On the Create Plan page that appears, there are five steps that can be seen in the left-hand pane. By default, the second step appears first. In this step, you can see all the patches that have been added to the plan. It is possible to include more patches by clicking on the Add Patch... button. Besides the ability to manually add a patch to this list, the analysis process may also result in additional patches being added to the plan. If you click on the first step, Plan Information, you can put in a description for this plan. You can also change the plan permissions, either Full or View, for various Enterprise Manager roles. Note that the Full permission allows the role to validate the plan, however, the View permission does not allow validation. Move to step 3, Deployment Options. The following screen appears. Out-of-place patching A new mechanism for patching has been provided in the Enterprise Manager Cloud Control 12c version, known as out-of-place patching. This is now the recommended method and creates a new Oracle home which is then patched while the previous home is still operational. All this is done using an out of the box deployment procedure in Enterprise Manager. Using this mechanism means that the only downtime will take place when the databases from the previous home are switched to run from the new home. If there is any issue with the database patch, you can switch back to the previous unpatched home since it is still available. So, patch rollback is a lot faster. Also, if there are multiple databases running in the previous home, you can decide which ones to switch to the new patched home. This is obviously an advantage, otherwise you would be forced to simultaneously patch all the databases in a home. A disadvantage of this method would be the space requirements for a duplicate home. Also, if proper housekeeping is not carried out later on, it can lead to a proliferation of Oracle homes on a server where patches are being applied regularly using this mechanism. This kind of selective patching and minimal downtime is not possible if you use the previously available method of in-place patching, which uses a separate deployment procedure to shut down all databases running from an Oracle home before applying the patches on the same home. The databases can only be restarted normally after the patching process is over, and this obviously takes more downtime and affects all databases in a home. Depending on the method you choose, the appropriate deployment procedure will be automatically selected and used. We will now use the out-of-place method in this patch plan. On the Step 3: Deployment Options page, make sure the Out of Place (Recommended) option is selected. Then click on Create New Location. Type in the name and location of the new Oracle home, and click on the Validate button. This checks the Oracle home path on the Target server. After this is done, click on the Create button. 
The deployment options of the patch plan are successfully updated, and the new home appears on the Step 3 page. Click on the Credentials tab. Here you need to select or enter the normal and privileged credentials for the Oracle home. Click on the Next button. This moves us to step 4, the Validation step.

Pre-patching analysis

Click on the Analyze button. A job that performs the pre-patching analysis starts in the background. It compares the software and patches installed on the targets with the new patches you have selected in your plan and attempts to validate them. The validation may take a few minutes to complete, since it also checks the Oracle home for readiness, computes the space requirements for the home, and performs other checks, such as cluster node connectivity if you are patching a RAC database.

If you drill down to the analysis job itself by clicking on Show Detailed Progress here, you can see that it runs a number of checks: it validates whether the targets are supported for patching, verifies the normal and super user credentials for the Oracle home, verifies the target tools, commands, and permissions, upgrades OPatch to the latest version, stages the selected patches to the Oracle homes, and then runs the prerequisite checks, including those for cloning an Oracle home. If the prerequisite checks succeed, the analysis job skips the remaining steps and stops at this point with a successful status. The patch is then shown as Ready for Deployment.

If there are any issues, they show up at this point. For example, if there is a conflict with any of the patches, a replacement patch or a merge patch may be suggested. If no replacement or merge patch exists and you want to request one, you can do so directly from the screen. If you are applying a PSU and the CPU for the same release (for example, the July 2011 CPU) is already applied to the Oracle home, then, because the PSU is a superset of the CPU, the MOS analysis will stop and report that the existing patch already fixes the issues. Such a message can be seen in the Informational Messages section of the Validation page.

Deployment

In our case, the patch is Ready for Deployment. At this point, you can move directly to step 5, Review & Deploy, by clicking on it in the left-hand pane. On the Review & Deploy page, the patch plan is described in detail along with the Impacted Targets. Besides the database in the patch plan, a new impacted target has been found by the analysis process and added to the list: the listener running from the home that is to be cloned and patched. The patches to be applied are also listed on this review page; in our case, the CPUJUL2011 patch is shown with the status Conflict Free.

The deployment procedure that will be used is Clone and Patch Oracle Database, since out-of-place patching is being used and all instances and listeners running in the previous Oracle home are being switched to the new home. Click on the Prepare button. The status on the screen changes to Preparation in Progress, and a job that prepares the out-of-place patching starts; it clones the original Oracle home and applies the patches to the cloned home. No downtime is required while this job is running; it can happen in the background. This preparation phase acts as a pre-deploy and is only possible with out-of-place patching; with in-place patching there is no Prepare button and you deploy straight away.
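For reference, what the Prepare job automates here corresponds roughly to the manual OPatch workflow a DBA would otherwise follow against the cloned home: check the staged patches for conflicts, then apply them while the original home keeps serving the databases. The following is a minimal sketch of that manual workflow only, not the Clone and Patch Oracle Database deployment procedure itself; the paths are hypothetical, and the exact OPatch options should be confirmed for your OPatch version (opatch -help):

```python
import os
import subprocess

CLONED_HOME = "/u01/app/oracle/product/11.2.0/dbhome_2"   # hypothetical cloned home
STAGE_DIR = "/u01/stage/patches"                          # hypothetical staging area of unzipped patches
OPATCH = os.path.join(CLONED_HOME, "OPatch", "opatch")

def run(cmd, cwd=None):
    """Echo and run a command, returning its exit code."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd, cwd=cwd).returncode

# List what is already installed in the cloned home.
run([OPATCH, "lsinventory", "-oh", CLONED_HOME])

# Check all staged patches for conflicts against the cloned home.
rc = run([OPATCH, "prereq", "CheckConflictAgainstOHWithDetail",
          "-phBaseDir", STAGE_DIR, "-oh", CLONED_HOME])

if rc == 0:
    # Apply each staged patch from inside its own directory (OPatch may prompt interactively).
    for patch_dir in sorted(os.listdir(STAGE_DIR)):
        run([OPATCH, "apply", "-oh", CLONED_HOME], cwd=os.path.join(STAGE_DIR, patch_dir))
else:
    print("Resolve the reported conflicts before patching the cloned home.")
```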
Clicking on Show Detailed Progress here opens a new window showing the job details. When the preparation job has completed successfully (after about two hours in our virtual machine), we can see that it clones the Oracle home, applies the patches to the new home, validates the patches, runs the post-patch scripts, and then skips all the remaining steps. It also collects the target properties for the Oracle home in order to refresh the configuration information in Enterprise Manager. The Review & Deploy page now shows Preparation Successful!. The plan is now ready to be deployed.

Click on the Deploy button. The status on the screen changes to Deployment in Progress, and a job that deploys the out-of-place patching starts. At this point downtime is required, since the database instances using the previous Oracle home are shut down and switched across. The deploy job completes successfully (after about 21 minutes in our virtual machine); we can see that it works iteratively over the list of hosts and Oracle homes in the patch plan. It starts a blackout for the database instances in the Oracle home (so that no alerts are raised), stops the instances, migrates them to the cloned Oracle home, starts them in upgrade mode, applies SQL scripts to patch each instance, applies the post-SQL scripts, and then restarts the database in normal mode. The deploy job applies further SQL scripts and recompiles invalid objects (except in the case of patch sets). It then migrates the listener from the previous Oracle home using the Network Configuration Assistant (NetCA), updates the target properties, stops the blackout, and detaches the previous Oracle home. Finally, the configuration information of the cloned Oracle home is refreshed. The Review & Deploy page of the patch plan now shows the status Deployment Successful!, as can be seen in the following screenshot:

Plan template

On the Deployment Successful page, you can click on Save as Template at the bottom of the screen to save the patch plan as a plan template. A patch plan must have been successfully analyzed and be deployable, or have been successfully deployed, before it can be saved as a template. The plan template created in this way does not include any targets, so it can be used to apply the successful patch plan to multiple other targets. Inside the plan template, the Create Plan button is used to create a new plan based on the template, and this can be done repeatedly for multiple targets.

Go to Enterprise | Provisioning and Patching | Patches & Updates; this screen displays a list of all the patch plans and plan templates that have been created. The successfully deployed Sainath_patchplan and the new plan template also show up here. To see a list of the saved patches in the Software Library, go to Enterprise | Provisioning and Patching | Saved Patches. This brings up the following screen:

This page also allows you to upload patches manually to the Software Library. This is mostly used when there is no connection to the Internet (either direct or via a proxy server) from the Enterprise Manager OMS servers, and you consequently need to download the patches manually.
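In such an offline setup, the patch recommendation metadata also has to be brought in by hand: you download the catalog file from My Oracle Support on a machine that does have Internet access and import it into Enterprise Manager with EMCLI. The following is a minimal sketch of that import, assuming the import_update_catalog EMCLI verb and the em_catalog.zip file name used by the offline procedure; verify the exact verb and options for your release with emcli help import_update_catalog:

```python
import subprocess

# Hypothetical location of the catalog file downloaded from My Oracle Support
# on an Internet-connected workstation and copied to the OMS host.
CATALOG_ZIP = "/u01/stage/em_catalog.zip"

# Log in to EMCLI first (the administrator name is a placeholder; you will be prompted for the password).
subprocess.run(["emcli", "login", "-username=SYSMAN"], check=True)

# Import the catalog; -omslocal indicates the file already resides on the OMS host.
subprocess.run(["emcli", "import_update_catalog",
                f"-file={CATALOG_ZIP}", "-omslocal"], check=True)
```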
For more details on setting up the offline mode and downloading the patch recommendations and the latest patch information as XML files from My Oracle Support, please refer to Oracle Enterprise Manager Lifecycle Management Administrator's Guide 12c Release 2 (12.1.0.2) at the following URL:

http://docs.oracle.com/cd/E24628_01/em.121/e27046/pat_mosem_new.htm#BABBIEAI

Patching roles

Enterprise Manager Cloud Control 12c supplies out-of-the-box administrator roles specifically for patching: EM_PATCH_ADMINISTRATOR, EM_PATCH_DESIGNER, and EM_PATCH_OPERATOR. You need to grant these roles to the appropriate administrators. Move to Setup | Security | Roles and, on this page, search for the roles specifically meant for patching. The three roles appear as follows:

The EM_PATCH_ADMINISTRATOR role can create, edit, deploy, or delete any patch plan and can also grant privileges to other administrators after creating them. This role has full privileges on any patch plan or plan template in the Enterprise Manager system and maintains the patching infrastructure.

The EM_PATCH_DESIGNER role normally identifies the patches to be used in the patching cycle across development, testing, and production; in real life this would be the senior DBA. The patch designer creates patch plans and plan templates, and grants privileges on these plan templates to the EM_PATCH_OPERATOR role. As an example, the patch designer selects a set of recommended and manually chosen patches for an Oracle 11g database and creates a patch plan. The designer then tests the patching process in a development environment, saves the successfully analyzed or deployed patch plan as a plan template, and publishes the Oracle 11g database patching plan template to the patch operator, typically the junior DBA or application DBA.

Next, the patch operator creates new patch plans from the template (but cannot create a template) and adds a different list of targets, such as other Oracle 11g databases in the test, staging, or production environments. This role then schedules the deployment of the patches to all these environments, using the same template again and again.
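Granting these roles can be done from the console page mentioned above, or scripted with EMCLI when many administrators are involved. The sketch below assumes the grant_roles EMCLI verb and uses placeholder administrator names; confirm the verb and its parameters for your release with emcli help grant_roles:

```python
import subprocess

# Placeholder administrator-to-role mapping; substitute your own Enterprise Manager users.
assignments = {
    "SENIOR_DBA": "EM_PATCH_DESIGNER",
    "JUNIOR_DBA": "EM_PATCH_OPERATOR",
}

# Log in to EMCLI first (the administrator name is a placeholder; you will be prompted for the password).
subprocess.run(["emcli", "login", "-username=SYSMAN"], check=True)

for admin, role in assignments.items():
    # Grant the patching role to the administrator.
    subprocess.run(["emcli", "grant_roles",
                    f"-name={admin}", f"-roles={role}"], check=True)
```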
Summary

Enterprise Manager Cloud Control 12c allows you to automate the tedious patching procedures used in many organizations today to patch their Oracle databases and servers. This is achieved via the Database Lifecycle Management Pack, one of the main licensable packs of Enterprise Manager. Sophisticated deployment procedures are provided out of the box to fulfill many different types of patching tasks, helping you achieve mass patching of multiple targets with multiple patches in a fully automated manner and saving a great deal of administrative time and effort. Some companies have estimated savings of up to 98 percent in patching tasks in their data centers. Different types of patches can be applied in this manner, including CPUs, PSUs, patch sets, and other one-off patches, and different database versions are supported, such as 9i, 10g, and 11g. For the first time, the upgrade of single-instance databases is also possible via Enterprise Manager Cloud Control 12c.

There is full integration of the patching capabilities of Enterprise Manager with My Oracle Support (MOS). The support site retains the configuration of all the components managed by Enterprise Manager inside the company. Since the current version and patch information of the components is known, My Oracle Support is able to provide appropriate patch recommendations for many targets, including the latest security fixes. This ensures that the company stays up to date with regard to security protection.

A full division of roles is available, such as Patch Administrator, Designer, and Operator. It is possible to take the My Oracle Support recommendations, select patches for targets, put them into a patch plan, deploy the patch plan, and then create a plan template from it. The template can then be published to any operator, who can create their own patch plans for other targets. In this way patching can be tested, verified, and then pushed to production.

In all, Enterprise Manager Cloud Control 12c offers valuable automation for mass patching, allowing administrators to ensure that their systems have the latest security patches and to control the application of patches on development, test, and production servers from the centralized location of the Software Library.