
How-To Tutorials - Data

1210 Articles

Regular Expressions in Python 2.6 Text Processing

Packt
13 Jan 2011
17 min read
Python 2.6 Text Processing: Beginners Guide

Simple string matching

Regular expressions are notoriously hard to read, especially if you're not familiar with the obscure syntax. For that reason, let's start simple and look at some easy regular expressions at the most basic level. Before we begin, remember that Python raw strings allow us to include backslashes without the need for additional escaping. Whenever you define regular expressions, you should do so using the raw string syntax.

Time for action – testing an HTTP URL

In this example, we'll check values as they're entered via the command line as a means to introduce the technology. We'll dive deeper into regular expressions as we move forward. We'll be scanning URLs to ensure our end users inputted valid data.

Create a new file and name it url_regex.py. Enter the following code:

```python
import sys
import re

# Make sure we have a single URL argument.
if len(sys.argv) != 2:
    print >>sys.stderr, "URL Required"
    sys.exit(-1)

# Easier access.
url = sys.argv[1]

# Ensure we were passed a somewhat valid URL.
# This is a superficial test.
if re.match(r'^https?:/{2}\w.+$', url):
    print "This looks valid"
else:
    print "This looks invalid"
```

Now, run the example script a few times, passing various different values to it on the command line:

```
(text_processing)$ python url_regex.py http://www.jmcneil.net
This looks valid
(text_processing)$ python url_regex.py http://intranet
This looks valid
(text_processing)$ python url_regex.py http://www.packtpub.com
This looks valid
(text_processing)$ python url_regex.py https://store
This looks valid
(text_processing)$ python url_regex.py httpsstore
This looks invalid
(text_processing)$ python url_regex.py https:??store
This looks invalid
```

What just happened?

We took a look at a very simple pattern and introduced you to the plumbing needed to perform a match test. Let's walk through this little example, skipping the boilerplate code.

First of all, we imported the re module. The re module, as you probably inferred from the name, contains all of Python's regular expression support. Any time you need to work with regular expressions, you'll need to import the re module.

Next, we read a URL from the command line and bind a temporary attribute, which makes for cleaner code. Directly below that, you should notice a line that reads re.match(r'^https?:/{2}\w.+$', url). This line checks whether the string referenced by the url attribute matches the ^https?:/{2}\w.+$ pattern. If a match is found, we print a success message; otherwise, the end user receives some negative feedback indicating that the input value is incorrect.

This example leaves out a lot of details regarding HTTP URL formats. If you were performing validation on user input, one place to look would be http://formencode.org/. FormEncode is an HTML form-processing and data-validation framework written by Ian Bicking.

Understanding the match function

The most basic method of testing for a match is via the re.match function, as we did in the previous example. The match function takes a regular expression pattern and a string value. For example, consider the following snippet of code:

```
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.match(r'pattern', 'pattern')
<_sre.SRE_Match object at 0x1004811d0>
>>>
```

Here, we simply passed a regular expression of "pattern" and a string literal of "pattern" to the re.match function. As they were identical, the result was a match. The returned Match object indicates the match was successful. The re.match function returns None otherwise:

```
>>> re.match(r'pattern', 'failure')
>>>
```
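In real code, rather than printing inside the conditional, you would often wrap the test in a small helper and branch on its result. A minimal sketch (is_probable_url is our own illustrative name, not part of the book's example):

```python
import re

def is_probable_url(url):
    """Superficial test: does url look like an HTTP(S) URL?"""
    return re.match(r'^https?:/{2}\w.+$', url) is not None

print is_probable_url('http://www.jmcneil.net')  # True
print is_probable_url('httpsstore')              # False
```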
Learning basic syntax

A regular expression is generally a collection of literal string data and special metacharacters that represents a pattern of text. The simplest regular expression is just literal text that only matches itself. In addition to literal text, there is a series of special characters that can be used to convey additional meaning, such as repetition, sets, wildcards, and anchors. Generally, the punctuation characters field this responsibility.

Detecting repetition

When building up expressions, it's useful to be able to match certain repeating patterns without needing to duplicate values. It's also beneficial to perform conditional matches. This lets us check for content such as "match the letter a, followed by the number one at least three times, but no more than seven times." For example, the code below does just that:

```
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.match(r'^a1{3,7}$', 'a1111111')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^a1{3,7}$', '1111111')
>>>
```

If the repetition operator follows a valid regular expression enclosed in parentheses, it will perform repetition on that entire expression. For example:

```
>>> re.match(r'^(a1){3,7}$', 'a1a1a1')
<_sre.SRE_Match object at 0x100493918>
>>> re.match(r'^(a1){3,7}$', 'a11111')
>>>
```

The following table details all of the special characters that can be used for marking repeating values within a regular expression.
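The most common repetition operators are ? (zero or one), * (zero or more), + (one or more), and {m,n} (at least m, at most n). A quick demonstration of our own, separate from the book's examples:

```python
import re

# ? -> optional, * -> any number, + -> at least one, {m} -> exactly m.
print re.match(r'^ab?c$', 'ac')      # match: 'b' may be absent
print re.match(r'^ab*c$', 'abbbc')   # match: any number of 'b's
print re.match(r'^ab+c$', 'ac')      # None: '+' requires at least one 'b'
print re.match(r'^ab{2}c$', 'abbc')  # match: exactly two 'b's
```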
Specifying character sets and classes

In some circumstances, it's useful to collect groups of characters into a set such that any of the values in the set will trigger a match. It's also useful to match any character at all; the dot operator does just that.

A character set is enclosed within standard square brackets. A set defines a series of alternating (or) entities that will match a given text value. If the first character within a set is a caret (^), then a negation is performed: all characters not defined by that set would then match.

There are a couple of additional interesting set properties. For ranged values, it's possible to specify an entire selection using a hyphen. For example, '[0-6a-d]' would match all values between 0 and 6, and a and d. Special characters listed within brackets lose their special meaning. The exceptions to this rule are the hyphen and the closing bracket. If you need to include a closing bracket or a hyphen within a regular expression, you can either place them as the first elements in the set or escape them by preceding them with a backslash.

As an example, consider the following snippet, which matches a string containing a hexadecimal number:

```
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.match(r'^0x[a-f0-9]+$', '0xff')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^0x[a-f0-9]+$', '0x01')
<_sre.SRE_Match object at 0x1004816b0>
>>> re.match(r'^0x[a-f0-9]+$', '0xz')
>>>
```

In addition to the bracket notation, Python ships with some predefined classes. Generally, these are letter values prefixed with a backslash escape. When they appear within a set, the set includes all values for which they'll match. The \d escape matches all digit values, so it would have been possible to write the above example in a slightly more compact manner:

```
>>> re.match(r'^0x[a-f\d]+$', '0x33')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^0x[a-f\d]+$', '0x3f')
<_sre.SRE_Match object at 0x1004816b0>
>>>
```

The following table outlines all of the character sets and classes available. One thing that should become apparent is that lowercase classes are matches, whereas their uppercase counterparts are the inverse (for example, \d matches a digit, while \D matches anything but a digit).

Applying anchors to restrict matches

There are times when it's important that patterns match at a certain position within a string of text. Why is this important? Consider a simple number validation test. If a user enters a digit, but mistakenly includes a trailing letter, an expression checking for the existence of a digit alone will pass:

```
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.match(r'\d', '1f')
<_sre.SRE_Match object at 0x1004811d0>
>>>
```

Well, that's unexpected. The regular expression engine sees the leading '1' and considers it a match. It disregards the rest of the string, as we've not instructed it to do anything else with it. To fix the problem that we have just seen, we need to apply anchors:

```
>>> re.match(r'^\d$', '6')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^\d$', '6f')
>>>
```

Now, attempting to sneak in a non-digit character results in no match. By preceding our expression with a caret (^) and terminating it with a dollar sign ($), we effectively said "between the start and the end of this string, there can only be one digit." Anchors, among various other metacharacters, are considered zero-width matches. Basically, this means that a match doesn't advance the regular expression engine within the test string. We're not limited to either end of a string, either. Here's a collection of all of the available anchors provided by Python.
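Beyond ^ and $, the anchors you'll reach for most often are \b (word boundary), \A, and \Z. A short sketch of our own showing them in action:

```python
import re

# \b is a zero-width word boundary; \A and \Z anchor to the absolute
# start and end of the string, like ^ and $ outside of MULTILINE mode.
print re.search(r'\bcat\b', 'the cat sat')    # matches the word 'cat'
print re.search(r'\bcat\b', 'concatenate')    # None: no word boundary
print re.match(r'\Adigit: \d\Z', 'digit: 5')  # anchored at both ends
```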
Wrapping it up

Now that we've covered the basics of regular expression syntax, let's double back and take a look at the expression we used in our first example. It might be a bit easier if we break it down piece by piece.

We begin the regular expression with a caret, which matches the beginning of the string. The very next element is the literal http. As our caret matches the start of a string and must be immediately followed by http, this is equivalent to saying that our string must start with http. Next, we include a question mark after the s in https. The question mark states that the previous entity should be matched either zero or one time. By default, the evaluation engine is looking character-by-character, so the previous entity in this case is simply "s". We do this so our test passes for both secure and non-secure addresses.

As we advance forward in our string, the next special term we run into is {2}, and it follows a simple forward slash. This says that the forward slash should appear exactly two times. Now, in the real world, it would probably make more sense to simply type the second slash. Using the repetition check like this not only requires more typing, but it also causes the regular expression engine to work harder.

Immediately after the repetition match, we include a \w. The \w, if you'll remember from the previous tables, expands to [0-9a-zA-Z_], or any word character. This is to ensure that our URL doesn't begin with a special character. The dot character after the \w matches anything except a newline. Essentially, we're saying "match anything else; we don't so much care what." The plus sign states that the preceding wildcard should match at least once. Finally, we anchor the end of the string. However, in this example, this isn't really necessary.

Have a go hero – tidying up our URL test

There are a few intentional inconsistencies and problems with this regular expression as designed. To name a few:

- Properly formatted URLs should only contain a few special characters. Other values should be URL-encoded using percent escapes. This regular expression doesn't check for that.
- It's possible to include newline characters towards the end of the URL, which is clearly not supported by any browsers!
- The \w followed by the .+ implicitly sets a minimum limit of two characters after the protocol specification. A single letter is perfectly valid.

Using what we've covered thus far, it should be possible for you to backtrack and update our regular expression in order to fix these flaws; one possible revision follows this list. For more information on what characters are allowed, have a look at http://www.w3schools.com/tags/ref_urlencode.asp.
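One possible (still superficial) revision of ours addresses all three flaws by restricting the character set and using \A/\Z so a trailing newline can't sneak past the anchor. Treat it as a sketch rather than a definitive validator:

```python
import re

# [\w.-]+ allows a one-letter host; the character classes exclude
# whitespace (and therefore newlines) and most unencoded specials.
URL_RE = re.compile(r'\Ahttps?://[\w.-]+(/[\w.%/-]*)?\Z')

for candidate in ('https://store', 'http://a', 'http://bad\nhost'):
    print repr(candidate), bool(URL_RE.match(candidate))
```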
""" line = line.strip() match = re.match(self._re, line) if not match: raise ParsingError("Malformed line: " + line) return { 'size': 0 if match.group(6) == '-' else int(match.group(6)), 'status': match.group('rcode'), 'file_requested': match.group(5).split()[1] } Running the logscan application should now produce the same output as it did when we were using a more basic, split-based approach. (text_processing)$ cat example3.log | logscan -c logscan.cfg What just happened? First of all, we imported the re module so that we have access to Python's regular expression services. Next, at the LogProcessor class level, we defined a regular expression. Though, this time we did so via re.compile rather than a simple string. Regular expressions that are used more than a handful of times should be "prepared" by running them through re.compile first. This eases the load placed on the system by frequently used patterns. The re.compile function returns a SRE_Pattern object that can be passed in just about anywhere you can pass in a regular expression. We then replace our split method to take advantage of regular expressions. As you can see, we simply pass self._re in as opposed to a string-based regular expression. If we don't have a match, we raise a ParsingError, which bubbles up and generates an appropriate error message, much like we would see on an invalid split case. Now, the end of the split method probably looks somewhat peculiar to you. Here, we've referenced our matched values via group identification mechanisms rather than by their list index into the split results. Regular expression components surrounded by parenthesis create a group, which can be accessed via the group method on the Match object later down the road. It's also possible to access a previously matched group from within the same regular expression. Let's look at a somewhat smaller example. >>> match = re.match(r'(0x[0-9a-f]+) (?P<two>1)', '0xff 0xff') >>> match.group(1) '0xff' >>> match.group(2) '0xff' >>> match.group('two') '0xff' >>> match.group('failure') Traceback (most recent call last): File "<stdin>", line 1, in <module> IndexError: no such group >>> Here, we surround two distinct regular expressions components with parenthesis, (0x[0-9a-f]+), and (?P&lttwo>1). The first regular expression matches a hexadecimal number. This becomes group ID 1. The second expression matches whatever was found by the first, via the use of the 1. The "backslash-one" syntax references the first match. So, this entire regular expression only matches when we repeat the same hexadecimal number twice, separated with a space. The ?P&lttwo> syntax is detailed below. As you can see, the match is referenced after-the-fact using the match.group method, which takes a numeric index as its argument. Using standard regular expressions, you'll need to refer to a matched group using its index number. However, if you'll look at the second group, we added a (?P&ltname>) construct. This is a Python extension that lets us refer to groupings by name, rather than by numeric group ID. The result is that we can reference groups of this type by name as opposed to simple numbers. Finally, if an invalid group ID is passed in, an IndexError exception is thrown. The following table outlines the characters used for building groups within a Python regular expression: Finally, it's worth pointing out that parenthesis can also be used to alter priority as well. For example, consider this code. 
Finally, it's worth pointing out that parentheses can also be used to alter priority. For example, consider this code:

```
>>> re.match(r'abc{2}', 'abcc')
<_sre.SRE_Match object at 0x1004818b8>
>>> re.match(r'a(bc){2}', 'abcc')
>>> re.match(r'a(bc){2}', 'abcbc')
<_sre.SRE_Match object at 0x1004937b0>
>>>
```

Whereas the first example matches c exactly two times, the second and third lines require us to repeat bc twice. This changes the meaning of the regular expression from "repeat the previous character twice" to "repeat the previous match within parentheses twice." The value within the group could have been its own complex regular expression, such as a([b-c]){2}.

Have a go hero – updating our stats processor to use named groups

Spend a couple of minutes and update our statistics processor to use named groups rather than integer-based references. This makes it slightly easier to read the assignment code in the split method. You do not need to create names for all of the groups; naming just the ones we're actually using will do.

Using greedy versus non-greedy operators

Regular expressions generally like to match as much text as possible before giving up or yielding to the next token in a pattern string. If that behavior is unexpected and not fully understood, it can be difficult to get your regular expression correct. Let's take a look at a small code sample to illustrate the point.

Suppose that, with your newfound knowledge of regular expressions, you decided to write a small script to remove the angled brackets surrounding HTML tags. You might be tempted to do it like this:

```
>>> match = re.match(r'(?P<tag><.+>)', '<title>Web Page</title>')
>>> match.group('tag')
'<title>Web Page</title>'
>>>
```

The result is probably not what you expected. We got it because regular expressions are greedy by nature: they'll attempt to match as much as possible. If you look closely, <title> is a match for the supplied regular expression, but so is the entire <title>Web Page</title> string. Both start with an angled bracket, contain at least one character, and end with an angled bracket.

The fix is to insert the question mark character, or the non-greedy operator, directly after the repetition specification. The following code snippet fixes the problem:

```
>>> match = re.match(r'(?P<tag><.+?>)', '<title>Web Page</title>')
>>> match.group('tag')
'<title>'
>>>
```

The question mark changes our meaning from "match as much as you possibly can" to "match only the minimum required to actually match."
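The same non-greedy pattern pays off with re.findall and re.sub, which is often how you would actually collect or strip tags. A quick sketch of ours:

```python
import re

html = '<title>Web Page</title>'
# .+? stops at the first closing bracket rather than the last one.
print re.findall(r'<.+?>', html)  # ['<title>', '</title>']
print re.sub(r'<.+?>', '', html)  # 'Web Page' - markup removed
```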


Microsoft SQL Server 2008 High Availability: Installing Database Mirroring

Packt
11 Jan 2011
4 min read
Microsoft SQL Server 2008 High Availability

Minimize downtime, speed up recovery, and achieve the highest level of availability and reliability for SQL Server applications by mastering the concepts of database mirroring, log shipping, clustering, and replication.

- Install various SQL Server High Availability options in a step-by-step manner
- A guide to SQL Server High Availability for DBA aspirants, proficient developers, and system administrators
- Learn the pre- and post-installation concepts and common issues you come across while working on SQL Server High Availability
- Tips to enhance performance with SQL Server High Availability
- External references for further study

Introduction

First, let's briefly see what Database Mirroring is. Database Mirroring is an option that can be used to increase the availability of a SQL Server database as a standby, so that it can serve as an alternate production server in the case of any emergency. As its name suggests, mirroring stands for making an exact copy of the data. Mirroring can be done onto a disk, a website, or somewhere else. Now let's move on to the topic of this article – the installation of Database Mirroring.

Preparing for Database Mirroring

Before we move forward, we shall prepare the database for Database Mirroring. Here are the steps:

1. The first step is to ensure that the database is in Full Recovery mode. You can set the mode to "Full Recovery" with an ALTER DATABASE statement (see the sketch after this list).
2. Execute the backup command, followed by the transaction log backup command. I have run the RESTORE VERIFYONLY command after the backup completes; this command ensures the validity of a backup file, and it is recommended to always verify the backup.
3. As we now have a full database backup and a log backup file, move them over to the Mirror server that we have identified.
4. We will now perform the database restoration, followed by the restore log command with NORECOVERY. It is necessary to use the NORECOVERY option so that additional log backups or transactions can be applied.
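Taken together, the preparation commands look something like the following T-SQL sketch (AdventureWorks and the file paths are illustrative placeholders, not names from the article):

```sql
-- Put the database into the Full Recovery model.
ALTER DATABASE AdventureWorks SET RECOVERY FULL;

-- Take a full backup, then a transaction log backup.
BACKUP DATABASE AdventureWorks TO DISK = 'C:\Backups\AdventureWorks.bak';
BACKUP LOG AdventureWorks TO DISK = 'C:\Backups\AdventureWorks_log.trn';

-- Confirm the backup file is intact before moving it.
RESTORE VERIFYONLY FROM DISK = 'C:\Backups\AdventureWorks.bak';

-- On the mirror server: restore WITH NORECOVERY so log backups can apply.
RESTORE DATABASE AdventureWorks
    FROM DISK = 'C:\Backups\AdventureWorks.bak' WITH NORECOVERY;
RESTORE LOG AdventureWorks
    FROM DISK = 'C:\Backups\AdventureWorks_log.trn' WITH NORECOVERY;
```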
Installing Database Mirroring

As the database that we want to participate in Database Mirroring is now ready, we can move on with the actual installation process.

Right-click on the database we want to mirror and select Tasks | Mirror.... This opens the mirroring configuration screen. To start with the actual setup, click on the Configure Security... button. In this dialog box, select the No option, as we are not including the Witness Server at this stage and will be performing this task later. In the next dialog box, connect to the Principal Server, specify the Listener Port and Endpoint name, and click Next. We are then asked to configure the same properties for the Mirror Server: the Listener Port and Endpoint name.

In this step, the installation wizard asks us to specify the service account that will be used by the Database Mirroring operation. If you are using the local system account as a service account, you must use certificates for authentication. Generally, certificates are used by websites to assure their users that their information is secured. Certificates are digital documents that store a digital signature or the identity information of the holder for authenticity purposes. They help ensure that information sent over the Internet/intranet/VPN, or stored at the server, is safe.

Certificates are installed at the servers, either by obtaining them from providers such as Thawte (http://www.thawte.com) or self-issued by the Database Administrator or Chief Information Officer of the company using the httpcfg.exe utility. The same is true for SQL Server: SQL Server uses certificates to ensure that information is secured, and these certificates can be self-issued using httpcfg.exe or obtained from an issuing authority.

In the next dialog box, make sure that the configuration details we have furnished are valid. Ensure that the names of the Principal and Mirror Server, the Endpoints, and the port number are correct. Click Finish, and ensure that the setup wizard returns a success report at the end.


Getting Started with Python 2.6 Text Processing

Packt
10 Jan 2011
14 min read
Python 2.6 Text Processing: Beginners Guide — the easiest way to learn text processing with Python.

- Deals with the most important textual data formats you will encounter
- Learn to use the most popular text processing libraries available for Python
- Packed with examples to guide you through

Categorizing types of text data

Textual data comes in a variety of formats. For our purposes, we'll categorize text into three very broad groups. Breaking the problem down into segments helps us to understand it a bit better, and subsequently choose a parsing approach. Each one of these sweeping groups can be further broken down into more detailed chunks. One thing to remember when working your way through the book is that text content isn't limited to the Latin alphabet. This is especially true when dealing with data acquired via the Internet.

Providing information through markup

Structured text includes formats such as XML and HTML. These formats generally consist of text content surrounded by special symbols or markers that give extra meaning to a file's contents. These additional tags are usually meant to convey information to the processing application and to arrange information in a tree-like structure. Markup allows a developer to define his or her own data structure, yet rely on standardized parsers to extract elements. For example, consider the following contrived HTML document:

```html
<html>
  <head>
    <title>Hello, World!</title>
  </head>
  <body>
    <p>
      Hi there, all of you earthlings.
    </p>
    <p>
      Take us to your leader.
    </p>
  </body>
</html>
```

In this example, our document's title is clearly identified because it is surrounded by opening and closing <title> and </title> elements. Note that although the document's tags give each element a meaning, it's still up to the application developer to understand what to do with a title object or a p element. Notice that while the document still has meaning to us humans, it is also laid out in such a way as to make it computer friendly.

One interesting aspect of these formats is that it's possible to embed references to validation rules as well as the actual document structure. This is a nice benefit in that we're able to rely on the parser to perform markup validation for us. This makes our job much easier, as it's possible to trust that the input structure is valid.

Meaning through structured formats

Text data that falls into this category includes things such as configuration files, marker-delimited data, e-mail message text, and JavaScript Object Notation (JSON) web data. Content within this second category does not contain explicit markup the way XML and HTML do, but the structure and formatting are required, as they convey meaning and information about the text to the parsing application. For example, consider the format of a Windows INI file or a Linux system's /etc/hosts file. There are no tags, but the column on the left clearly means something other than the column on the right. Python provides a collection of modules and libraries intended to help us handle popular formats from this category.
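For instance, the standard library's ConfigParser module reads INI-style files directly. Here's a minimal sketch of ours (the section and option names are invented for illustration):

```python
from ConfigParser import ConfigParser  # configparser in Python 3
from StringIO import StringIO

config = ConfigParser()
config.readfp(StringIO('[server]\nhost = localhost\nport = 8080\n'))
print config.get('server', 'host')     # 'localhost'
print config.getint('server', 'port')  # 8080
```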
Understanding freeform content

This category contains data that does not fall into the previous two groupings: e-mail message content, letters, article copy, and other unstructured character-based content. This is where we'll largely have to look at building our own processing components. There are external packages available to us if we wish to perform common functions; some examples include full text searching and more advanced natural language processing.

Ensuring you have Python installed

Our first order of business is to ensure that you have Python installed. We'll be working with Python 2.6 and we assume that you're using that same version. If there are any drastic differences in earlier releases, we'll make a note of them as we go along. All of the examples should still function properly with Python 2.4 and later versions.

If you don't have Python installed, you can download the latest 2.X version from http://www.python.org. Most Linux distributions, as well as Mac OS, usually have a version of Python preinstalled. At the time of this writing, Python 2.6 was the latest version available, while 2.7 was in an alpha state.

Providing support for Python 3

The examples in this book are written for Python 2. However, wherever possible, we will provide code that has already been ported to Python 3. You can find the Python 3 code in the Python3 directories in the code bundle available on the Packt Publishing FTP site. Unfortunately, we can't promise that all of the third-party libraries that we'll use will support Python 3. The Python community is working hard to port popular modules to version 3.0; however, as the versions are incompatible, there is a lot of work remaining. In situations where we cannot provide example code, we'll note this.

Implementing a simple cipher

Let's get going early here and implement our first script to get a feel for what's in store. A Caesar Cipher is a simple form of cryptography in which each letter of the alphabet is shifted down by a number of letters. Ciphers like this are generally of no cryptographic use when applied alone, but they do have some valid applications when paired with more advanced techniques. For example, in a cipher with an offset of three, any X found in the source data would simply become an A in the output data; likewise, any A found in the input data would become a D.

Time for action – implementing a ROT13 encoder

The most popular implementation of this system is ROT13. As its name suggests, ROT13 shifts – or rotates – each letter by 13 places to produce an encrypted result. As the English alphabet has 26 letters, we simply run it a second time on the encrypted text in order to get back to our original result. Let's implement a simple version of that algorithm.

1. Start your favorite text editor and create a new Python source file. Save it as rot13.py.

2. Enter the following code exactly as you see it below and save the file.

```python
import sys
import string

CHAR_MAP = dict(zip(
    string.ascii_lowercase,
    string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
))

def rotate13_letter(letter):
    """
    Return the 13-char rotation of a letter.
    """
    do_upper = False
    if letter.isupper():
        do_upper = True
        letter = letter.lower()
    if letter not in CHAR_MAP:
        return letter
    else:
        letter = CHAR_MAP[letter]
        if do_upper:
            letter = letter.upper()
    return letter

if __name__ == '__main__':
    for char in sys.argv[1]:
        sys.stdout.write(rotate13_letter(char))
    sys.stdout.write('\n')
```

3. Now, from a command line, execute the script as follows. If you've entered all of the code correctly, you should see the same output.

```
$ python rot13.py 'We are the knights who say, nee!'
Jr ner gur xavtugf jub fnl, arr!
```

4. Run the script a second time, using the output of the first run as the new input string. If everything was entered correctly, the original text should be printed to the console.

```
$ python rot13.py 'Jr ner gur xavtugf jub fnl, arr!'
We are the knights who say, nee!
```
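As an aside, Python 2 ships with a rot13 text codec, so the same transformation is available as a one-liner (the codec was removed in Python 3, so this sketch is Python 2 only):

```python
# Python 2 only: str.encode with the built-in rot13 codec.
print 'We are the knights who say, nee!'.encode('rot13')
# Jr ner gur xavtugf jub fnl, arr!
```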
What just happened?

We implemented a simple text-oriented cipher using a collection of Python's string-handling features, and we saw it put to use for both encoding and decoding source text. We covered a lot in this little example, so you should have a good feel for what can be accomplished using the standard Python string object.

Following our initial module imports, we defined a dictionary named CHAR_MAP, which gives us a nice and simple way to shift our letters by the required 13 places: the value of a dictionary key is the target letter! We also took advantage of string slicing here.

In our translation function, rotate13_letter, we checked whether our input character was uppercase or lowercase and saved that as a Boolean attribute. We then forced our input to lowercase for the translation work. As ROT13 operates on letters alone, we only performed a rotation if our input character was a letter of the Latin alphabet; we allowed other values to simply pass through. We could have just as easily forced our string to a pure uppercased value.

The last thing we do in our function is restore the letter to its proper case, if necessary. This should familiarize you with upper- and lowercasing of Python ASCII strings. We're able to change the case of an entire string using this same method; it's not limited to single characters:

```
>>> name = 'Ryan Miller'
>>> name.upper()
'RYAN MILLER'
>>> "PLEASE DO NOT SHOUT".lower()
'please do not shout'
>>>
```

It's worth pointing out here that a single-character string is still a string. There is no char type, which you may be familiar with if you're coming from a different language such as C or C++. However, it is possible to translate between character ASCII codes and back using the ord and chr built-in functions and a string with a length of one.
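A quick illustration of ours of the ord/chr round trip:

```python
# ord maps a one-character string to its ASCII code; chr reverses it.
print ord('a')           # 97
print chr(97)            # 'a'
print chr(ord('a') + 1)  # 'b'
```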
Notice how we were able to loop through a string directly using the Python for syntax. A string object is a standard Python iterable, and we can walk through it as detailed below. In practice, however, this isn't something you'll normally do; in most cases, it makes sense to rely on existing libraries.

```
$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> for char in "Foo":
...     print char
...
F
o
o
>>>
```

Finally, you should note that we ended our script with an if statement such as the following:

```python
if __name__ == '__main__':
```

Python modules all contain an internal __name__ variable that corresponds to the name of the module. If a module is executed directly from the command line, as this script is, its name value is set to __main__, so this driver code only runs if we've executed the script directly. It will not run if we import this code from a different script. You can import the code directly from the command line and see for yourself:

```
$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rot13
>>> dir(rot13)
['CHAR_MAP', '__builtins__', '__doc__', '__file__', '__name__',
'__package__', 'rotate13_letter', 'string', 'sys']
>>>
```

Notice how we were able to import our module and see all of the methods and attributes inside of it, but the driver code did not execute.

Have a go hero – more translation work

Each Python string instance contains a collection of methods that operate on one or more characters. You can easily display all of the available methods and attributes by using the dir function. For example, enter the following command into a Python window. Python responds by printing a list of all methods on a string object:

```
>>> dir("content")
['__add__', '__class__', '__contains__', '__delattr__', '__doc__',
'__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
'__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__',
'__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__',
'__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__',
'__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'_formatter_field_name_split', '_formatter_parser', 'capitalize',
'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find',
'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace',
'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition',
'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip',
'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title',
'translate', 'upper', 'zfill']
>>>
```

Much like the isupper and islower methods discussed previously, we also have an isspace method. Using this method, in combination with your newfound knowledge of Python strings, update the method we defined previously to translate spaces to underscores and underscores to spaces.
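One possible solution of ours, extending the translation function rather than replacing it (CHAR_MAP is the same mapping defined in rot13.py):

```python
import string

CHAR_MAP = dict(zip(
    string.ascii_lowercase,
    string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
))

def rotate13_letter(letter):
    """Rotate letters by 13 places; swap spaces and underscores."""
    if letter == ' ':
        return '_'
    if letter == '_':
        return ' '
    do_upper = letter.isupper()
    letter = letter.lower()
    if letter not in CHAR_MAP:
        return letter
    letter = CHAR_MAP[letter]
    return letter.upper() if do_upper else letter
```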
Processing structured markup with a filter

Our ROT13 application works great for simple one-line strings that we can fit on the command line. However, it wouldn't work very well if we wanted to encode an entire file, such as the HTML document we took a look at earlier. In order to support larger text documents, we'll need to change the way we accept input. We'll redesign our application to work as a filter.

A filter is an application that reads data from its standard input file descriptor and writes to its standard output file descriptor. This allows users to create command pipelines that let multiple utilities be strung together. If you've ever typed a command such as cat /etc/hosts | grep mydomain.com, you've set up a pipeline. In many circumstances, data is fed into the pipeline via the keyboard and completes its journey when a processed result is displayed on the screen.

Time for action – processing as a filter

Let's make the changes required to allow our simple ROT13 processor to work as a command-line filter. This will allow us to process larger files.

1. Create a new source file and enter the following code. When complete, save the file as rot13-b.py.

```python
import sys
import string

CHAR_MAP = dict(zip(
    string.ascii_lowercase,
    string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
))

def rotate13_letter(letter):
    """
    Return the 13-char rotation of a letter.
    """
    do_upper = False
    if letter.isupper():
        do_upper = True
        letter = letter.lower()
    if letter not in CHAR_MAP:
        return letter
    else:
        letter = CHAR_MAP[letter]
        if do_upper:
            letter = letter.upper()
    return letter

if __name__ == '__main__':
    for line in sys.stdin:
        for char in line:
            sys.stdout.write(rotate13_letter(char))
```

2. Enter the following HTML data into a new text file and save it as sample_page.html. We'll use this as example input to our updated script.

```html
<html>
  <head>
    <title>Hello, World!</title>
  </head>
  <body>
    <p>
      Hi there, all of you earthlings.
    </p>
    <p>
      Take us to your leader.
    </p>
  </body>
</html>
```

3. Now, run our rot13-b.py example and provide our HTML document as standard input data. The exact method used will vary with your operating system. If you've entered the code successfully, you should simply see a new prompt:

```
$ cat sample_page.html | python rot13-b.py > rot13.html
$
```

4. Open the translated HTML file using your web browser. The contents of rot13.html should have every letter – including those inside the markup tags – rotated by 13 places. If that's not the case, double back and make sure everything is correct.

What just happened?

We updated our rot13.py script to read standard input data rather than rely on a command-line option. Doing this provides optimal configurability going forward and lets us feed input of varying length from a collection of different sources. We did this by looping on each line available on the sys.stdin file stream and calling our translation function. We wrote each character returned by that function to the sys.stdout stream.

Next, we ran our updated script via the command line, using sample_page.html as input. As expected, the encoded version was written out.

As you can see, there is a major problem with our output. We should have a proper page title and our content should be broken down into different paragraphs. Remember, structured markup text is sprinkled with tag elements that define its structure and organization. In this example, we not only translated the text content, we also translated the markup tags, rendering them meaningless. A web browser would not be able to display this data properly. We'll need to update our processor code to ignore the tags. We'll do just that in the next section.
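As a preview of that fix, the standard library's HTMLParser can separate tags from character data, letting us rotate only the text. A rough sketch of ours (tag attributes are dropped for brevity):

```python
from HTMLParser import HTMLParser  # html.parser in Python 3

class Rot13Filter(HTMLParser):
    """Pass tags through untouched; ROT13-encode only text content."""
    def handle_starttag(self, tag, attrs):
        print '<%s>' % tag,
    def handle_endtag(self, tag):
        print '</%s>' % tag,
    def handle_data(self, data):
        print data.encode('rot13'),

Rot13Filter().feed('<p>Take us to your leader.</p>')
# <p> Gnxr hf gb lbhe yrnqre. </p>
```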


Microsoft SQL Azure Tools

Packt
07 Jan 2011
6 min read
Microsoft SQL Azure Enterprise Application Development

Build enterprise-ready applications and projects with SQL Azure.

- Develop large scale enterprise applications using Microsoft SQL Azure
- Understand how to use the various third-party programs such as DB Artisan, RedGate, and ToadSoft developed for SQL Azure
- Master the exhaustive data migration and data synchronization aspects of SQL Azure
- Includes SQL Azure projects in incubation and more recent developments, including all 2010 updates

SQL Azure is a subset of SQL Server 2008 R2 and, as such, the tools that are used with SQL Server can also be used with SQL Azure, albeit with only some of the features supported. The Microsoft tools can be divided further between those that are accessed from the Visual Studio series of products and those that are not. However, it appears that with Visual Studio 2010, SSMS may be largely redundant for the most commonly used tasks.

Visual Studio related

From the very beginning, Visual Studio supported developing data-centric applications working together with the existing version of SQL Server, as well as MS Access databases and databases from third parties. The various client APIs such as ODBC, OLE DB, and ADO.NET made this happen, not only for direct access to manipulate data on the server, but also for supporting web-facing applications that interact with the databases.

Two versions are of particular interest to SQL Azure, as they allow for creating applications that consume data from the SQL Azure Server: Visual Studio 2008 SP1 and Visual Studio 2010 RC, which was recently (April 12, 2010) released for production. The new key features of the more recent update to SQL Azure are described here: http://hodentek.blogspot.com/2010/04/features-galore-for-sql-azure.html. It may be noted that the SQL Azure portal provides the all-important connection strings that are crucial for connecting to SQL Azure using Visual Studio.

VS2008

Although you can access and work with SQL Azure using Visual Studio 2008, it does not support establishing a data connection to SQL Azure using the graphic user interface, like you can with an on-site application. This has been remedied in Visual Studio 2010.

VS2010

Visual Studio 2010 has a tighter integration with many more features than Visual Studio 2008 SP1, and it is earmarked to make the cloud offering more attractive. Of particular interest to SQL Azure is the support it offers in making a connection to SQL Azure through its interactive graphic user interface, and the recently introduced feature supporting Data-tier Applications. A summary of Data-tier Applications is here: http://hodentek.blogspot.com/2010/02/working-with-data-gotten-lot-easier.html.

SQLBulkCopy for data transfer

In .NET 2.0, the SQLBulkCopy class in the System.Data.SqlClient namespace was introduced. This class makes it easy to move data from one server to another using Visual Studio. An example is described in the next chapter using Visual Studio 2010 RC, but a similar method can be adopted in Visual Studio 2008.

SQL Server Business Intelligence Development Studio (BIDS)

The Business Intelligence Development Studio (BIDS) falls under both SQL Server 2008 and Visual Studio. The tight integration of Visual Studio with SQL Server was instrumental in the development of BIDS.
Starting from a successful introduction in Visual Studio 2005, more enhancements were added in Visual Studio 2008, both to Integration Services and to Reporting Services, two of the main features of BIDS. BIDS is available as a part of the Visual Studio shell when you install the recommended version of SQL Server. Even if you do not have Visual Studio installed, you get the part of Visual Studio that is needed for developing business intelligence-related Integration Services and Reporting Services applications.

SQL Server Integration Services

Microsoft SQL Server Integration Services (SSIS) is a comprehensive data integration service that superseded Data Transformation Services. Through its connection managers, it can establish connections to a variety of data sources, including SQL Azure. Many data-intensive tasks from on-site to SQL Azure can be carried out in SSIS.

SQL Server Reporting Services

SQL Server Reporting Services (SSRS) is a comprehensive reporting package that consists of a Report Server, tightly integrated with SQL Server 2008 R2, and a web-based frontend client, the Report Manager. SSRS can spin off reports from data stored on SQL Azure through its powerful data-binding interface.

Entity Framework Provider

Like the ODBC, OLE DB, and ADO.NET data providers, Entity Framework also features an Entity Framework Provider, although it does not connect to SQL Azure like the others. Using the Entity Framework Provider you can create data services for a SQL Azure database, which .NET client applications can then access. In order to work with the Entity Framework Provider you need to install the Windows Azure SDK (there are several versions of these), which provides appropriate templates. Presently, the Entity Framework Provider cannot create the needed files for a database on SQL Azure, so a workaround is adopted.

SQL Server related

The tools described in this section can directly access SQL Server 2008 R2. The Import/Export Wizard can access not only SQL Servers but also products from other vendors to which a connection can be established. Scripting support, BCP, and SSRS are all effective when the database server is installed, and are part of the SQL Server 2008 R2 installation. SQL Server Integration Services is tightly integrated with SQL Server 2008 R2, which can store packages created using SSIS. Data access technologies and self-service business integration technologies are developing rapidly and will impact cloud-based applications, including solutions using SQL Azure.

SQL Server Management Studio

SQL Server Management Studio (SSMS) has been the workhorse of the SQL Servers from the very beginning, and it also supports working with the SQL Azure Server. SQL Server 2008 did not fully support SQL Azure, except that it did enable connection to SQL Azure. This was improved in the SQL Server 2008 R2 November CTP and the versions that appeared later. SSMS also provides the SQL Azure template, which includes most of the commands you will be using in SQL Azure.

Import/Export Wizard

The Import/Export Wizard has been present in SQL Servers even from earlier versions to create DTS packages of a simple nature. It can be started from the command line using DTSWiz.exe or DTSWizard.exe; in the same folder, you can access all dts-related files. You can double-click Import and Export Data (32-bit) in Start | All Programs | Microsoft SQL Server 2008 R2, or run DTSWizard.exe from a DOS prompt to launch the wizard.
The Import/Export Wizard may be able to connect to SQL Azure using any of the following:

- ODBC DSN
- SQL Server Native Client 10.0
- .NET Data Source Provider for SqlServer

The Import/Export Wizard can connect to the SQL Azure server using an ODBC DSN. Although connection was possible, the export or import was not possible in the CTP version due to some unsupported stored procedures. Import/Export works with the .NET Framework Data Provider, but requires some tweaking.


Granting Access in MySQL for Python

Packt
28 Dec 2010
10 min read
MySQL for Python

Integrate the flexibility of Python and the power of MySQL to boost the productivity of your Python applications.

- Implement the outstanding features of Python's MySQL library to their full potential
- See how to make MySQL take the processing burden from your programs
- Learn how to employ Python with MySQL to power your websites and desktop applications
- Apply your knowledge of MySQL and Python to real-world problems instead of hypothetical scenarios
- A manual packed with step-by-step exercises to integrate your Python applications with the MySQL database server

Introduction

As with creating a user, granting access can be done by modifying the mysql tables directly. However, this method is error-prone and dangerous to the stability of the system and is, therefore, not recommended.

Important dynamics of GRANTing access

Where CREATE USER causes MySQL to add a user account, it does not specify that user's privileges. In order to grant a user privileges, the account of the user granting the privileges must meet two conditions:

- Be able to exercise those privileges in their account
- Have the GRANT OPTION privilege on their account

Therefore, it is not just users who have a particular privilege, or only users with the GRANT OPTION privilege, who can authorize a particular privilege for a user, but only users who meet both requirements.

Further, privileges that are granted do not take effect until the user's first login after the command is issued. Therefore, if the user is logged into the server at the time you grant access, the changes will not take effect immediately.

The GRANT statement in MySQL

The syntax of a GRANT statement is as follows:

```sql
GRANT <privileges> ON <database>.<table> TO '<userid>'@'<hostname>';
```

Proceeding from the end of the statement, the userid and hostname follow the same pattern as with the CREATE USER statement. Therefore, if a user is created with a hostname specified as localhost and you grant access to that user with a hostname of '%', they will encounter a 1044 error stating access is denied.

The database and table values must be specified individually or collectively. This allows us to customize access to individual tables as necessary. For example, to specify access to the city table of the world database, we would use world.city. In many instances, however, you are likely to grant the same access to a user for all tables of a database. To do this, we use the universal quantifier ('*'). So to specify all tables in the world database, we would use world.*. We can apply the asterisk to the database field as well: to specify all databases and all tables, we can use *.*. MySQL also recognizes the shorthand * for this.

Finally, the privileges can be singular or a series of comma-separated values. If, for example, you want a user to only be able to read from a database, you would grant them only the SELECT privilege. For many users and applications, reading and writing is necessary, but no ability to modify the database structure is warranted. In such cases, we can grant the user account both SELECT and INSERT privileges with SELECT, INSERT. To learn which privileges have been granted to the user account you are using, use the statement SHOW GRANTS FOR '<user>'@'<hostname>';.
With this in mind, if we wanted to grant a user tempo all access to all tables in the music database, but only when accessing the server locally, we would use this statement:

```sql
GRANT ALL PRIVILEGES ON music.* TO 'tempo'@'localhost';
```

Similarly, if we wanted to restrict access to reading and writing when logging in remotely, we would change the above statement to read:

```sql
GRANT SELECT,INSERT ON music.* TO 'tempo'@'%';
```

If we wanted user conductor to have complete access to everything when logged in locally, we would use:

```sql
GRANT ALL PRIVILEGES ON * TO 'conductor'@'localhost';
```

Building on the second example statement, we can further specify the exact privileges we want on the columns of a table by including the column names in parentheses after each privilege. Hence, if we want tempo to be able to read from columns col3 and col4, but only write to column col4, of the sheets table in the music database, we would use this command:

```sql
GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%';
```

Note that specifying columnar privileges is only available when specifying a single database table – use of the asterisk as a universal quantifier is not allowed. Further, this syntax is allowed only for three types of privileges: SELECT, INSERT, and UPDATE. A list of privileges that are available through MySQL is reflected in the following table. (MySQL does not support the standard SQL UNDER privilege, and did not support the use of TRIGGER until MySQL 5.1.6.) More information on MySQL privileges can be found at http://dev.mysql.com/doc/refman/5.5/en/privileges-provided.html.

Using REQUIREments of access

Using GRANT with a REQUIRE clause causes MySQL to use SSL encryption. The standard used by MySQL for SSL is the X.509 standard of the International Telecommunication Union's (ITU) Standardization Sector (ITU-T). It is a commonly used public-key encryption standard for single sign-on systems. Parts of the standard are no longer in force; you can read about the parts which still apply on the ITU website at http://www.itu.int/rec/T-REC-X.509/en.

The REQUIRE clause takes the following arguments, with their respective meanings, and follows the format of their respective examples.

NONE: The user account has no requirement for an SSL connection. This is the default.

```sql
GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%';
```

SSL: The client must use an SSL-encrypted connection to log in. In most MySQL clients, this is satisfied by using the --ssl-ca option at the time of login. Specifying the key or certificate is optional.

```sql
GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE SSL;
```

X509: The client must use SSL to log in. Further, the certificate must be verifiable with one of the CA vendors. This option further requires the client to use the --ssl-ca option, as well as specifying the key and certificate using --ssl-key and --ssl-cert, respectively.

```sql
GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE X509;
```

CIPHER: Specifies the type and order of ciphers to be used.

```sql
GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE CIPHER 'RSA-EDH-CBC3-DES-SHA';
```
ISSUER: Specifies the issuer from whom the certificate used by the client is to come. The user will not be able to log in without a certificate from that issuer.

```sql
GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE ISSUER 'C=ZA, ST=Western Cape, L=Cape Town, O=Thawte Consulting cc, OU=Certification Services Division, CN=Thawte Server CA/emailAddress=server-certs@thawte.com';
```

SUBJECT: Specifies the subject contained in the certificate that is valid for that user. The use of a certificate containing any other subject is disallowed.

```sql
GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' REQUIRE SUBJECT 'C=US, ST=California, L=Pasadena, O=Indiana Grones, OU=Raiders, CN=www.lostarks.com/emailAddress=indy@lostarks.com';
```

Using a WITH clause

MySQL's WITH clause is helpful in limiting the resources assigned to a user. WITH takes the following options:

- GRANT OPTION: Allows the user to pass any privilege they have been granted on to other users
- MAX_QUERIES_PER_HOUR: Caps the number of queries that the account is allowed to request in one hour
- MAX_UPDATES_PER_HOUR: Limits how frequently the user is allowed to issue UPDATE statements to the database
- MAX_CONNECTIONS_PER_HOUR: Limits the number of logins that a user is allowed to make in one hour
- MAX_USER_CONNECTIONS: Caps the number of simultaneous connections that the user can make at one time

It is important to note that the GRANT OPTION argument to WITH has a timeless aspect. It does not apply statically to just the privileges the user has at the time of issuance; if left in effect, it applies to any privileges the user holds at any point in time. So, if the user is granted the GRANT OPTION for a temporary period, but the option is never removed, then as that user grows in responsibilities and privileges, that user can grant those new privileges to any other user. Therefore, one must remove the GRANT OPTION when it is no longer appropriate.

Note also that if a user with access to a particular MySQL database has the ALTER privilege and is then granted the GRANT OPTION privilege, that user can then grant ALTER privileges to a user who has access to the mysql database, thus circumventing the administrative privileges otherwise needed.

The WITH clause follows all other options given in a GRANT statement. So, to grant user tempo the GRANT OPTION, we would use the following statement:

```sql
GRANT SELECT (col3,col4),INSERT (col4) ON music.sheets TO 'tempo'@'%' WITH GRANT OPTION;
```

If we also want to limit the number of queries that the user can issue in one hour to five, we simply add to the argument of the single WITH statement. We do not need to use WITH a second time.

```sql
GRANT SELECT,INSERT ON music.sheets TO 'tempo'@'%' WITH GRANT OPTION MAX_QUERIES_PER_HOUR 5;
```

More information on the many uses of WITH can be found at http://dev.mysql.com/doc/refman/5.1/en/grant.html.
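Given the warning above about a lingering GRANT OPTION, the counterpart statement is worth keeping at hand; a short sketch using the same user and table as the running example:

```sql
-- Remove only the ability to grant privileges to others.
REVOKE GRANT OPTION ON music.sheets FROM 'tempo'@'%';

-- Or strip the account's privileges on the table entirely.
REVOKE SELECT, INSERT ON music.sheets FROM 'tempo'@'%';
```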
Assuming that user skipper has the GRANT option as well as the other necessary privileges, we can use the following code to create a new user, set that user's password, and grant that user privileges:

#!/usr/bin/env python

import MySQLdb

host = 'localhost'
user = 'skipper'
passwd = 'secret'

mydb = MySQLdb.connect(host, user, passwd)
cursor = mydb.cursor()

try:
    # Create the new user account.
    mkuser = 'symphony'
    creation = "CREATE USER '%s'@'%s'" % (mkuser, host)
    results = cursor.execute(creation)
    print "User creation returned", results

    # Assign the account a password.
    mkpass = 'n0n3wp4ss'
    setpass = "SET PASSWORD FOR '%s'@'%s' = PASSWORD('%s')" % (mkuser, host, mkpass)
    results = cursor.execute(setpass)
    print "Setting of password returned", results

    # Grant the new user global privileges.
    granting = "GRANT ALL ON *.* TO '%s'@'%s'" % (mkuser, host)
    results = cursor.execute(granting)
    print "Granting of privileges returned", results

except MySQLdb.Error, e:
    print e

If there is an error anywhere along the way, it is printed to screen. Otherwise, the several print statements are executed. As long as they all return 0, each step was successful.
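If you would rather confirm the result from within Python than at the MySQL prompt, MySQL's SHOW GRANTS statement will echo back every privilege recorded for the account. The following is a minimal sketch of that check, reusing the skipper credentials from above (it is an illustration, not part of the original listing):

# verify_grants.py
import MySQLdb

mydb = MySQLdb.connect('localhost', 'skipper', 'secret')
cursor = mydb.cursor()

# Each row returned is a one-item tuple holding a GRANT statement.
cursor.execute("SHOW GRANTS FOR 'symphony'@'localhost'")
for (grant,) in cursor.fetchall():
    print grant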

Getting Up and Running with MySQL for Python

Packt
24 Dec 2010
8 min read
MySQL for Python: Integrate the flexibility of Python and the power of MySQL to boost the productivity of your Python applications
Implement the outstanding features of Python's MySQL library to their full potential
See how to make MySQL take the processing burden from your programs
Learn how to employ Python with MySQL to power your websites and desktop applications
Apply your knowledge of MySQL and Python to real-world problems instead of hypothetical scenarios
A manual packed with step-by-step exercises to integrate your Python applications with the MySQL database server

Getting MySQL for Python

How you get MySQL for Python depends on your operating system and the level of authorization you have on it. In the following subsections, we walk through the common operating systems and see how to get MySQL for Python on each.

Using a package manager (only on Linux)

Package managers are used regularly on Linux, but none come by default with Macintosh and Windows installations. So users of those systems can skip this section.

A package manager takes care of downloading, unpacking, installing, and configuring new software for you. In order to use one to install software on your Linux installation, you will need administrative privileges. Administrative privileges on a Linux system can be obtained legitimately in one of the following three ways:

Log into the system as the root user (not recommended)
Switch user to the root user using su
Use sudo to execute a single command as the root user

The first two require knowledge of the root user's password. Logging into a system directly as the root user is not recommended, due to the fact that there is no indication in the system logs as to who used the root account. Logging in as a normal user and then switching to root using su is better because it keeps an account of who did what on the machine and when. Either way, if you access the root account, you must be very careful because small mistakes can have major consequences. Unlike other operating systems, Linux assumes that you know what you are doing if you access the root account and will not stop you from going so far as deleting every file on the hard drive.

Unless you are familiar with Linux system administration, it is far better, safer, and more secure to prefix the sudo command to the package manager call. This will give you the benefit of restricting use of administrator-level authority to a single command. The chances of catastrophic mistakes are therefore mitigated to a great degree. More information on any of these commands is available by prefacing either man or info before any of the preceding commands (su, sudo).

Which package manager you use depends on which of the two mainstream package management systems your distribution uses. Users of RedHat, Fedora, SUSE, or Mandriva will use the RPM Package Manager (RPM) system. Users of Debian, Ubuntu, and other Debian derivatives will use the apt suite of tools available for Debian installations. Each is discussed in the following sections.

Using RPMs and yum

If you use SUSE, RedHat, or Fedora, the operating system comes with the yum package manager. You can see if MySQLdb is known to the system by running a search (here using sudo):

sudo yum search mysqldb

If yum returns a hit, you can then install MySQL for Python with the following command:

sudo yum install mysqldb

Using RPMs and urpm

If you use Mandriva, you will need to use the urpm package manager in a similar fashion.
To search, use urpmq:

sudo urpmq mysqldb

And to install, use urpmi:

sudo urpmi mysqldb

Using apt tools on Debian-like systems

Whether you run a version of Ubuntu, Xandros, or Debian, you will have access to aptitude, the default Debian package manager. Using sudo, we can search for MySQLdb in the apt sources using the following command:

sudo aptitude search mysqldb

On most Debian-based distributions, MySQL for Python is listed as python-mysqldb. Once you have found how apt references MySQL for Python, you can install it using the following code:

sudo aptitude install python-mysqldb

Using a package manager automates the entire process, so you can move to the section Importing MySQL for Python.

Using an installer for Windows

Windows users will need to use the older 1.2.2 version of MySQL for Python. Using a web browser, go to the following link:

http://sourceforge.net/projects/mysql-python/files/

This page offers a listing of all available files for all platforms. At the end of the file listing, find mysql-python and click on it. The listing will unfold to show folders containing versions of MySQL for Python back to 0.9.1. The version we want is 1.2.2.

Windows binaries do not currently exist for the 1.2.3 version of MySQL for Python. To get them, you would need to install a C compiler on your Windows installation and compile the binary from source.

Click on 1.2.2 and unfold the file listing. As you will see, the Windows binaries are differentiated by Python version: both 2.4 and 2.5 are supported. Choose the one that matches your Python installation and download it. Note that all available binaries are for 32-bit Windows installations, not 64-bit.

After downloading the binary, installation is a simple matter of double-clicking the installation EXE file and following the dialogue. Once the installation is complete, the module is ready for use. So go to the section Importing MySQL for Python.

Using an egg file

One of the easiest ways to obtain MySQL for Python is as an egg file, and it is best to use one of those files if you can. Several advantages can be gained from working with egg files, such as:

They can include metadata about the package, including its dependencies
They allow for the use of egg-aware software, a helpful level of abstraction
Eggs can, technically, be placed on the Python executable path and used without unpacking
They save the user from installing packages for which they do not have the appropriate version of software
They are so portable that they can be used to extend the functionality of third-party applications

Installing egg handling software

One of the best known egg utilities, Easy Install, is available from the PEAK Developers' Center at http://peak.telecommunity.com/DevCenter/EasyInstall. How you install it depends on your operating system and whether you have package management software available. In the following section, we look at several ways to install Easy Install on the most common systems.

Using a package manager (Linux)

On Ubuntu you can try the following to install the easy_install tool (if not available already):

shell> sudo aptitude install python-setuptools

On RedHat or CentOS you can try using the yum package manager:

shell> sudo yum install python-setuptools

On Mandriva use urpmi:

shell> sudo urpmi python-setuptools

You must have administrator privileges to do the installations just mentioned.
Without a package manager (Mac, Linux)

If you do not have access to a Linux package manager, but nonetheless have a Unix variant as your operating system (for example, Mac OS X), you can install Python's setuptools manually. Go to:

http://pypi.python.org/pypi/setuptools#files

Download the relevant egg file for your Python version. When the file is downloaded, open a terminal and change to the download directory. From there you can run the egg file as a shell script. For Python 2.5, the command would look like this:

sh setuptools-0.6c11-py2.5.egg

This will install several files, but the most important one for our purposes is easy_install, usually located in /usr/bin.

On Microsoft Windows

On Windows, one can download the setuptools suite from the following URL:

http://pypi.python.org/pypi/setuptools#files

From the list located there, select the most appropriate Windows executable file. Once the download is completed, double-click the installation file and proceed through the dialogue. The installation process will set up several programs, but the one important for our purposes is easy_install.exe. Where this is located will differ by installation and may require using the search function from the Start Menu. On 64-bit Windows, for example, it may be in the Program Files (x86) directory. If in doubt, do a search. On Windows XP with Python 2.5, it is located here:

C:\Python25\Scripts\easy_install.exe

Note that you may need administrator privileges to perform this installation. Otherwise, you will need to install the software for your own use. Depending on the setup of your system, this may not always work. Installing software on Windows for your own use requires the following steps:

1. Copy the setuptools installation file to your Desktop.
2. Right-click on it and choose the runas option.
3. Enter the name of the user who has enough rights to install it (presumably yourself).

After the software has been installed, ensure that you know the location of the easy_install.exe file. You will need it to install MySQL for Python.
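With easy_install in place, a quick smoke test confirms that MySQL for Python is importable. The following minimal sketch is my own addition; the version string you see will depend on the release your platform carries:

# check_mysqldb.py
import MySQLdb

# The version string confirms which release was installed, e.g. '1.2.2'.
print MySQLdb.__version__

# MySQLdb implements the Python DB-API 2.0 specification.
print MySQLdb.apilevel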
Tips & Tricks on MySQL for Python

Packt
23 Dec 2010
5 min read
Objective: Install a C compiler on a Windows installation.
Tip: Windows binaries do not currently exist for the 1.2.3 version of MySQL for Python. To get them, you would need to install a C compiler on your Windows installation and compile the binary from source.

Objective: Use a tar.gz archive when an egg file will not do.
Tip: If you cannot use egg files, or if you use an earlier version of Python, you should use the tar.gz file, a tar and gzip archive. The tar.gz archive follows the Linux egg files in the file listing. The current version of MySQL for Python is 1.2.3c1, so the file we want is the following: MySQL-python-1.2.3c1.tar.gz. This method is by far more complicated than the others. If at all possible, use your operating system's installation method or an egg file.

Objective: Know the Python version limitations of MySQL for Python.
Tip: This version of MySQL for Python is compatible up to Python 2.6. It is worth noting that MySQL for Python has not yet been released for Python 3.0 or later versions. In your deployment of the library, therefore, ensure that you are running Python 2.6 or earlier. As noted, Python 2.5 and 2.6 have version-specific releases. Prior to Python 2.4, you will need to use either a tar.gz version of the latest release or an older version of MySQL for Python. The latter option is not recommended.

Objective: Phrase each query in such a way as to narrow the returned values as much as possible.
Tip: Here, instead of returning whole records, we tell MySQL to return only the name column. This natural reduction in the data reduces processing time for both MySQL and Python. This saving is then passed on to your server in the form of more sessions able to run at one time.

Objective: Hard-wire the search query so you can test the connection before coding the rest of the function.
Tip: There may be a tendency here to insert user-determined variables immediately. With experience, it is possible to do this. However, if there are any doubts about the availability of the database, your best fallback position is to keep it simple and hardwired. This reduces the number of variables in making a connection and helps one to blackbox the situation, making troubleshooting much easier.

Objective: Readability counts.
Tip: The virtue of readability in programming is often couched in terms of being kind to the next developer who works on your code. There is more at stake, however. With readability comes not only maintainability but control. If it takes you too much effort to understand the code you have written, you will have a harder time controlling the program's flow, and this will result in unintended behavior. The natural consequence of unintended program behavior is the compromising of process stability and system security.

Objective: Know when quote marks are necessary when assigning MySQL statements.
Tip: It is not necessary to use triple quote marks when assigning the MySQL sentence to statement or when passing it to execute(). However, if you used only a single pair of either double or single quotes, it would be necessary to escape every similar quote mark. As a stylistic rule, it is typically best to switch to verbatim mode with the triple quote marks in order to ensure the readability of your code.

Objective: Prefer xrange() to range() for memory efficiency.
Tip: The differences between xrange() and range() are often overlooked or even ignored. Both count through the same values, but they do it differently. Where range() calculates a list the first time it is called and then stores it in memory, xrange() creates an immutable sequence that returns the next value in the series each time it is called. As a consequence, xrange() is much more memory efficient than range(), especially when dealing with large groups of integers. Because the series is never fully determined, however, it does not support functionality such as slicing, which range() does.

Objective: Make use of the autocommit feature in MySQL for Python.
Tip: Unless you are running several database threads at a time or have to deal with similar complexity, MySQL for Python does not require you to use either commit() or close(). Generally speaking, MySQL for Python installs with an autocommit feature switched on. It thus takes care of committing the changes for you when the cursor object is destroyed. Similarly, when the program terminates, Python tends to close the cursor and database connection as it destroys both objects.
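A short sketch of my own pulls a few of these tips together. The connection details and the music.sheets table are borrowed from earlier examples and stand in for your own; the query itself is hypothetical:

# tips_demo.py
import MySQLdb

mydb = MySQLdb.connect('localhost', 'skipper', 'secret', 'music')
cursor = mydb.cursor()

# Triple quotes let the statement read verbatim, with no escaping,
# and we ask only for the one column we actually need.
statement = """SELECT name FROM sheets WHERE name LIKE '%sonata%'"""
cursor.execute(statement)

# xrange() yields one integer at a time instead of building a full list.
for i in xrange(cursor.rowcount):
    print cursor.fetchone()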

Advanced Output Formats in Python 2.6 Text Processing

Packt
21 Dec 2010
11 min read
Python 2.6 Text Processing: Beginners Guide: The easiest way to learn how to manipulate text with Python
The easiest way to learn text processing with Python
Deals with the most important textual data formats you will encounter
Learn to use the most popular text processing libraries available for Python
Packed with examples to guide you through

We'll not dive into too much detail with any single approach. Rather, the goal of this article is to teach you the basics such that you can get started and further explore details on your own. Also, remember that our goal isn't to be pretty; it's to present a useable subset of functionality. In other words, our PDF layouts are ugly!

Unfortunately, the third-party packages used in this article are not yet compatible with Python 3. Therefore, the examples listed here will only work with Python 2.6 and 2.7.

Dealing with PDF files using PLATYPUS

The ReportLab framework provides an easy mechanism for dealing with PDF files. It provides a low-level interface, known as pdfgen, as well as a higher-level interface, known as PLATYPUS. PLATYPUS is an acronym, which stands for Page Layout and Typography Using Scripts. While the pdfgen framework is incredibly powerful, we'll focus on the PLATYPUS system here as it's slightly easier to deal with. We'll still use some of the lower-level primitives as we create and modify our PLATYPUS rendered styles.

The ReportLab Toolkit is not entirely Open Source. While the pieces we use here are indeed free to use, other portions of the library fall under a commercial license. We'll not be looking at any of those components here. For more information, see the ReportLab website, available at http://www.reportlab.com

Time for action – installing ReportLab

Like all of the other third-party packages we've installed thus far, the ReportLab Toolkit can be installed using SetupTools' easy_install command. Go ahead and do that now from your virtual environment. We've truncated the output in order to conserve space; only the last lines would be shown.

(text_processing)$ easy_install reportlab

What just happened?

The ReportLab package was downloaded and installed locally. Note that some platforms may require a C compiler in order to complete the installation process. To verify that the packages have been installed correctly, let's simply display the version tag.

(text_processing)$ python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import reportlab
>>> reportlab.Version
'2.4'
>>>

Generating PDF documents

In order to build a PDF document using PLATYPUS, we'll arrange elements onto a document template via a flow. The flow is simply a list element that contains our individual document components. When we finally ask the toolkit to generate our output file, it will merge all of our individual components together and produce a PDF.

Time for action – writing PDF with basic layout and style

In this example, we'll generate a PDF that contains a set of basic layout and style mechanisms. First, we'll create a cover page for our document. In a lot of situations, we want our first page to differ from the remainder of our output. We'll then use a different format for the remainder of our document.

Create a new Python file and name it pdf_build.py.
Copy the following code as it appears:

import sys

from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.platypus import Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.rl_config import defaultPageSize
from reportlab.lib.units import inch
from reportlab.lib import colors

class PDFBuilder(object):
    HEIGHT = defaultPageSize[1]
    WIDTH = defaultPageSize[0]

    def _intro_style(self):
        """Introduction Specific Style"""
        style = getSampleStyleSheet()['Normal']
        style.fontName = 'Helvetica-Oblique'
        style.leftIndent = 64
        style.rightIndent = 64
        style.borderWidth = 1
        style.borderColor = colors.black
        style.borderPadding = 10
        return style

    def __init__(self, filename, title, intro):
        self._filename = filename
        self._title = title
        self._intro = intro
        self._style = getSampleStyleSheet()['Normal']
        self._style.fontName = 'Helvetica'

    def title_page(self, canvas, doc):
        """
        Write our title page.

        Generates the top page of the deck,
        using some special styling.
        """
        canvas.saveState()
        canvas.setFont('Helvetica-Bold', 18)
        canvas.drawCentredString(
            self.WIDTH/2.0, self.HEIGHT-180, self._title)
        canvas.setFont('Helvetica', 12)
        canvas.restoreState()

    def std_page(self, canvas, doc):
        """
        Write our standard pages.
        """
        canvas.saveState()
        canvas.setFont('Helvetica', 9)
        canvas.drawString(inch, 0.75*inch, "%d" % doc.page)
        canvas.restoreState()

    def create(self, content):
        """
        Creates a PDF.

        Saves the PDF named in self._filename.
        The content parameter is an iterable; each
        line is treated as a standard paragraph.
        """
        document = SimpleDocTemplate(self._filename)
        flow = [Spacer(1, 2*inch)]

        # Set our font and print the intro
        # paragraph on the first page.
        flow.append(
            Paragraph(self._intro, self._intro_style()))
        flow.append(PageBreak())

        # Additional content
        for para in content:
            flow.append(
                Paragraph(para, self._style))
            # Space between paragraphs.
            flow.append(Spacer(1, 0.2*inch))

        document.build(
            flow, onFirstPage=self.title_page,
            onLaterPages=self.std_page)

if __name__ == '__main__':
    if len(sys.argv) != 5:
        print "Usage: %s <output> <title> <intro file> <content file>" % \
            sys.argv[0]
        sys.exit(-1)

    # Do Stuff
    builder = PDFBuilder(
        sys.argv[1], sys.argv[2], open(sys.argv[3]).read())

    # Generate the rest of the content from a text file
    # containing our paragraphs.
    builder.create(open(sys.argv[4]))

Next, we'll create a text file that will contain the introductory paragraph. We've placed it in a separate file so it's easier to manipulate. Enter the following into a text file named intro.txt.

This is an example document that we've created from scratch; it has no story to tell. It's purpose? To serve as an example.

Now, we need to create our PDF content. Let's add one more text file and name it paragraphs.txt. Feel free to create your own content here. Each new line will start a new paragraph in the resulting PDF. Our test data is as follows:

This is the first paragraph in our document and it really serves no meaning other than example text.
This is the second paragraph in our document and it really serves no meaning other than example text.
This is the third paragraph in our document and it really serves no meaning other than example text.
This is the fourth paragraph in our document and it really serves no meaning other than example text.
This is the final paragraph in our document and it really serves no meaning other than example text.
Now, let's run the PDF generation script:

(text_processing)$ python pdf_build.py output.pdf "Example Document" intro.txt paragraphs.txt

If you view the generated document in a reader, the generated pages should resemble the following screenshots. The first screenshot displays the clean title page, which we derive from the command-line arguments and the contents of the introduction file. The second screenshot contains document copy, which we also read from a file.

What just happened?

We used the ReportLab Toolkit to generate a basic PDF. In the process, you created two different layouts: one for the initial page and one for subsequent pages. The first page serves as our title page. We printed the document title and a summary paragraph. The second (and third, and so on) pages simply contain text data.

At the top of our code, as always, we import the modules and classes that we'll need to run our script. We import SimpleDocTemplate, Paragraph, Spacer, and PageBreak from the platypus module. These are items that will be added to our document flow. Next, we bring in getSampleStyleSheet. We use this method to generate a sample, or template, stylesheet that we can then change as we need. Stylesheets are used to provide appearance instructions to Paragraph objects here, much like they would be used in an HTML document. The last two lines import the inch size as well as some page size defaults. We'll use these to better lay out our content on the page. Note that everything here outside of the first line is part of the more general-purpose portion of the toolkit.

The bulk of our work is handled in the PDFBuilder class we've defined. Here, we manage our styles and hide the PDF generation logic. The first thing we do here is assign the default document height and width to class variables named HEIGHT and WIDTH, respectively. This is done to make our code easier to work with and to make for easier inheritance down the road.

The _intro_style method is responsible for generating the paragraph style information that we use for the introductory paragraph that appears in the box. First, we create a new stylesheet by calling getSampleStyleSheet. Next, we simply change the attributes that we wish to modify from default. The attributes set in _intro_style define the style used for the introductory paragraph, which is different from the standard style. Note that this is not an exhaustive list of the available style attributes; it simply details the variables that we've changed.

Next we have our __init__ method. In addition to setting variables corresponding to the arguments passed, we also create a new stylesheet. This time, we simply change the font used to Helvetica (default is Times New Roman). This will be the style we use for default text.

The next two methods, title_page and std_page, define layout functions that are called when the PDF engine generates both the first and subsequent pages. Let's walk through the title_page method in order to understand what exactly is happening. First, we save the current state of the canvas. This is a lower-level concept that is used throughout the ReportLab Toolkit. We then change the active font to a bold sans serif at 18 point. Next, we draw a string at a specific location in the center of the document. Lastly, we restore our state as it was before the method was executed.

If you take a quick look at std_page, you'll see that we're actually deciding how to write the page number. The library isn't taking care of that for us.
However, it does help us out by giving us the current page number in the doc object. Neither the std_page nor the title_page methods actually lay the text out. They're called when the pages are rendered to perform annotations. This means that they can do things such as write page numbers, draw logos, or insert callout information. The actual text formatting is done via the document flow.

The last method we define is create, which is responsible for driving title page creation and feeding the rest of our data into the toolkit. Here, we create a basic document template via SimpleDocTemplate. We'll flow all of our components onto this template as we define them. Next, we create a list named flow that contains a Spacer instance. The Spacer ensures we do not begin writing at the top of the PDF document. We then build a Paragraph containing our introductory text, using the style built in the self._intro_style method. We append the Paragraph object to our flow and then force a page break by also appending a PageBreak object. Next, we iterate through all of the lines passed into the method as content. Each generates a new Paragraph object with our default style. Finally, we call the build method of the document template object. We pass it our flow and two different methods to be called - one when building the first page and one when building subsequent pages.

Our __main__ section simply sets up calls to our PDFBuilder class and reads in our text files for processing.

The ReportLab Toolkit is very heavily documented and is quite easy to work with. For more information, see the documents available at http://www.reportlab.com/software/opensource/. There is also a code snippets library that contains some common PDF recipes.

Have a go hero – drawing a logo

The toolkit provides easy mechanisms for including graphics directly into a PDF document. JPEG images can be included without any additional library support. Using the documentation referenced earlier, alter our title_page method such that you include a logo image below the introductory paragraph.

Writing native Excel data

Here, we'll look at an advanced technique that allows us to write native Excel data (without requiring Microsoft Windows). To do this, we'll be using the xlwt package.

Time for action – installing xlwt

Again, like the other third-party modules we've installed thus far, xlwt can be downloaded and installed via the easy_install system. Activate your virtual environment and install it now. Your output should resemble the following:

(text_processing)$ easy_install xlwt

What just happened?

We installed the xlwt packages from the Python Package Index. To ensure your install worked correctly, start up Python and display the current version of the xlwt libraries.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import xlwt
>>> xlwt.__VERSION__
'0.7.2'
>>>

At the time of this writing, the xlwt module supports the generation of Excel xls format files, which are compatible with Excel 95 - 2003 (and later). MS Office 2007 and later utilizes Office Open XML (OOXML).
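Before going further, a tiny end-to-end sketch shows the shape of the xlwt API; the file name, sheet name, and cell values here are arbitrary choices of mine:

# xls_hello.py
import xlwt

workbook = xlwt.Workbook()
sheet = workbook.add_sheet('Example')

# write(row, column, value) fills a single cell.
sheet.write(0, 0, 'Month')
sheet.write(0, 1, 'Temperature')
sheet.write(1, 0, 'January')
sheet.write(1, 1, 9.5)

# Save the finished workbook as a binary xls file.
workbook.save('example.xls')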

PostgreSQL: Tips and Tricks

Packt
20 Dec 2010
7 min read
PostgreSQL 9.0 High Performance: Accelerate your PostgreSQL system
Learn the right techniques to obtain optimal PostgreSQL database performance, from initial design to routine maintenance
Discover the techniques used to scale successful database installations
Avoid the common pitfalls that can slow your system down
Filled with advice about what you should be doing: how to build experimental databases to explore performance topics, and then move what you've learned into a production database environment
Covers versions 8.1 through 9.0

Upgrading without any replication software
Tip: A program originally called pg_migrator, at http://pgfoundry.org/projects/pg-migrator/, is capable of upgrading from 8.3 to 8.4 without the dump and reload. This process is called in-place upgrading.

Minor version upgrades
Tip: One good way to check if you have contrib modules installed is to see if the pgbench program is available. That's one of the few contrib components that installs a full program, rather than just the scripts you can use.

Using an external drive for a database
Tip: External drives connected over USB or Firewire can be quite crippled in their abilities to report SMART and other error information, due to both the limitations of the common USB/Firewire bridge chipsets used to connect them and the associated driver software. They may not properly handle write caching for similar reasons. You should avoid putting a database on an external drive using one of those connection methods. Newer external drives using external SATA (eSATA) are much better in this regard, because they're no different from directly attaching the SATA device.

Implementing a software RAID
Tip: When implementing a RAID array, you can do so with special hardware intended for that purpose. Many operating systems nowadays, from Windows to Linux, include software RAID that doesn't require anything beyond the disk controller on your motherboard.

Driver support for Areca cards
Tip: Driver support for Areca cards depends heavily upon the OS you're using, so be sure to check this carefully. Under Linux, for example, you may have to experiment a bit to get a kernel whose Areca driver is extremely reliable, because this driver isn't popular enough to get a large amount of testing. The 2.6.22 kernel works well for several heavy PostgreSQL users with these cards.

Free space map (FSM) settings
Tip: Space left behind from deletions or updates of data is placed into a free space map by VACUUM, and then new allocations are done from that free space first, rather than by allocating new disk.

Using a single leftover disk
Tip: A typical use for a single leftover disk is to create a place to store non-critical backups and other working files, such as a database dump that needs to be processed before being shipped elsewhere.

Ignoring crash recovery
Tip: If you just want to ignore crash recovery altogether, you can do that by turning off the fsync parameter. This makes the value for wal_sync_method irrelevant, because the server won't be doing any WAL sync calls anymore.

Disk layout guideline
Tip: Avoid putting the WAL on the operating system drive, because they have completely different access patterns and both will suffer when combined. Normally this might work out fine initially, only to discover a major issue when the OS is doing things like a system update or daily maintenance activity. Rebuilding the filesystem database used by the locate utility each night is one common source on Linux for heavy OS disk activity.
Splitting WAL on Linux systems running ext3
Tip: On Linux systems running ext3, where fsync cache flushes require dumping the entire OS cache out to disk, split the WAL onto another disk as soon as you have a pair to spare for that purpose.

Common tuning techniques for good performance
Tip: Increasing read-ahead, stopping updates to file access timestamps, and adjusting the amount of memory used for caching are common tuning techniques needed to get good performance on most operating systems.

Optimization of default memory size
Tip: The default memory sizes in postgresql.conf are not optimized for performance or for any idea of a typical configuration. They are optimized solely so that the server can start on a system with low settings for the amount of shared memory it can allocate, because that situation is so common.

A handy system column to know about: ctid
Tip: ctid can still be used as a way to uniquely identify a row, even in situations where you have multiple rows with the same data in them. This provides a quick way to find a row more than once, and it can be useful for cleaning up duplicate rows from a database, too.

Don't use pg_buffercache for regular monitoring
Tip: pg_buffercache requires broad locks on parts of the buffer cache when it runs. As such, it's extremely intensive on the server when you run any of these queries. A snapshot on a daily basis or every few hours is usually enough to get a good idea how the server is using its cache, without having the monitoring itself introduce much of a load.

Loading methods
Tip: The preferred path to get a lot of data into the database is using the COPY command. This is the fastest way to insert a set of rows. If that's not practical, and you have to use INSERT instead, you should try to include as many records as possible per commit, wrapping several into a BEGIN/COMMIT block. (A short sketch of both approaches follows this list of tips.)

External loading programs
Tip: If you're importing from an external data source (a dump out of a non-PostgreSQL database, for example), you should consider a loader that saves rejected rows while continuing to work anyway, like pgloader: http://pgfoundry.org/projects/pgloader/. pgloader will not be as fast as COPY, but it's easier to work with on dirty input data, and it can handle more types of input formats too.

Tuning for bulk loads
Tip: The most important thing to do in order to speed up bulk loads is to turn off any indexes or foreign key constraints on the table. It's more efficient to build indexes in bulk and the result will be less fragmented.

Skipping WAL acceleration
Tip: The purpose of the write-ahead log is to protect you from partially committed data being left behind after a crash. If you create a new table in a transaction, add some data to it, and then commit at the end, at no point during that process is the WAL really necessary.

Parallel restore
Tip: PostgreSQL 8.4 introduced an automatic parallel restore that lets you allocate multiple CPU cores on the server to their own dedicated loading processes. In addition to loading data into more than one table at once, running the parallel pg_restore will even usefully run multiple index builds in parallel.

Post load cleanup
Tip: Your data is loaded, your indexes recreated, and your constraints active. There are two maintenance chores you should consider before putting the server back into production. The first is a must-do: make sure to run ANALYZE against all the databases. This will make sure you have useful statistics for them before queries start running.
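As promised above, here is a minimal sketch of the two loading approaches from Python. The use of psycopg2, the connection parameters, and the measurements table are all assumptions of mine for illustration:

# bulk_load.py
from StringIO import StringIO
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

# COPY streams rows in bulk, far faster than row-by-row INSERTs.
rows = StringIO("1\t20.5\n2\t21.3\n")
cur.copy_from(rows, 'measurements', columns=('id', 'temperature'))

# If COPY is impractical, batch many INSERTs into a single transaction.
cur.executemany("INSERT INTO measurements VALUES (%s, %s)",
                [(3, 19.8), (4, 22.1)])

conn.commit()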
Materialized views
Tip: One of the most effective ways to speed up queries against large data sets that are run more than once is to cache the result in a materialized view, essentially a view that is run and its output stored for future reference.

Summary

In this article we looked at some of the tips and tricks on PostgreSQL.

Further resources on this subject:
Introduction to PostgreSQL 9 [Article]
Recovery in PostgreSQL 9 [Article]
UNIX Monitoring Tool for PostgreSQL [Article]
Server Configuration Tuning in PostgreSQL [Article]

Python Text Processing with NLTK 2: Transforming Chunks and Trees

Packt
16 Dec 2010
10 min read
Python Text Processing with NLTK 2.0 Cookbook: Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities
Quickly get to grips with Natural Language Processing, with Text Analysis, Text Mining, and beyond
Learn how machines and crawlers interpret and process natural languages
Easily work with huge amounts of data and learn how to handle distributed processing
Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

Introduction

This article will show you how to do various transforms on both chunks and trees. The chunk transforms are for grammatical correction and rearranging phrases without loss of meaning. The tree transforms give you ways to modify and flatten deep parse trees.

The functions detailed in these recipes modify data, as opposed to learning from it. That means it's not safe to apply them indiscriminately. A thorough knowledge of the data you want to transform, along with a few experiments, should help you decide which functions to apply and when.

Whenever the term chunk is used in this article, it could refer to an actual chunk extracted by a chunker, or it could simply refer to a short phrase or sentence in the form of a list of tagged words. What's important in this article is what you can do with a chunk, not where it came from.

Filtering insignificant words

Many of the most commonly used words are insignificant when it comes to discerning the meaning of a phrase. For example, in the phrase "the movie was terrible", the most significant words are "movie" and "terrible", while "the" and "was" are almost useless. You could get the same meaning if you took them out, such as "movie terrible" or "terrible movie". Either way, the sentiment is the same. In this recipe, we'll learn how to remove the insignificant words, and keep the significant ones, by looking at their part-of-speech tags.

Getting ready

First, we need to decide which part-of-speech tags are significant and which are not. Looking through the treebank corpus for stopwords yields the following table of insignificant words and tags:

Word    Tag
a       DT
all     PDT
an      DT
and     CC
or      CC
that    WDT
the     DT

Other than CC, all the tags end with DT. This means we can filter out insignificant words by looking at the tag's suffix.

How to do it...

In transforms.py there is a function called filter_insignificant(). It takes a single chunk, which should be a list of tagged words, and returns a new chunk without any insignificant tagged words. It defaults to filtering out any tags that end with DT or CC.

def filter_insignificant(chunk, tag_suffixes=['DT', 'CC']):
    good = []

    for word, tag in chunk:
        ok = True

        for suffix in tag_suffixes:
            if tag.endswith(suffix):
                ok = False
                break

        if ok:
            good.append((word, tag))

    return good

Now we can use it on the part-of-speech tagged version of "the terrible movie".

>>> from transforms import filter_insignificant
>>> filter_insignificant([('the', 'DT'), ('terrible', 'JJ'), ('movie', 'NN')])
[('terrible', 'JJ'), ('movie', 'NN')]

As you can see, the word "the" is eliminated from the chunk.

How it works...

filter_insignificant() iterates over the tagged words in the chunk. For each tag, it checks if that tag ends with any of the tag_suffixes. If it does, then the tagged word is skipped. However, if the tag is ok, then the tagged word is appended to a new good chunk that is returned.

There's more...
The way filter_insignificant() is defined, you can pass in your own tag suffixes if DT and CC are not enough, or are incorrect for your case. For example, you might decide that possessive words and pronouns such as "you", "your", "their", and "theirs" are no good, but DT and CC words are ok. The tag suffixes would then be PRP and PRP$. Following is an example of this function:

>>> filter_insignificant([('your', 'PRP$'), ('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')], tag_suffixes=['PRP', 'PRP$'])
[('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')]

Filtering insignificant words can be a good complement to stopword filtering for purposes such as search engine indexing, querying, and text classification.

Correcting verb forms

It's fairly common to find incorrect verb forms in real-world language. For example, the correct form of "is our children learning?" is "are our children learning?". The verb "is" should only be used with singular nouns, while "are" is for plural nouns, such as "children". We can correct these mistakes by creating verb correction mappings that are used depending on whether there's a plural or singular noun in the chunk.

Getting ready

We first need to define the verb correction mappings in transforms.py. We'll create two mappings, one for correcting to the plural form, and another for correcting to the singular form.

plural_verb_forms = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD')
}

singular_verb_forms = {
    ('are', 'VBP'): ('is', 'VBZ'),
    ('were', 'VBD'): ('was', 'VBD')
}

Each mapping has a tagged verb that maps to another tagged verb. These initial mappings cover the basics: is to are, was to were, and vice versa.

How to do it...

In transforms.py there is a function called correct_verbs(). Pass it a chunk with incorrect verb forms, and you'll get a corrected chunk back. It uses a helper function, first_chunk_index(), to search the chunk for the position of the first tagged word where pred returns True.

def first_chunk_index(chunk, pred, start=0, step=1):
    l = len(chunk)
    end = l if step > 0 else -1

    for i in range(start, end, step):
        if pred(chunk[i]):
            return i

    return None

def correct_verbs(chunk):
    vbidx = first_chunk_index(chunk, lambda (word, tag): tag.startswith('VB'))
    # if no verb found, do nothing
    if vbidx is None:
        return chunk

    verb, vbtag = chunk[vbidx]
    nnpred = lambda (word, tag): tag.startswith('NN')
    # find nearest noun to the right of verb
    nnidx = first_chunk_index(chunk, nnpred, start=vbidx+1)
    # if no noun found to right, look to the left
    if nnidx is None:
        nnidx = first_chunk_index(chunk, nnpred, start=vbidx-1, step=-1)

    # if no noun found, do nothing
    if nnidx is None:
        return chunk

    noun, nntag = chunk[nnidx]
    # get correct verb form and insert into chunk
    if nntag.endswith('S'):
        chunk[vbidx] = plural_verb_forms.get((verb, vbtag), (verb, vbtag))
    else:
        chunk[vbidx] = singular_verb_forms.get((verb, vbtag), (verb, vbtag))

    return chunk

When we call it on a part-of-speech tagged "is our children learning" chunk, we get back the correct form, "are our children learning".

>>> from transforms import correct_verbs
>>> correct_verbs([('is', 'VBZ'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')])
[('are', 'VBP'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')]

We can also try this with a singular noun and an incorrect plural verb.
>>> correct_verbs([('our', 'PRP$'), ('child', 'NN'), ('were', 'VBD'), ('learning', 'VBG')])
[('our', 'PRP$'), ('child', 'NN'), ('was', 'VBD'), ('learning', 'VBG')]

In this case, "were" becomes "was" because "child" is a singular noun.

How it works...

The correct_verbs() function starts by looking for a verb in the chunk. If no verb is found, the chunk is returned with no changes. Once a verb is found, we keep the verb, its tag, and its index in the chunk. Then we look on either side of the verb to find the nearest noun, starting on the right, and only looking to the left if no noun is found on the right. If no noun is found at all, the chunk is returned as is. But if a noun is found, then we look up the correct verb form depending on whether or not the noun is plural. Plural nouns are tagged with NNS, while singular nouns are tagged with NN. This means we can check the plurality of a noun by seeing if its tag ends with S. Once we get the corrected verb form, it is inserted into the chunk to replace the original verb form.

To make searching through the chunk easier, we define a function called first_chunk_index(). It takes a chunk, a lambda predicate, the starting index, and a step increment. The predicate function is called with each tagged word until it returns True. If it never returns True, then None is returned. The starting index defaults to zero and the step increment to one. As you'll see in upcoming recipes, we can search backwards by overriding start and setting step to -1. This small utility function will be a key part of subsequent transform functions.

Swapping verb phrases

Swapping the words around a verb can eliminate the passive voice from particular phrases. For example, "the book was great" can be transformed into "the great book".

How to do it...

In transforms.py there is a function called swap_verb_phrase(). It swaps the right-hand side of the chunk with the left-hand side, using the verb as the pivot point. It uses the first_chunk_index() function defined in the previous recipe to find the verb to pivot around.

def swap_verb_phrase(chunk):
    # find location of verb
    vbpred = lambda (word, tag): tag != 'VBG' and tag.startswith('VB') and len(tag) > 2
    vbidx = first_chunk_index(chunk, vbpred)

    if vbidx is None:
        return chunk

    return chunk[vbidx+1:] + chunk[:vbidx]

Now we can see how it works on the part-of-speech tagged phrase "the book was great".

>>> from transforms import swap_verb_phrase
>>> swap_verb_phrase([('the', 'DT'), ('book', 'NN'), ('was', 'VBD'), ('great', 'JJ')])
[('great', 'JJ'), ('the', 'DT'), ('book', 'NN')]

The result is "great the book". This phrase clearly isn't grammatically correct, so read on to learn how to fix it.

How it works...

Using first_chunk_index() from the previous recipe, we start by finding the first matching verb that is not a gerund (a word that ends in "ing") tagged with VBG. Once we've found the verb, we return the chunk with the right side before the left, and remove the verb.

The reason we don't want to pivot around a gerund is that gerunds are commonly used to describe nouns, and pivoting around one would remove that description. Here's an example where you can see how not pivoting around a gerund is a good thing:

>>> swap_verb_phrase([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')])
[('fantastic', 'JJ'), ('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN')]

If we had pivoted around the gerund, the result would be "book is fantastic this", and we'd lose the gerund "gripping".

There's more...
Filtering insignificant words makes the final result more readable. By filtering either before or after swap_verb_phrase(), we get "fantastic gripping book" instead of "fantastic this gripping book".

>>> from transforms import swap_verb_phrase, filter_insignificant
>>> swap_verb_phrase(filter_insignificant([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')]))
[('fantastic', 'JJ'), ('gripping', 'VBG'), ('book', 'NN')]
>>> filter_insignificant(swap_verb_phrase([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')]))
[('fantastic', 'JJ'), ('gripping', 'VBG'), ('book', 'NN')]

Either way, we get a shorter grammatical chunk with no loss of meaning.
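Since a chunk is just a list of tagged words, these transforms also compose naturally with NLTK's own tagger. The following end-to-end sketch is my own addition, and it assumes NLTK 2.0 with its default tagger data installed; the tags the automatic tagger assigns may differ slightly from the hand-tagged examples above:

# chunk_pipeline.py
import nltk
from transforms import filter_insignificant

# Tokenize and tag a raw sentence, then strip the insignificant words.
tagged = nltk.pos_tag(nltk.word_tokenize("the movie was terrible"))
print filter_insignificant(tagged)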
SSIS Applications using SQL Azure

Packt
14 Dec 2010
5 min read
Microsoft SQL Azure Enterprise Application Development: Build enterprise-ready applications and projects with SQL Azure
Develop large scale enterprise applications using Microsoft SQL Azure
Understand how to use the various third party programs such as DB Artisan, RedGate, ToadSoft etc. developed for SQL Azure
Master the exhaustive data migration and data synchronization aspects of SQL Azure
Includes SQL Azure projects in incubation and more recent developments including all 2010 updates

SSIS and SSRS are not presently supported on SQL Azure. However, this is one of the future enhancements that will be implemented. While they are not supported on the Windows Azure platform, they can still be used to carry out both data integration and data reporting activities.

Moving a MySQL database to a SQL Azure database

Realizing the growing importance of MySQL and PHP from the LAMP stack, Microsoft has started providing programs to interact with and leverage these programs. For example, the SSMA described previously and third-party language hook-ups to Windows Azure are just the beginning. For small businesses who are now using MySQL and who might be contemplating a move to SQL Azure, migration of data becomes important. In the following section, we develop a SQL Server Integration Services package which, when executed, transfers a table from MySQL to SQL Azure.

Creating the package

The package consists of a dataflow task that extracts table data from MySQL (source) and transfers it to SQL Azure (destination). The dataflow task consists of an ADO.NET Source connecting to MySQL and an ADO.NET Destination connecting to SQL Azure. In the next section, the method for creating the two connections is explained.

Creating the source and destination connections

In order to create the package we need a connection to MySQL and a connection to SQL Azure. We use the ADO.NET Source and ADO.NET Destination for the flow of the data. In order to create an ADO.NET Source connection to MySQL we need to create an ODBC DSN, as we will be using the .NET ODBC Data Provider. Details of creating an ODBC DSN for the version of MySQL are described here: http://www.packtpub.com/article/mysql-linked-server-on-sql-server-2008. Configuring a Connection Manager for MySQL is described here: http://www.packtpub.com/article/mysql-data-transfer-using-sql-server-integration-servicesssis. The Connection Manager for the SQL Azure Destination uses a .NET SqlClient Data Provider; this was described in an earlier article (written when SQL Azure was in CTP, but no change is required for the RTM). The authentication information needs to be substituted for the current SQL Azure database. Note that these procedures are not repeated step-by-step, as they are described in great detail in the referenced links.
However, some key features of the configuration details are presented here. The ODBC DSN created is shown here with the details. The settings used for the MySQL Connection Manager are the following:

Provider: .NET Providers\Odbc Data Provider
Data Source Specification
Use user or system data source name: MySqlData
Login Information:
User name: root
Password: <root password>

The settings for SQL Azure are the following:

Provider: .Net Providers\SqlClient Data Provider
Server name: xxxxxxx.database.windows.net
Log on to the server:
Use SQL Server authentication
User name: mysorian
Password: ********
Connect to a database:
Select or enter database name: Bluesky (if authentication is correct, it should appear in the drop-down)

Creating the package

We begin with the source connection and, after configuring the Connection Manager, edit the source as shown in the following screenshot. You may notice that a SQL command is used rather than the name of the table. It was found, however, that choosing the name of the table results in an error. This is probably a bug, and as a workaround we use the SQL command. With this you can preview the data and verify it.

After verifying the data from the source, drag-and-drop the green dangling line from the source to the ADO.NET Destination component connected to SQL Azure. Double-clicking the destination component brings up the ADO.NET Destination Editor with the following details:

Connection manager: XXXXXXXXX.database.windows.net.Bluesky.mysorian2
Use a table or view: "dbo"."AzureEmployees"
Use Bulk Insert when possible: checked

There will be a warning message at the bottom of the screen: Map the columns on the Mappings page. The ADO.NET Destination Editor window comes up with a list of tables or views displaying one of the tables. We will be creating a new table. Clicking the New… button for the field Use a table or view brings up the Create Table window with a default CREATE TABLE statement with all the columns from the source table and a default table name, ADO.NET Destination. Modify the CREATE TABLE statement as follows:

CREATE TABLE fromMySql(
    "Id" int Primary Key Clustered,
    "Month" nvarchar(11),
    "Temperature" float,
    "RecordHigh" float
)

When you click on OK in this screen you will have completed the configuration of the destination. There are several things you can add to make troubleshooting easier, such as Data Viewers, error handling, and so on. These are omitted here, but best practices require that they should be in place when you design packages. The completed destination component should display the following details:

Connection manager: XXXXXXX.database.windows.net.Bluesky.mysorian2
Use a table or view: fromMySql
Use Bulk Insert when possible: checked

The columns from the source are all mapped to the columns of the destination, which can be verified in the Mappings page, as shown in the following screenshot. When the source and destination are completely configured as described here, you can build the project from the main menu. When you execute the project, the program starts running, and after a while both the components turn yellow and then go green, indicating that the package has executed successfully. The number of rows written to the destination also appears in the designer. You may now log on to SQL Azure in SSMS and verify that the table fromMySql has been created and that 12 rows of data from MySQL have been written into it.
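If SSMS is not at hand, the same verification can be scripted from Python. Neither pyodbc nor this exact connection string appears in the article; both are assumptions of mine, and the driver name and credentials will vary with your setup:

# verify_transfer.py
import pyodbc

conn = pyodbc.connect(
    'DRIVER={SQL Server Native Client 10.0};'
    'SERVER=xxxxxxx.database.windows.net;'
    'DATABASE=Bluesky;UID=mysorian@xxxxxxx;PWD=********')
cursor = conn.cursor()

# Count the transferred rows; 12 are expected from the example table.
cursor.execute("SELECT COUNT(*) FROM fromMySql")
print cursor.fetchone()[0]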

Python graphics: animation principles

Packt
01 Dec 2010
7 min read
Animation is about making graphic objects move smoothly around a screen. The method to create the sensation of smooth dynamic action is simple: First present a picture to the viewer's eye. Allow the image to stay in view for about one-twentieth of a second. With a minimum of delay, present another picture where objects have been shifted by a small amount and repeat the process.

Besides the obvious applications of making animated figures move around on a screen for entertainment, animating the results of computer code gives you powerful insights into how code works at a detailed level. Animation offers an extra dimension to the programmers' debugging arsenal. It provides you with an all encompassing, holistic view of software execution in progress that nothing else can.

Static shifting of a ball with Python

We make an image of a small colored disk and draw it in a sequence of different positions.

How to do it...

Execute the program shown and you will see a neat row of colored disks laid on top of each other going from top left to bottom right. The idea is to demonstrate the method of systematic position shifting.

# moveball_1.py
#>>>>>>>>>>>>>
from Tkinter import *

root = Tk()
root.title("shifted sequence")
cw = 250    # canvas width
ch = 130    # canvas height

chart_1 = Canvas(root, width=cw, height=ch, background="white")
chart_1.grid(row=0, column=0)

# The parameters determining the dimensions of the ball and its position.
# =====================================
posn_x = 1          # x position of box containing the ball (bottom)
posn_y = 1          # y position of box containing the ball (left edge)
shift_x = 3         # amount of x-movement each cycle of the 'for' loop
shift_y = 2         # amount of y-movement each cycle of the 'for' loop
ball_width = 12     # size of ball - width (x-dimension)
ball_height = 12    # size of ball - height (y-dimension)
color = "violet"    # color of the ball

for i in range(1,50):    # end the program after 50 position shifts
    posn_x += shift_x
    posn_y += shift_y
    chart_1.create_oval(posn_x, posn_y, posn_x + ball_width,
                        posn_y + ball_height, fill=color)

root.mainloop()
#>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

How it works...

A simple ball is drawn on a canvas in a sequence of steps, one on top of the other. For each step, the position of the ball is shifted by three pixels as specified by the size of shift_x. Similarly, a downward shift of two pixels is applied by an amount equal to the value of shift_y. shift_x and shift_y only specify the amount of shift, but they do not make it happen. What makes it happen are the two commands posn_x += shift_x and posn_y += shift_y. posn is the abbreviation for position. posn_x += shift_x means "take the variable posn_x and add to it an amount shift_x." It is the same as posn_x = posn_x + shift_x.

Another minor point to note is the use of the line continuation character, the backslash "\". We use this when we want to continue the same Python command onto a following line to make reading easier. Strictly speaking, for text inside brackets "(...)" this is not needed; in this particular case you can just insert a carriage return character. However, the backslash makes it clear to anyone reading your code what your intention is.

There's more...

The series of ball images in this recipe were drawn in a few microseconds. To create decent looking animation, we need to be able to slow the code execution down by just the right amount. We need to draw the equivalent of a movie frame onto the screen and keep it there for a measured time and then move on to the next, slightly shifted, image.
This is done in the next recipe.

Time-controlled shifting of a ball

Here we introduce the time-control method canvas.after(milliseconds) and the canvas.update() method that refreshes the image on the canvas. These are the cornerstones of animation in Python, and between them they control when code gets executed.

How to do it...

Execute the program as previously. What you will see is a diagonal row of disks being laid in a line, with a short delay of one-fifth of a second (200 milliseconds) between updates. The result is shown in the following screenshot, showing the ball shifting at regular intervals.

# timed_moveball_1.py
#>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
from Tkinter import *
root = Tk()
root.title("Time delayed ball drawing")
cw = 300  # canvas width
ch = 130  # canvas height
chart_1 = Canvas(root, width=cw, height=ch, background="white")
chart_1.grid(row=0, column=0)
cycle_period = 200  # time between fresh positions of the ball (milliseconds).

# The parameters determining the dimensions of the ball and its position.
posn_x = 1        # x position of box containing the ball (bottom).
posn_y = 1        # y position of box containing the ball (left edge).
shift_x = 3       # amount of x-movement each cycle of the 'for' loop.
shift_y = 3       # amount of y-movement each cycle of the 'for' loop.
ball_width = 12   # size of ball - width (x-dimension).
ball_height = 12  # size of ball - height (y-dimension).
color = "purple"  # color of the ball

for i in range(1,50):  # end the program after 50 position shifts.
    posn_x += shift_x
    posn_y += shift_y
    chart_1.create_oval(posn_x, posn_y, posn_x + ball_width, \
                        posn_y + ball_height, fill=color)
    chart_1.update()             # This refreshes the drawing on the canvas.
    chart_1.after(cycle_period)  # This makes execution pause for 200 milliseconds.

root.mainloop()

How it works...

This recipe is the same as the previous one except for the canvas.after(...) and canvas.update() methods. The first gives you some control over execution timing by allowing you to specify delays; the second forces the canvas to be completely redrawn with all the objects that should be there. There are more complicated ways of refreshing only portions of the screen, but they create difficulties, so they will not be dealt with here.

The canvas.after(your-chosen-milliseconds) method simply causes a timed pause in the execution of the code. All the preceding code is executed as fast as the computer can manage; when the pause invoked by the canvas.after() method is encountered, execution is simply suspended for the specified number of milliseconds. At the end of the pause, execution continues as if nothing had happened. The canvas.update() method forces everything on the canvas to be redrawn immediately, rather than waiting for some unspecified event to cause the canvas to be refreshed.

There's more...

The next step in effective animation is to erase the previous image of the object being animated shortly before a fresh, shifted clone is drawn on the canvas. This happens in the next example.

The robustness of Tkinter

It is also worth noting that Tkinter is robust. When you give position coordinates that are off the canvas, Python does not crash or freeze; it simply carries on drawing the object 'off the page'. The Tkinter canvas can be seen as just a tiny window into an almost unlimited universe of visual space.
We only see objects when they move into the view of the camera, which is the Tkinter canvas.
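Before moving on, it is worth isolating the draw / update / pause / erase idiom that the recipes above rely on. Below is a minimal sketch of that idiom; it is not the book's code, and the file name and numeric values are illustrative assumptions. The loop deliberately runs the disk past the right-hand edge of the canvas to demonstrate the off-canvas robustness just described.

# animation_idiom_sketch.py
# A minimal sketch of the draw / update / pause / erase cycle,
# assuming Python 2.x with Tkinter. All numbers are illustrative.
from Tkinter import *

root = Tk()
root.title("update-pause-erase sketch")
canvas_1 = Canvas(root, width=200, height=100, background="white")
canvas_1.grid(row=0, column=0)

posn_x = 10
for i in range(1, 100):      # runs well past the right edge on purpose
    posn_x += 3
    canvas_1.create_oval(posn_x, 40, posn_x + 12, 52, fill="red")
    canvas_1.update()        # draw this frame now
    canvas_1.after(50)       # hold it for 50 milliseconds
    canvas_1.delete(ALL)     # wipe the canvas for the next frame

root.mainloop()

The recipes that follow are all elaborations of this skeleton; only the objects drawn between update() calls change.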
Animating Graphic Objects using Python

Packt
01 Dec 2010
9 min read
Python 2.6 Graphics Cookbook: Over 100 great recipes for creating and animating graphics using Python. Create captivating graphics with ease and bring them to life using Python. Apply effects to your graphics using powerful Python methods. Develop vector as well as raster graphics and combine them to create wonders in the animation world. Create interactive GUIs to make your creation of graphics simpler. Part of Packt's Cookbook series: each recipe is a carefully organized sequence of instructions to accomplish the task of creation and animation of graphics as efficiently as possible.

Precise collisions using floating point numbers

Here the simulation flaws caused by the coarseness of integer arithmetic are eliminated by using floating point numbers for all ball position calculations.

How to do it...

All position, velocity, and gravity variables are made floating point by writing them with explicit decimal points. The result is shown in the following screenshot, showing the bouncing balls with trajectory tracing.

from Tkinter import *
root = Tk()
root.title("Collisions with Floating point")
cw = 350  # canvas width
ch = 200  # canvas height
GRAVITY = 1.5
chart_1 = Canvas(root, width=cw, height=ch, background="black")
chart_1.grid(row=0, column=0)
cycle_period = 80   # Time between new positions of the ball (milliseconds).
time_scaling = 0.2  # This governs the size of the differential steps
                    # when calculating changes in position.

# The parameters determining the dimensions of the ball and its position.
ball_1 = {'posn_x':25.0,            # x position of box containing the ball (bottom).
          'posn_y':180.0,           # y position of box containing the ball (left edge).
          'velocity_x':30.0,        # amount of x-movement each cycle of the 'for' loop.
          'velocity_y':100.0,       # amount of y-movement each cycle of the 'for' loop.
          'ball_width':20.0,        # size of ball - width (x-dimension).
          'ball_height':20.0,       # size of ball - height (y-dimension).
          'color':"dark orange",    # color of the ball
          'coef_restitution':0.90}  # proportion of elastic energy recovered each bounce

ball_2 = {'posn_x':cw - 25.0,
          'posn_y':300.0,
          'velocity_x':-50.0,
          'velocity_y':150.0,
          'ball_width':30.0,
          'ball_height':30.0,
          'color':"yellow3",
          'coef_restitution':0.90}

def detectWallCollision(ball):
    # Collision detection with the walls of the container.
    if ball['posn_x'] > cw - ball['ball_width']:  # Collision with right-hand wall.
        ball['velocity_x'] = -ball['velocity_x'] * ball['coef_restitution']  # reverse direction.
        ball['posn_x'] = cw - ball['ball_width']
    if ball['posn_x'] < 1:  # Collision with left-hand wall.
        ball['velocity_x'] = -ball['velocity_x'] * ball['coef_restitution']
        ball['posn_x'] = 2  # anti-stick to the wall
    if ball['posn_y'] < ball['ball_height']:  # Collision with ceiling.
        ball['velocity_y'] = -ball['velocity_y'] * ball['coef_restitution']
        ball['posn_y'] = ball['ball_height']
    if ball['posn_y'] > ch - ball['ball_height']:  # Floor collision.
        ball['velocity_y'] = -ball['velocity_y'] * ball['coef_restitution']
        ball['posn_y'] = ch - ball['ball_height']

def diffEquation(ball):
    # An approximate set of differential equations of motion for the balls.
    ball['posn_x'] += ball['velocity_x'] * time_scaling
    ball['velocity_y'] = ball['velocity_y'] + GRAVITY  # a crude equation incorporating gravity.
    ball['posn_y'] += ball['velocity_y'] * time_scaling
    chart_1.create_oval(ball['posn_x'], ball['posn_y'],
                        ball['posn_x'] + ball['ball_width'],
                        ball['posn_y'] + ball['ball_height'],
                        fill=ball['color'])
    detectWallCollision(ball)  # Has the ball collided with any container wall?

for i in range(1,2000):  # end the program after 2000 position shifts.
    diffEquation(ball_1)
    diffEquation(ball_2)
    chart_1.update()             # This refreshes the drawing on the canvas.
    chart_1.after(cycle_period)  # This makes execution pause for 80 milliseconds.
    chart_1.delete(ALL)          # This erases everything on the canvas.

root.mainloop()

How it works...

The use of precise arithmetic has allowed us to notice simulation behavior that was previously hidden by the sins of integer-only calculation. This is the unique value of graphic simulation as a debugging tool: if you can represent your ideas in a visual way rather than as lists of numbers, you will easily pick up subtle quirks in your code. The human brain is designed to function best on graphic images; it is a direct consequence of being a hunter.

A graphic debugging tool...

There is another very handy trick in the software debugger's arsenal: the visual trace. A trace is some kind of visual trail that shows the history of dynamic behavior. All of this is revealed in the next example.

Trajectory tracing and ball-to-ball collisions

Now we introduce one of the more difficult behaviors in our simulation of ever-increasing complexity: the mid-air collision. The hardest thing when you are debugging a program is trying to hold some recently observed behavior in short-term memory and compare it meaningfully with present behavior. This kind of memory is an imperfect recorder. The way to overcome this is to create a graphic form of memory: some sort of picture that shows accurately what has been happening in the past. In the same way that military cannon aimers use glowing tracer projectiles to adjust their aim, a graphic programmer can use trajectory traces to examine the history of execution.

How to do it...

In our new code there is a new function called detect_ball_collision(ball_1, ball_2) whose job is to anticipate imminent collisions between the two balls, no matter where they are. Collisions can come from any direction, so we need to be able to test all possible collision scenarios and verify that each one behaves as planned. This is too difficult unless we create tools to test the outcomes. In this recipe, the tool for testing outcomes is a graphic trajectory trace: a line that trails behind the path of the ball and shows exactly where it went, right from the beginning of the simulation. The result is shown in the following screenshot, showing the bouncing with ball-to-ball collision rebounds.

# kinetic_gravity_balls_1.py
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
from Tkinter import *
import math
root = Tk()
root.title("Balls bounce off each other")
cw = 300  # canvas width
ch = 200  # canvas height
GRAVITY = 1.5
chart_1 = Canvas(root, width=cw, height=ch, background="white")
chart_1.grid(row=0, column=0)
cycle_period = 80   # Time between new positions of the ball (milliseconds).
time_scaling = 0.2  # The size of the differential steps.

# The parameters determining the dimensions of the ball and its position.
ball_1 = {'posn_x':25.0,
          'posn_y':25.0,
          'velocity_x':65.0,
          'velocity_y':50.0,
          'ball_width':20.0,
          'ball_height':20.0,
          'color':"SlateBlue1",
          'coef_restitution':0.90}

ball_2 = {'posn_x':180.0,
          'posn_y':ch - 25.0,
          'velocity_x':-50.0,
          'velocity_y':-70.0,
          'ball_width':30.0,
          'ball_height':30.0,
          'color':"maroon1",
          'coef_restitution':0.90}

def detect_wall_collision(ball):
    # detect ball-to-wall collision
    if ball['posn_x'] > cw - ball['ball_width']:  # Right-hand wall.
        ball['velocity_x'] = -ball['velocity_x'] * ball['coef_restitution']
        ball['posn_x'] = cw - ball['ball_width']
    if ball['posn_x'] < 1:  # Left-hand wall.
        ball['velocity_x'] = -ball['velocity_x'] * ball['coef_restitution']
        ball['posn_x'] = 2
    if ball['posn_y'] < ball['ball_height']:  # Ceiling.
        ball['velocity_y'] = -ball['velocity_y'] * ball['coef_restitution']
        ball['posn_y'] = ball['ball_height']
    if ball['posn_y'] > ch - ball['ball_height']:  # Floor.
        ball['velocity_y'] = -ball['velocity_y'] * ball['coef_restitution']
        ball['posn_y'] = ch - ball['ball_height']

def detect_ball_collision(ball_1, ball_2):
    # detect ball-to-ball collision
    # firstly: is there a close approach in the horizontal direction?
    if math.fabs(ball_1['posn_x'] - ball_2['posn_x']) < 25:
        # secondly: is there also a close approach in the vertical direction?
        if math.fabs(ball_1['posn_y'] - ball_2['posn_y']) < 25:
            ball_1['velocity_x'] = -ball_1['velocity_x']  # reverse direction.
            ball_1['velocity_y'] = -ball_1['velocity_y']
            ball_2['velocity_x'] = -ball_2['velocity_x']
            ball_2['velocity_y'] = -ball_2['velocity_y']
            # to avoid internal rebounding inside balls
            ball_1['posn_x'] += ball_1['velocity_x'] * time_scaling
            ball_1['posn_y'] += ball_1['velocity_y'] * time_scaling
            ball_2['posn_x'] += ball_2['velocity_x'] * time_scaling
            ball_2['posn_y'] += ball_2['velocity_y'] * time_scaling

def diff_equation(ball):
    x_old = ball['posn_x']
    y_old = ball['posn_y']
    ball['posn_x'] += ball['velocity_x'] * time_scaling
    ball['velocity_y'] = ball['velocity_y'] + GRAVITY
    ball['posn_y'] += ball['velocity_y'] * time_scaling
    chart_1.create_oval(ball['posn_x'], ball['posn_y'],
                        ball['posn_x'] + ball['ball_width'],
                        ball['posn_y'] + ball['ball_height'],
                        fill=ball['color'], tags="ball_tag")
    chart_1.create_line(x_old, y_old, ball['posn_x'], ball['posn_y'],
                        fill=ball['color'])
    detect_wall_collision(ball)  # Has the ball collided with any container wall?

for i in range(1,5000):
    diff_equation(ball_1)
    diff_equation(ball_2)
    detect_ball_collision(ball_1, ball_2)
    chart_1.update()
    chart_1.after(cycle_period)
    chart_1.delete("ball_tag")  # Erase the balls but leave the trajectories.

root.mainloop()

How it works...

Mid-air ball-against-ball collisions are detected in two steps. In the first step, we test whether the two balls are close to each other inside a vertical strip defined by if math.fabs(ball_1['posn_x'] - ball_2['posn_x']) < 25. In plain English, this asks "Is the horizontal distance between the balls less than 25 pixels?" If the answer is yes, then the region of examination is narrowed down to a vertical distance of less than 25 pixels by the statement if math.fabs(ball_1['posn_y'] - ball_2['posn_y']) < 25. So every time the loop is executed, we sweep the entire canvas to see whether the two balls are both inside an area where their bottom-left corners are closer than 25 pixels to each other. If they are that close, we simply cause a rebound by reversing their directions of travel, both horizontal and vertical.

There's more...
Simply reversing the direction of travel is not the mathematically correct way to model colliding balls; billiard balls certainly do not behave that way. The law of physics that governs colliding spheres demands that momentum be conserved. A sketch of what a momentum-conserving rebound can look like is given at the end of this section.

Why do we sometimes get Tkinter TclErrors?

If we click the close-window button (the X in the top right) while Python is paused, then when Python revives and calls on Tcl (Tkinter) to draw something on the canvas, we will get an error message. What probably happens is that the application has already shut down, but Tcl has unfinished business. If we allow the program to run to completion before trying to close the window, termination is orderly.
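As promised above, here is a minimal sketch of a momentum-conserving rebound. This is not the book's code: the elastic_rebound name, the mass parameters m1 and m2, and the per-axis treatment are all illustrative assumptions. The one-dimensional formulas used are exact only for head-on impacts; a full treatment would first project the velocities onto the line joining the ball centers.

# elastic_rebound_sketch.py
# A hedged sketch of a momentum-conserving collision response for two
# of the ball dictionaries used in this recipe. Masses m1 and m2 are
# illustrative parameters, not part of the original code.

def elastic_rebound(ball_1, ball_2, m1=1.0, m2=1.0):
    # Standard 1D elastic-collision formulas, applied to each velocity
    # component separately. This conserves momentum and kinetic energy
    # for head-on impacts; oblique impacts would need the velocities
    # projected onto the line of centers first.
    for axis in ('velocity_x', 'velocity_y'):
        u1, u2 = ball_1[axis], ball_2[axis]
        ball_1[axis] = ((m1 - m2) * u1 + 2.0 * m2 * u2) / (m1 + m2)
        ball_2[axis] = ((m2 - m1) * u2 + 2.0 * m1 * u1) / (m1 + m2)

For equal masses the formulas reduce to a simple exchange of velocities, so calling elastic_rebound(ball_1, ball_2) inside detect_ball_collision(), in place of the four sign reversals, makes the rebound physically plausible.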
Python Graphics: Combining Raster and Vector Pictures

Packt
23 Nov 2010
12 min read
Because we are not altering and manipulating the actual properties of the images, we do not need the Python Imaging Library (PIL) in this chapter. We need to work exclusively with GIF format images, because that is what Tkinter deals with. We will also see how to use "The GIMP" as a tool to prepare images suitable for animation.

Simple animation of a GIF beach ball

We want to animate a raster image, derived from a photograph. To keep things simple and clear, we are just going to move a photographic image (in GIF format) of a beach ball across a black background.

Getting ready

We need a suitable GIF image of an object that we want to animate. An example of one, named beachball.gif, has been provided.

How to do it...

Copy a .gif file from somewhere and paste it into a directory where you want to keep your work-in-progress pictures. Ensure that the path in your computer's file system leads to the image to be used. In the example below, the instruction ball = PhotoImage(file = "/constr/pics2/beachball.gif") says that the image to be used will be found in a directory (folder) called pics2, which is a sub-folder of another folder called constr. Then execute the following code.

# photoimage_animation_1.py
#>>>>>>>>>>>>>>>>>>>>>>>>
from Tkinter import *
root = Tk()
cycle_period = 100
cw = 300  # canvas width
ch = 200  # canvas height
canvas_1 = Canvas(root, width=cw, height=ch, bg="black")
canvas_1.grid(row=0, column=1)
posn_x = 10
posn_y = 10
shift_x = 2
shift_y = 1
ball = PhotoImage(file = "/constr/pics2/beachball.gif")

for i in range(1,100):  # end the program after 100 position shifts.
    posn_x += shift_x
    posn_y += shift_y
    canvas_1.create_image(posn_x, posn_y, anchor=NW, image=ball)
    canvas_1.update()             # This refreshes the drawing on the canvas.
    canvas_1.after(cycle_period)  # This makes execution pause for 100 milliseconds.
    canvas_1.delete(ALL)          # This erases everything on the canvas.

root.mainloop()

How it works...

The image of the beach ball is shifted across a canvas. Photo images always occupy a rectangular area of the screen. The size of this box, called the bounding box, is the size of the image. We have used a black background, so the black corners on the image of our beach ball cannot be seen.

The vector walking creature

We make a pair of walking legs using vector graphics. We want to use these legs together with pieces of raster images and see how far we can go in making appealing animations. We import the Tkinter, math, and time modules. The math module is needed to provide the trigonometry that sustains the geometric relations that move the parts of the leg in relation to each other.

Getting ready

We will be using the Tkinter and time modules to animate the movement of lines and circles. You will see some trigonometry in the code. If you do not like mathematics, you can just cut and paste the code without needing to understand exactly how the maths works.
However, if you are a friend of mathematics, it is fun to watch sine, cosine, and tangent working together to make a child smile.

How to do it...

Execute the program shown below.

# walking_creature_1.py
# >>>>>>>>>>>>>>>>
from Tkinter import *
import math
import time
root = Tk()
root.title("The thing that Strides")
cw = 400  # canvas width
ch = 100  # canvas height
#GRAVITY = 4
chart_1 = Canvas(root, width=cw, height=ch, background="white")
chart_1.grid(row=0, column=0)
cycle_period = 100  # time between new positions of the legs (milliseconds).
base_x = 20
base_y = 100
hip_h = 40
thy = 20
#===============================================
# Hip positions: Nhip = 2 x Nstep, the number of steps per foot per stride.
hip_x = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 60, 60]  #15
hip_y = [0, 8, 12, 16, 12, 8, 0, 0, 0, 8, 12, 16, 12, 8, 0]         #15
step_x = [0, 10, 20, 30, 40, 50, 60, 60]  # 8 = Nhip
step_y = [0, 35, 45, 50, 43, 32, 10, 0]
# The merging of the separate x and y lists into a single sequence.
#==================================
# Given a line joining two points xy0 and xy1, the base of an isosceles triangle,
# as well as the length of one side, "thy", this returns the coordinates of
# the apex joining the equal-length sides.
def kneePosition(x0, y0, x1, y1, thy):
    theta_1 = math.atan2((y1 - y0), (x1 - x0))
    L1 = math.sqrt((y1 - y0)**2 + (x1 - x0)**2)
    if L1/2 < thy:
        # The sign of alpha determines which way the knees bend.
        alpha = -math.acos(L1/(2*thy))  # Avian
        #alpha = math.acos(L1/(2*thy))  # Mammalian
    else:
        alpha = 0.0
    theta_2 = alpha + theta_1
    x_knee = x0 + thy * math.cos(theta_2)
    y_knee = y0 + thy * math.sin(theta_2)
    return x_knee, y_knee

def animdelay():
    chart_1.update()             # This refreshes the drawing on the canvas.
    chart_1.after(cycle_period)  # This makes execution pause for 100 milliseconds.
    chart_1.delete(ALL)          # This erases *almost* everything on the canvas.
                                 # Does not delete the text from inside a function.

bx_stay = base_x
by_stay = base_y
for j in range(0,11):  # Number of steps to be taken - arbitrary.
    astep_x = 60*j
    bstep_x = astep_x + 30
    cstep_x = 60*j + 15
    aa = len(step_x) - 1
    for k in range(0,len(hip_x)-1):  # Motion of the hips in a stride of each foot.
        cx0 = base_x + cstep_x + hip_x[k]
        cy0 = base_y - hip_h - hip_y[k]
        cx1 = base_x + cstep_x + hip_x[k+1]
        cy1 = base_y - hip_h - hip_y[k+1]
        chart_1.create_line(cx0, cy0, cx1, cy1)
        chart_1.create_oval(cx1-10, cy1-10, cx1+10, cy1+10, fill="orange")
        if k >= 0 and k <= len(step_x)-2:
            # Trajectory of the right foot.
            ax0 = base_x + astep_x + step_x[k]
            ax1 = base_x + astep_x + step_x[k+1]
            ay0 = base_y - step_y[k]
            ay1 = base_y - step_y[k+1]
            ax_stay = ax1
            ay_stay = ay1
        if k >= len(step_x)-1 and k <= 2*len(step_x)-2:
            # Trajectory of the left foot.
            bx0 = base_x + bstep_x + step_x[k-aa]
            bx1 = base_x + bstep_x + step_x[k-aa+1]
            by0 = base_y - step_y[k-aa]
            by1 = base_y - step_y[k-aa+1]
            bx_stay = bx1
            by_stay = by1
        aknee_xy = kneePosition(ax_stay, ay_stay, cx1, cy1, thy)
        chart_1.create_line(ax_stay, ay_stay, aknee_xy[0], aknee_xy[1],
                            width=3, fill="orange")
        chart_1.create_line(cx1, cy1, aknee_xy[0], aknee_xy[1],
                            width=3, fill="orange")
        chart_1.create_oval(ax_stay-5, ay1-5, ax1+5, ay1+5, fill="green")
        chart_1.create_oval(bx_stay-5, by_stay-5, bx_stay+5, by_stay+5, fill="blue")
        bknee_xy = kneePosition(bx_stay, by_stay, cx1, cy1, thy)
        chart_1.create_line(bx_stay, by_stay, bknee_xy[0], bknee_xy[1],
                            width=3, fill="pink")
        chart_1.create_line(cx1, cy1, bknee_xy[0], bknee_xy[1],
                            width=3, fill="pink")
        animdelay()

root.mainloop()

How it works...

Without getting bogged down in detail, the strategy in the program consists of defining the motion of a foot while walking one stride. This motion is defined by eight relative positions given by the two lists step_x (horizontal) and step_y (vertical). The motion of the hips is given by a separate pair of x- and y-positions, hip_x and hip_y. Trigonometry is used to work out the position of the knee, on the assumption that the thigh and the lower leg are the same length. The calculation is based on the sine rule taught in high school. Yes, we do learn useful things at school!

The time-animation regulation instructions are assembled together as a function animdelay().

There's more...

In the Python math module, two arc-tangent functions are available for calculating angles given the lengths of two adjacent sides. atan2(y, x) is the better one because it takes care of the crazy things a tangent does on its way around a circle: the tangent flicks from minus infinity to plus infinity as it passes through 90 degrees and odd multiples thereof.

A mathematical knee is quite happy to bend forward or backward in satisfying its equations. We make the sign of the angle negative for a backward-bending bird knee and positive for a forward-bending mammalian knee.

This animated walking hips-and-legs rig is used in the recipes that follow to make a bird walk in the desert, a diplomat in palace grounds, and a spider in a forest.

Bird with shoes walking in the Karroo

We now coordinate the movement of four GIF images and the striding legs to make an Apteryx (a flightless bird like the kiwi) that walks.

Getting ready

We need the following GIF images:

A background picture of a suitable landscape
A bird body without legs
A pair of garish-colored shoes to make the viewer smile
The walking avian legs of the previous recipe

The images used are karroo.gif, apteryx1.gif, and shoe1.gif. Note that the images of the bird and the shoe have transparent backgrounds, which means there is no rectangular background to be seen surrounding the bird or the shoe. In the recipe following this one, we will see the simplest way to achieve the necessary transparency.

How to do it...

Execute the program shown in the usual way.

# walking_birdy_1.py
# >>>>>>>>>>>>>>>>
from Tkinter import *
import math
import time
root = Tk()
root.title("A Walking birdy gif and shoes images")
cw = 800  # canvas width
ch = 200  # canvas height
#GRAVITY = 4
chart_1 = Canvas(root, width=cw, height=ch, background="white")
chart_1.grid(row=0, column=0)
cycle_period = 80  # time between new positions of the legs (milliseconds).
im_backdrop = "/constr/pics1/karoo.gif"
im_bird = "/constr/pics1/apteryx1.gif"
im_shoe = "/constr/pics1/shoe1.gif"
birdy = PhotoImage(file=im_bird)
shoey = PhotoImage(file=im_shoe)
backdrop = PhotoImage(file=im_backdrop)
chart_1.create_image(0, 0, anchor=NW, image=backdrop)
base_x = 20
base_y = 190
hip_h = 70
thy = 60
#==========================================
# Hip positions: Nhip = 2 x Nstep, the number of steps per foot per stride.
hip_x = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 60, 60]  #15
hip_y = [0, 8, 12, 16, 12, 8, 0, 0, 0, 8, 12, 16, 12, 8, 0]         #15
step_x = [0, 10, 20, 30, 40, 50, 60, 60]  # 8 = Nhip
step_y = [0, 35, 45, 50, 43, 32, 10, 0]
#=============================================
# Given a line joining two points xy0 and xy1, the base of an isosceles triangle,
# as well as the length of one side, "thy", this returns the coordinates of
# the apex joining the equal-length sides.
def kneePosition(x0, y0, x1, y1, thy):
    theta_1 = math.atan2(-(y1 - y0), (x1 - x0))
    L1 = math.sqrt((y1 - y0)**2 + (x1 - x0)**2)
    alpha = math.atan2(hip_h, L1)
    theta_2 = -(theta_1 - alpha)
    x_knee = x0 + thy * math.cos(theta_2)
    y_knee = y0 + thy * math.sin(theta_2)
    return x_knee, y_knee

def animdelay():
    chart_1.update()             # Refresh the drawing on the canvas.
    chart_1.after(cycle_period)  # Pause execution for cycle_period milliseconds.
    chart_1.delete("walking")    # Erase everything tagged "walking" on the canvas.

bx_stay = base_x
by_stay = base_y
for j in range(0,13):  # Number of steps to be taken - arbitrary.
    astep_x = 60*j
    bstep_x = astep_x + 30
    cstep_x = 60*j + 15
    aa = len(step_x) - 1
    for k in range(0,len(hip_x)-1):  # Motion of the hips in a stride of each foot.
        cx0 = base_x + cstep_x + hip_x[k]
        cy0 = base_y - hip_h - hip_y[k]
        cx1 = base_x + cstep_x + hip_x[k+1]
        cy1 = base_y - hip_h - hip_y[k+1]
        #chart_1.create_image(cx1-55, cy1+20, anchor=SW, image=birdy, tag="walking")
        if k >= 0 and k <= len(step_x)-2:
            # Trajectory of the right foot.
            ax0 = base_x + astep_x + step_x[k]
            ax1 = base_x + astep_x + step_x[k+1]
            ay0 = base_y - 10 - step_y[k]
            ay1 = base_y - 10 - step_y[k+1]
            ax_stay = ax1
            ay_stay = ay1
        if k >= len(step_x)-1 and k <= 2*len(step_x)-2:
            # Trajectory of the left foot.
            bx0 = base_x + bstep_x + step_x[k-aa]
            bx1 = base_x + bstep_x + step_x[k-aa+1]
            by0 = base_y - 10 - step_y[k-aa]
            by1 = base_y - 10 - step_y[k-aa+1]
            bx_stay = bx1
            by_stay = by1
        chart_1.create_image(ax_stay-5, ay_stay + 10, anchor=SW,
                             image=shoey, tag="walking")
        chart_1.create_image(bx_stay-5, by_stay + 10, anchor=SW,
                             image=shoey, tag="walking")
        aknee_xy = kneePosition(ax_stay, ay_stay, cx1, cy1, thy)
        chart_1.create_line(ax_stay, ay_stay-15, aknee_xy[0], aknee_xy[1],
                            width=5, fill="orange", tag="walking")
        chart_1.create_line(cx1, cy1, aknee_xy[0], aknee_xy[1],
                            width=5, fill="orange", tag="walking")
        bknee_xy = kneePosition(bx_stay, by_stay, cx1, cy1, thy)
        chart_1.create_line(bx_stay, by_stay-15, bknee_xy[0], bknee_xy[1],
                            width=5, fill="pink", tag="walking")
        chart_1.create_line(cx1, cy1, bknee_xy[0], bknee_xy[1],
                            width=5, fill="pink", tag="walking")
        chart_1.create_image(cx1-55, cy1+20, anchor=SW, image=birdy, tag="walking")
        animdelay()

root.mainloop()

How it works...

The same remarks concerning the trigonometry made in the previous recipe apply here. What we see here now is the ease with which vector objects and raster images can be combined once suitable GIF images have been prepared.

There's more...
For teachers and their students who want to make lessons on a computer, these techniques offer all kinds of possibilities: history tours and re-enactments, geography tours, and science experiments. Get the students to do projects telling stories. Animated yearbooks?
Parsing Specific Data in Python Text Processing

Packt
23 Nov 2010
12 min read
Python Text Processing with NLTK 2.0 Cookbook: Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities. Quickly get to grips with Natural Language Processing, with text analysis, text mining, and beyond. Learn how machines and crawlers interpret and process natural languages. Easily work with huge amounts of data and learn how to handle distributed processing. Part of Packt's Cookbook series: each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible.

Introduction

This article covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Luckily, there are a number of useful libraries for accomplishing this, so we don't have to delve into tricky and overly complicated regular expressions. These libraries can be great complements to the NLTK:

dateutil: provides date/time parsing and time zone conversion
timex: can identify time words in text
lxml and BeautifulSoup: can parse, clean, and convert HTML
chardet: detects the character encoding of text

These libraries can be useful for pre-processing text before passing it to an NLTK object, or for post-processing text that has been processed and extracted using NLTK. Here's an example that ties many of these tools together.

Let's say you need to parse a blog article about a restaurant. You can use lxml or BeautifulSoup to extract the article text, outbound links, and the date and time when the article was written. The date and time can then be parsed to a Python datetime object with dateutil. Once you have the article text, you can use chardet to ensure it's UTF-8 before cleaning out the HTML and running it through NLTK-based part-of-speech tagging, chunk extraction, and/or text classification, to create additional metadata about the article. If there's an event happening at the restaurant, you may be able to discover that by looking at the time words identified by timex. The point of this example is that real-world text processing often requires more than just NLTK-based natural language processing, and the functionality covered in this article can help with those additional requirements.

Parsing dates and times with dateutil

If you need to parse dates and times in Python, there is no better library than dateutil. The parser module can parse datetime strings in many more formats than can be shown here, while the tz module provides everything you need for looking up time zones. Combined, these modules make it quite easy to parse strings into time zone aware datetime objects.

Getting ready

You can install dateutil using pip or easy_install, that is, sudo pip install dateutil or sudo easy_install dateutil. Complete documentation can be found at http://labix.org/python-dateutil

How to do it...

Let's dive into a few parsing examples:

>>> from dateutil import parser
>>> parser.parse('Thu Sep 25 10:36:28 2010')
datetime.datetime(2010, 9, 25, 10, 36, 28)
>>> parser.parse('Thursday, 25. September 2010 10:36AM')
datetime.datetime(2010, 9, 25, 10, 36)
>>> parser.parse('9/25/2010 10:36:28')
datetime.datetime(2010, 9, 25, 10, 36, 28)
>>> parser.parse('9/25/2010')
datetime.datetime(2010, 9, 25, 0, 0)
>>> parser.parse('2010-09-25T10:36:28Z')
datetime.datetime(2010, 9, 25, 10, 36, 28, tzinfo=tzutc())

As you can see, all it takes is importing the parser module and calling the parse() function with a datetime string. The parser will do its best to return a sensible datetime object, but if it cannot parse the string, it will raise a ValueError.

How it works...
The parser does not use regular expressions. Instead, it looks for recognizable tokens and does its best to guess what those tokens refer to. The order of these tokens matters: for example, some cultures use a date format that looks like Month/Day/Year (the default order), while others use a Day/Month/Year format. To deal with this, the parse() function takes an optional keyword argument, dayfirst, which defaults to False. If you set it to True, it can correctly parse dates in the latter format.

>>> parser.parse('25/9/2010', dayfirst=True)
datetime.datetime(2010, 9, 25, 0, 0)

Another ordering issue can occur with two-digit years. For example, '10-9-25' is ambiguous. Since dateutil defaults to the Month-Day-Year format, '10-9-25' is parsed to the year 2025. But if you pass yearfirst=True into parse(), it will be parsed to the year 2010.

>>> parser.parse('10-9-25')
datetime.datetime(2025, 10, 9, 0, 0)
>>> parser.parse('10-9-25', yearfirst=True)
datetime.datetime(2010, 9, 25, 0, 0)

There's more...

The dateutil parser can also do fuzzy parsing, which allows it to ignore extraneous characters in a datetime string. With the default value of False, parse() will raise a ValueError when it encounters unknown tokens. But if fuzzy=True, then a datetime object can usually be returned.

>>> try:
...     parser.parse('9/25/2010 at about 10:36AM')
... except ValueError:
...     'cannot parse'
'cannot parse'
>>> parser.parse('9/25/2010 at about 10:36AM', fuzzy=True)
datetime.datetime(2010, 9, 25, 10, 36)

Time zone lookup and conversion

Most datetime objects returned from the dateutil parser are naive, meaning they don't have an explicit tzinfo, which specifies the time zone and UTC offset. In the previous recipe, only one of the examples had a tzinfo, and that is because it is in the standard ISO format for UTC date and time strings. UTC is Coordinated Universal Time, and is the same as GMT. ISO is the International Standards Organization, which, among other things, specifies standard date and time formatting.

Python datetime objects can be either naive or aware. If a datetime object has a tzinfo, then it is aware; otherwise, it is naive. To make a naive datetime object time zone aware, you must give it an explicit tzinfo. However, the Python datetime library only defines an abstract base class for tzinfo, and leaves it up to others to actually implement tzinfo creation. This is where the tz module of dateutil comes in: it provides everything you need to look up time zones from your OS time zone data.

Getting ready

dateutil should be installed using pip or easy_install. You should also make sure your operating system has time zone data. On Linux, this is usually found in /usr/share/zoneinfo, and the Ubuntu package is called tzdata. If you have a number of files and directories in /usr/share/zoneinfo, such as America/, Europe/, and so on, then you should be ready to proceed. The following examples show directory paths for Ubuntu Linux.

How to do it...

Let's start by getting a UTC tzinfo object. This can be done by calling tz.tzutc(), and you can check that the offset is 0 by calling the utcoffset() method with a UTC datetime object.

>>> from dateutil import tz
>>> tz.tzutc()
tzutc()
>>> import datetime
>>> tz.tzutc().utcoffset(datetime.datetime.utcnow())
datetime.timedelta(0)

To get tzinfo objects for other time zones, you can pass a time zone file path to the gettz() function.
>>> tz.gettz('US/Pacific')
tzfile('/usr/share/zoneinfo/US/Pacific')
>>> tz.gettz('US/Pacific').utcoffset(datetime.datetime.utcnow())
datetime.timedelta(-1, 61200)
>>> tz.gettz('Europe/Paris')
tzfile('/usr/share/zoneinfo/Europe/Paris')
>>> tz.gettz('Europe/Paris').utcoffset(datetime.datetime.utcnow())
datetime.timedelta(0, 7200)

You can see that the UTC offsets are timedelta objects, where the first number is days and the second number is seconds.

If you're storing datetimes in a database, it's a good idea to store them all in UTC to eliminate any time zone ambiguity. Even if the database can recognize time zones, it's still a good practice.

To convert a non-UTC datetime object to UTC, it must be made time zone aware. If you try to convert a naive datetime to UTC, you'll get a ValueError exception. To make a naive datetime time zone aware, you simply call the replace() method with the correct tzinfo. Once a datetime object has a tzinfo, UTC conversion can be performed by calling the astimezone() method with tz.tzutc().

>>> pst = tz.gettz('US/Pacific')
>>> dt = datetime.datetime(2010, 9, 25, 10, 36)
>>> dt.tzinfo
>>> dt.astimezone(tz.tzutc())
Traceback (most recent call last):
  File "/usr/lib/python2.6/doctest.py", line 1248, in __run
    compileflags, 1) in test.globs
  File "<doctest __main__[22]>", line 1, in <module>
    dt.astimezone(tz.tzutc())
ValueError: astimezone() cannot be applied to a naive datetime
>>> dt.replace(tzinfo=pst)
datetime.datetime(2010, 9, 25, 10, 36, tzinfo=tzfile('/usr/share/zoneinfo/US/Pacific'))
>>> dt.replace(tzinfo=pst).astimezone(tz.tzutc())
datetime.datetime(2010, 9, 25, 17, 36, tzinfo=tzutc())

How it works...

The tzutc and tzfile objects are both subclasses of tzinfo. As such, they know the correct UTC offset for time zone conversion (which is 0 for tzutc). A tzfile object knows how to read your operating system's zoneinfo files to get the necessary offset data. The replace() method of a datetime object does what its name implies: it replaces attributes. Once a datetime has a tzinfo, the astimezone() method can convert the time using the UTC offsets and then replace the current tzinfo with the new tzinfo. Note that both replace() and astimezone() return new datetime objects; they do not modify the current object.

There's more...

You can pass a tzinfos keyword argument into the dateutil parser to detect otherwise unrecognized time zones.

>>> parser.parse('Wednesday, Aug 4, 2010 at 6:30 p.m. (CDT)', fuzzy=True)
datetime.datetime(2010, 8, 4, 18, 30)
>>> tzinfos = {'CDT': tz.gettz('US/Central')}
>>> parser.parse('Wednesday, Aug 4, 2010 at 6:30 p.m. (CDT)', fuzzy=True, tzinfos=tzinfos)
datetime.datetime(2010, 8, 4, 18, 30, tzinfo=tzfile('/usr/share/zoneinfo/US/Central'))

In the first instance, we get a naive datetime since the time zone is not recognized. However, when we pass in the tzinfos mapping, we get a time zone aware datetime.

Local time zone

If you want to look up your local time zone, you can call tz.tzlocal(), which will use whatever your operating system thinks is the local time zone. On Ubuntu Linux, this is usually specified in the /etc/timezone file.

Custom offsets

You can create your own tzinfo object with a custom UTC offset using the tzoffset object. A custom offset of one hour can be created as follows:

>>> tz.tzoffset('custom', 3600)
tzoffset('custom', 3600)

You must provide a name as the first argument and the offset time in seconds as the second argument.
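Pulling the parser and tz pieces together, here is a minimal sketch of a helper that normalizes an arbitrary datetime string to a UTC-aware datetime object. The to_utc name and the default_zone parameter are our illustrative assumptions, not part of dateutil.

# to_utc_sketch.py
# A hedged sketch combining dateutil's parser and tz modules
# (Python 2.x). The helper name and default_zone are illustrative.
from dateutil import parser, tz

def to_utc(text, default_zone='US/Pacific'):
    dt = parser.parse(text, fuzzy=True)  # tolerate extraneous words
    if dt.tzinfo is None:                # naive: assume default_zone
        dt = dt.replace(tzinfo=tz.gettz(default_zone))
    return dt.astimezone(tz.tzutc())     # normalize to UTC

print to_utc('9/25/2010 at about 10:36AM')  # naive, assumed Pacific
print to_utc('2010-09-25T10:36:28Z')        # already UTC-aware

Guarding on tzinfo before calling astimezone() avoids the ValueError shown above for naive datetimes.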
Tagging temporal expressions with timex

The NLTK project has a little-known contrib repository that contains, among other things, a module called timex.py that can tag temporal expressions. A temporal expression is just one or more time words, such as "this week" or "next month". These are ambiguous expressions that are relative to some other point in time, such as when the text was written. The timex module provides a way to annotate text so these expressions can be extracted for further analysis. More on TIMEX can be found at http://timex2.mitre.org/

Getting ready

The timex.py module is part of the nltk_contrib package, which is separate from the current version of NLTK. This means you need to install it yourself, or simply use the timex.py module on its own. You can download timex.py directly from http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/timex.py

If you want to install the entire nltk_contrib package, you can check out the source at http://nltk.googlecode.com/svn/trunk/ and do sudo python setup.py install from within the nltk_contrib folder. If you do this, you'll need to do from nltk_contrib import timex instead of just import timex as done in the following How to do it... section. For this recipe, you have to download the timex.py module into the same folder as the rest of the code, so that import timex does not cause an ImportError.

You'll also need to get the egenix-mx-base package installed. This is a C extension library for Python, so if you have all the correct Python development headers installed, you should be able to do sudo pip install egenix-mx-base or sudo easy_install egenix-mx-base. If you're running Ubuntu Linux, you can instead do sudo apt-get install python-egenix-mxdatetime. If none of those work, you can go to http://www.egenix.com/products/python/mxBase/ to download the package and find installation instructions.

How to do it...

Using timex is very simple: pass a string into the timex.tag() function and get back an annotated string. The annotations will be XML TIMEX tags surrounding each temporal expression.

>>> import timex
>>> timex.tag("Let's go sometime this week")
"Let's go sometime <TIMEX2>this week</TIMEX2>"
>>> timex.tag("Tomorrow I'm going to the park.")
"<TIMEX2>Tomorrow</TIMEX2> I'm going to the park."

How it works...

The implementation of timex.py is essentially over 300 lines of conditional regular expression matches. When one of the known expressions matches, it creates a RelativeDateTime object (from the mx.DateTime module). This RelativeDateTime is then converted back to a string with surrounding TIMEX tags, and it replaces the original matched string in the text.

There's more...

timex is smart enough not to tag expressions that have already been tagged, so it is fine to pass TIMEX-tagged text into the tag() function.

>>> timex.tag("Let's go sometime <TIMEX2>this week</TIMEX2>")
"Let's go sometime <TIMEX2>this week</TIMEX2>"
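Once the text is annotated, pulling the tagged expressions out for further analysis is a small regular-expression job. Below is a minimal sketch; the extract_times helper is our own illustrative name, not part of the timex module, and it assumes timex.py is importable as described above.

# extract_times_sketch.py
# A hedged sketch: collect the contents of the TIMEX2 tags produced
# by timex.tag(). The helper name is illustrative, not part of timex.
import re
import timex  # the nltk_contrib module discussed above

def extract_times(text):
    tagged = timex.tag(text)
    return re.findall(r'<TIMEX2>(.*?)</TIMEX2>', tagged)

print extract_times("Let's go sometime this week or next month")
# expected, given tag()'s behavior shown above: ['this week', 'next month']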