Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-python-graphics-animation-principles
Packt
01 Dec 2010
7 min read
Save for later

Python graphics: animation principles

Packt
01 Dec 2010
7 min read
Animation is about making graphic objects move smoothly around a screen. The method to create the sensation of smooth dynamic action is simple: First present a picture to the viewer's eye. Allow the image to stay in view for about one-twentieth of a second. With a minimum of delay, present another picture where objects have been shifted by a small amount and repeat the process. Besides the obvious applications of making animated figures move around on a screen for entertainment, animating the results of computer code gives you powerful insights into how code works at a detailed level. Animation offers an extra dimension to the programmers' debugging arsenal. It provides you with an all encompassing, holistic view of software execution in progress that nothing else can. Static shifting of a ball with Python We make an image of a small colored disk and draw it in a sequence of different positions. How to do it... Execute the program shown and you will see a neat row of colored disks laid on top of each other going from top left to bottom right. The idea is to demonstrate the method of systematic position shifting. # moveball_1.py #>>>>>>>>>>>>> from Tkinter import * root = Tk() root.title("shifted sequence") cw = 250 # canvas width ch = 130 # canvas height chart_1 = Canvas(root, width=cw, height=ch, background="white") chart_1.grid(row=0, column=0) # The parameters determining the dimensions of the ball and its # position. # ===================================== posn_x = 1 # x position of box containing the ball (bottom) posn_y = 1 # y position of box containing the ball (left edge) shift_x = 3 # amount of x-movement each cycle of the 'for' loop shift_y = 2 # amount of y-movement each cycle of the 'for' loop ball_width = 12 # size of ball - width (x-dimension) ball_height = 12 # size of ball - height (y-dimension) color = "violet" # color of the ball for i in range(1,50): # end the program after 50 position shifts posn_x += shift_x posn_y += shift_y chart_1.create_oval(posn_x, posn_y, posn_x + ball_width, posn_y + ball_height, fill=color) root.mainloop() #>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> How it works... A simple ball is drawn on a canvas in a sequence of steps, one on top of the other. For each step, the position of the ball is shifted by three pixels as specified by the size of shift_x. Similarly, a downward shift of two pixels is applied by an amount to the value of shift_y. shift_x and shift_y only specify the amount of shift, but they do not make it happen. What makes it happen are the two commands posn_x += shift_x and posn_y += shift_y. posn is the abbreviation for position. posn_x += shift_x means "take the variable posn_x and add to it an amount shift_x." It is the same as posn_x = posn_x + shift_x. Another minor point to note is the use of the line continuation character, the backslash "". We use this when we want to continue the same Python command onto a following line to make reading easier. Strictly speaking for text inside brackets "(...)" this is not needed. In this particular case you can just insert a carriage return character. However, the backslash makes it clear to anyone reading your code what your intention is. There's more... The series of ball images in this recipe were drawn in a few microseconds. To create decent looking animation, we need to be able to slow the code execution down by just the right amount. We need to draw the equivalent of a movie frame onto the screen and keep it there for a measured time and then move on to the next, slightly shifted, image. This is done in the next recipe. Time-controlled shifting of a ball Here we introduce the time control function canvas.after(milliseconds) and the canvas.update() function that refreshes the image on the canvas. These are the cornerstones of animation in Python. Control of when code gets executed is made possible by the time module that comes with the standard Python library. How to do it... Execute the program as previously. What you will see is a diagonal row of disks being laid in a line with a short delay of one fifth of a second (200 milliseconds) between updates. The result is shown in the following screenshot showing the ball shifting in regular intervals. # timed_moveball_1.py #>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from Tkinter import * root = Tk() root.title("Time delayed ball drawing") cw = 300 # canvas width ch = 130 # canvas height chart_1 = Canvas(root, width=cw, height=ch, background="white") chart_1.grid(row=0, column=0) cycle_period = 200 # time between fresh positions of the ball # (milliseconds). # The parameters determining the dimensions of the ball and it's # position. posn_x = 1 # x position of box containing the ball (bottom). posn_y = 1 # y position of box containing the ball (left edge). shift_x = 3 # amount of x-movement each cycle of the 'for' loop. shift_y = 3 # amount of y-movement each cycle of the 'for' loop. ball_width = 12 # size of ball - width (x-dimension). ball_height = 12 # size of ball - height (y-dimension). color = "purple" # color of the ball for i in range(1,50): # end the program after 50 position shifts. posn_x += shift_x posn_y += shift_y chart_1.create_oval(posn_x, posn_y, posn_x + ball_width, posn_y + ball_height, fill=color) chart_1.update() # This refreshes the drawing on the canvas. chart_1.after(cycle_period) # This makes execution pause for 200 # milliseconds. root.mainloop() How it works... This recipe is the same as the previous one except for the canvas.after(...) and the canvas.update() methods. These are two functions that come from the Python library. The first gives you some control over code execution time by allowing you to specify delays in execution. The second forces the canvas to be completely redrawn with all the objects that should be there. There are more complicated ways of refreshing only portions of the screen, but they create difficulties so they will not be dealt with here. The canvas.after(your-chosen-milliseconds) method simply causes a timed-pause to the execution of the code. In all the preceding code, the pause is executed as fast as the computer can do it, then when the pause, invoked by the canvas.after() method is encountered, execution simply gets suspended for the specified number of milliseconds. At the end of the pause, execution continues as if nothing ever happened. The canvas.update() method forces everything on the canvas to be redrawn immediately rather than wait for some unspecified event to cause the canvas to be refreshed. There's more... The next step in effective animation is to erase the previous image of the object being animated shortly before a fresh, shifted clone is drawn on the canvas. This happens in the next example. The robustness of Tkinter It is also worth noting that Tkinter is robust. When you give position coordinates that are off the canvas, Python does not crash or freeze. It simply carries on drawing the object 'off-the-page'. The Tkinter canvas can be seen as just a tiny window into an almost unlimited universe of visual space. We only see objects when they move into the view of the camera which is the Tkinter canvas.
Read more
  • 0
  • 0
  • 12819

article-image-animating-graphic-objects-using-python
Packt
01 Dec 2010
9 min read
Save for later

Animating Graphic Objects using Python

Packt
01 Dec 2010
9 min read
Python 2.6 Graphics Cookbook Over 100 great recipes for creating and animating graphics using Python Create captivating graphics with ease and bring them to life using Python Apply effects to your graphics using powerful Python methods Develop vector as well as raster graphics and combine them to create wonders in the animation world Create interactive GUIs to make your creation of graphics simpler Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to accomplish the task of creation and animation of graphics as efficiently as possible        Precise collisions using floating point numbers Here the simulation flaws caused by the coarseness of integer arithmetic are eliminated by using floating point numbers for all ball position calculations. How to do it... All position, velocity, and gravity variables are made floating point by writing them with explicit decimal points. The result is shown in the following screenshot, showing the bouncing balls with trajectory tracing. from Tkinter import * root = Tk() root.title("Collisions with Floating point") cw = 350 # canvas width ch = 200 # canvas height GRAVITY = 1.5 chart_1 = Canvas(root, width=cw, height=ch, background="black") chart_1.grid(row=0, column=0) cycle_period = 80 # Time between new positions of the ball # (milliseconds). time_scaling = 0.2 # This governs the size of the differential steps # when calculating changes in position. # The parameters determining the dimensions of the ball and it's # position. ball_1 = {'posn_x':25.0, # x position of box containing the # ball (bottom). 'posn_y':180.0, # x position of box containing the # ball (left edge). 'velocity_x':30.0, # amount of x-movement each cycle of # the 'for' loop. 'velocity_y':100.0, # amount of y-movement each cycle of # the 'for' loop. 'ball_width':20.0, # size of ball - width (x-dimension). 'ball_height':20.0, # size of ball - height (y-dimension). 'color':"dark orange", # color of the ball 'coef_restitution':0.90} # proportion of elastic energy # recovered each bounce ball_2 = {'posn_x':cw - 25.0, 'posn_y':300.0, 'velocity_x':-50.0, 'velocity_y':150.0, 'ball_width':30.0, 'ball_height':30.0, 'color':"yellow3", 'coef_restitution':0.90} def detectWallCollision(ball): # Collision detection with the walls of the container if ball['posn_x'] > cw - ball['ball_width']: # Collision # with right-hand wall. ball['velocity_x'] = -ball['velocity_x'] * ball['coef_ restitution'] # reverse direction. ball['posn_x'] = cw - ball['ball_width'] if ball['posn_x'] < 1: # Collision with left-hand wall. ball['velocity_x'] = -ball['velocity_x'] * ball['coef_ restitution'] ball['posn_x'] = 2 # anti-stick to the wall if ball['posn_y'] < ball['ball_height'] : # Collision # with ceiling. ball['velocity_y'] = -ball['velocity_y'] * ball['coef_ restitution'] ball['posn_y'] = ball['ball_height'] if ball['posn_y'] > ch - ball['ball_height']: # Floor # collision. ball['velocity_y'] = - ball['velocity_y'] * ball['coef_ restitution'] ball['posn_y'] = ch - ball['ball_height'] def diffEquation(ball): # An approximate set of differential equations of motion # for the balls ball['posn_x'] += ball['velocity_x'] * time_scaling ball['velocity_y'] = ball['velocity_y'] + GRAVITY # a crude # equation incorporating gravity. ball['posn_y'] += ball['velocity_y'] * time_scaling chart_1.create_oval( ball['posn_x'], ball['posn_y'], ball['posn_x'] + ball['ball_width'], ball ['posn_y'] + ball['ball_height'], fill= ball['color']) detectWallCollision(ball) # Has the ball collided with # any container wall? for i in range(1,2000): # end the program after 1000 position shifts. diffEquation(ball_1) diffEquation(ball_2) chart_1.update() # This refreshes the drawing on the canvas. chart_1.after(cycle_period) # This makes execution pause for 200 # milliseconds. chart_1.delete(ALL) # This erases everything on the root.mainloop() How it works... Use of precision arithmetic has allowed us to notice simulation behavior that was previously hidden by the sins of integer-only calculations. This is the UNIQUE VALUE OF GRAPHIC SIMULATION AS A DEBUGGING TOOL. If you can represent your ideas in a visual way rather than as lists of numbers you will easily pick up subtle quirks in your code. The human brain is designed to function best in graphical images. It is a direct consequence of being a hunter. A graphic debugging tool... There is another very handy trick in the software debugger's arsenal and that is the visual trace. A trace is some kind of visual trail that shows the history of dynamic behavior. All of this is revealed in the next example. Trajectory tracing and ball-to-ball collisions Now we introduce one of the more difficult behaviors in our simulation of ever increasing complexity – the mid-air collision. The hardest thing when you are debugging a program is to try to hold in your short term memory some recently observed behavior and compare it meaningfully with present behavior. This kind of memory is an imperfect recorder. The way to overcome this is to create a graphic form of memory – some sort of picture that shows accurately what has been happening in the past. In the same way that military cannon aimers use glowing tracer projectiles to adjust their aim, a graphic programmer can use trajectory traces to examine the history of execution. How to do it... In our new code there is a new function called detect_ball_collision (ball_1, ball_2) whose job is to anticipate imminent collisions between the two balls no matter where they are. The collisions will come from any direction and therefore we need to be able to test all possible collision scenarios and examine the behavior of each one and see if it does not work as planned. This can be too difficult unless we create tools to test the outcome. In this recipe, the tool for testing outcomes is a graphic trajectory trace. It is a line that trails behind the path of the ball and shows exactly where it went right since the beginning of the simulation. The result is shown in the following screenshot, showing the bouncing with ball-to-ball collision rebounds. # kinetic_gravity_balls_1.py # >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from Tkinter import * import math root = Tk() root.title("Balls bounce off each other") cw = 300 # canvas width ch = 200 # canvas height GRAVITY = 1.5 chart_1 = Canvas(root, width=cw, height=ch, background="white") chart_1.grid(row=0, column=0) cycle_period = 80 # Time between new positions of the ball # (milliseconds). time_scaling = 0.2 # The size of the differential steps # The parameters determining the dimensions of the ball and its # position. ball_1 = {'posn_x':25.0, 'posn_y':25.0, 'velocity_x':65.0, 'velocity_y':50.0, 'ball_width':20.0, 'ball_height':20.0, 'color':"SlateBlue1", 'coef_restitution':0.90} ball_2 = {'posn_x':180.0, 'posn_y':ch- 25.0, 'velocity_x':-50.0, 'velocity_y':-70.0, 'ball_width':30.0, 'ball_height':30.0, 'color':"maroon1", 'coef_restitution':0.90} def detect_wall_collision(ball): # detect ball-to-wall collision if ball['posn_x'] > cw - ball['ball_width']: # Right-hand wall. ball['velocity_x'] = -ball['velocity_x'] * ball['coef_ restitution'] ball['posn_x'] = cw - ball['ball_width'] if ball['posn_x'] < 1: # Left-hand wall. ball['velocity_x'] = -ball['velocity_x'] * ball['coef_ restitution'] ball['posn_x'] = 2 if ball['posn_y'] < ball['ball_height'] : # Ceiling. ball['velocity_y'] = -ball['velocity_y'] * ball['coef_ restitution'] ball['posn_y'] = ball['ball_height'] if ball['posn_y'] > ch - ball['ball_height'] : # Floor ball['velocity_y'] = - ball['velocity_y'] * ball['coef_ restitution'] ball['posn_y'] = ch - ball['ball_height'] def detect_ball_collision(ball_1, ball_2): #detect ball-to-ball collision # firstly: is there a close approach in the horizontal direction if math.fabs(ball_1['posn_x'] - ball_2['posn_x']) < 25: # secondly: is there also a close approach in the vertical # direction. if math.fabs(ball_1['posn_y'] - ball_2['posn_y']) < 25: ball_1['velocity_x'] = -ball_1['velocity_x'] # reverse # direction. ball_1['velocity_y'] = -ball_1['velocity_y'] ball_2['velocity_x'] = -ball_2['velocity_x'] ball_2['velocity_y'] = -ball_2['velocity_y'] # to avoid internal rebounding inside balls ball_1['posn_x'] += ball_1['velocity_x'] * time_scaling ball_1['posn_y'] += ball_1['velocity_y'] * time_scaling ball_2['posn_x'] += ball_2['velocity_x'] * time_scaling ball_2['posn_y'] += ball_2['velocity_y'] * time_scaling def diff_equation(ball): x_old = ball['posn_x'] y_old = ball['posn_y'] ball['posn_x'] += ball['velocity_x'] * time_scaling ball['velocity_y'] = ball['velocity_y'] + GRAVITY ball['posn_y'] += ball['velocity_y'] * time_scaling chart_1.create_oval( ball['posn_x'], ball['posn_y'], ball['posn_x'] + ball['ball_width'], ball['posn_y'] + ball['ball_height'], fill= ball['color'], tags="ball_tag") chart_1.create_line( x_old, y_old, ball['posn_x'], ball ['posn_y'], fill= ball['color']) detect_wall_collision(ball) # Has the ball # collided with any container wall? for i in range(1,5000): diff_equation(ball_1) diff_equation(ball_2) detect_ball_collision(ball_1, ball_2) chart_1.update() chart_1.after(cycle_period) chart_1.delete("ball_tag") # Erase the balls but # leave the trajectories root.mainloop() How it works... Mid-air ball against ball collisions are done in two steps. In the first step, we test whether the two balls are close to each other inside a vertical strip defined by if math.fabs(ball_1['posn_x'] - ball_2['posn_x']) < 25. In plain English, this asks "Is the horizontal distance between the balls less than 25 pixels?" If the answer is yes, then the region of examination is narrowed down to a small vertical distance less than 25 pixels by the statement if math.fabs(ball_1['posn_y'] - ball_2['posn_y']) < 25. So every time the loop is executed, we sweep the entire canvas to see if the two balls are both inside an area where their bottom-left corners are closer than 25 pixels to each other. If they are that close then we simply cause a rebound off each other by reversing their direction of travel in both the horizontal and vertical directions. There's more... Simply reversing the direction is not the mathematically correct way to reverse the direction of colliding balls. Certainly billiard balls do not behave that way. The law of physics that governs colliding spheres demands that momentum be conserved. Why do we sometimes get tkinter.TckErrors? If we click the close window button (the X in the top right) while Python is paused, when Python revives and then calls on Tcl (Tkinter) to draw something on the canvas we will get an error message. What probably happens is that the application has already shut down, but Tcl has unfinished business. If we allow the program to run to completion before trying to shut the window then termination is orderly.
Read more
  • 0
  • 0
  • 8314

article-image-parsing-specific-data-python-text-processing
Packt
23 Nov 2010
12 min read
Save for later

Parsing Specific Data in Python Text Processing

Packt
23 Nov 2010
12 min read
Python Text Processing with NLTK 2.0 Cookbook Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities. Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond Learn how machines and crawlers interpret and process natural languages Easily work with huge amounts of data and learn how to handle distributed processing Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible  Introduction This article covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Luckily, there are a number of useful libraries for accomplishing this, so we don't have to delve into tricky and overly complicated regular expressions. These libraries can be great complements to the NLTK: dateutil: Provides date/time parsing and time zone conversion timex: Can identify time words in text lxml and BeautifulSoup: Can parse, clean, and convert HTML chardet: Detects the character encoding of text The libraries can be useful for pre-processing text before passing it to an NLTK object, or post-processing text that has been processed and extracted using NLTK. Here's an example that ties many of these tools together. Let's say you need to parse a blog article about a restaurant. You can use lxml or BeautifulSoup to extract the article text, outbound links, and the date and time when the article was written. The date and time can then be parsed to a Python datetime object with dateutil. Once you have the article text, you can use chardet to ensure it's UTF-8 before cleaning out the HTML and running it through NLTK-based part-of-speech tagging, chunk extraction, and/or text classification, to create additional metadata about the article. If there's an event happening at the restaurant, you may be able to discover that by looking at the time words identified by timex. The point of this example is that real-world text processing often requires more than just NLTK-based natural language processing, and the functionality covered in this article can help with those additional requirements. Parsing dates and times with Dateutil If you need to parse dates and times in Python, there is no better library than dateutil. The parser module can parse datetime strings in many more formats than can be shown here, while the tz module provides everything you need for looking up time zones. Combined, these modules make it quite easy to parse strings into time zone aware datetime objects. Getting ready You can install dateutil using pip or easy_install, that is sudo pip install dateutil or sudo easy_install dateutil. Complete documentation can be found at http://labix.org/python-dateutil How to do it... Let's dive into a few parsing examples: >>> from dateutil import parser >>> parser.parse('Thu Sep 25 10:36:28 2010') datetime.datetime(2010, 9, 25, 10, 36, 28) >>> parser.parse('Thursday, 25. September 2010 10:36AM') datetime.datetime(2010, 9, 25, 10, 36) >>> parser.parse('9/25/2010 10:36:28') datetime.datetime(2010, 9, 25, 10, 36, 28) >>> parser.parse('9/25/2010') datetime.datetime(2010, 9, 25, 0, 0) >>> parser.parse('2010-09-25T10:36:28Z') datetime.datetime(2010, 9, 25, 10, 36, 28, tzinfo=tzutc()) As you can see, all it takes is importing the parser module and calling the parse() function with a datetime string. The parser will do its best to return a sensible datetime object, but if it cannot parse the string, it will raise a ValueError. How it works... The parser does not use regular expressions. Instead, it looks for recognizable tokens and does its best to guess what those tokens refer to. The order of these tokens matters, for example, some cultures use a date format that looks like Month/Day/Year (the default order) while others use a Day/Month/Year format. To deal with this, the parse() function takes an optional keyword argument dayfirst, which defaults to False. If you set it to True, it can correctly parse dates in the latter format. >>> parser.parse('25/9/2010', dayfirst=True) datetime.datetime(2010, 9, 25, 0, 0) Another ordering issue can occur with two-digit years. For example, '10-9-25' is ambiguous. Since dateutil defaults to the Month-Day-Year format, '10-9-25' is parsed to the year 2025. But if you pass yearfirst=True into parse(), it will be parsed to the year 2010. >>> parser.parse('10-9-25') datetime.datetime(2025, 10, 9, 0, 0) >>> parser.parse('10-9-25', yearfirst=True) datetime.datetime(2010, 9, 25, 0, 0) There's more... The dateutil parser can also do fuzzy parsing, which allows it to ignore extraneous characters in a datetime string. With the default value of False, parse() will raise a ValueError when it encounters unknown tokens. But if fuzzy=True, then a datetime object can usually be returned. >>> try: ... parser.parse('9/25/2010 at about 10:36AM') ... except ValueError: ... 'cannot parse' 'cannot parse' >>> parser.parse('9/25/2010 at about 10:36AM', fuzzy=True) datetime.datetime(2010, 9, 25, 10, 36) Time zone lookup and conversion Most datetime objects returned from the dateutil parser are naive, meaning they don't have an explicit tzinfo, which specifies the time zone and UTC offset. In the previous recipe, only one of the examples had a tzinfo, and that's because it's in the standard ISO format for UTC date and time strings. UTC is the coordinated universal time, and is the same as GMT. ISO is the International Standards Organization, which among other things, specifies standard date and time formatting. Python datetime objects can either be naive or aware. If a datetime object has a tzinfo, then it is aware. Otherwise the datetime is naive. To make a naive datetime object time one aware, you must give it an explicit tzinfo. However, the Python datetime library only defines an abstract base class for tzinfo, and leaves it up to the others to actually implement tzinfo creation. This is where the tz module of dateutil comes in—it provides everything you need to lookup time zones from your OS time zone data. Getting ready dateutil should be installed using pip or easy_install. You should also make sure your operating system has time zone data. On Linux, this is usually found in /usr/share/zoneinfo, and the Ubuntu package is called tzdata. If you have a number of files and directories in /usr/share/zoneinfo, such as America/, Europe/, and so on, then you should be ready to proceed. The following examples show directory paths for Ubuntu Linux. How to do it... Let's start by getting a UTC tzinfo object. This can be done by calling tz.tzutc(), and you can check that the offset is 0 by calling the utcoffset() method with a UTC datetime object. >>> from dateutil import tz >>> tz.tzutc() tzutc() >>> import datetime >>> tz.tzutc().utcoffset(datetime.datetime.utcnow()) datetime.timedelta(0) To get tzinfo objects for other time zones, you can pass in a time zone file path to the gettz() function. >>> tz.gettz('US/Pacific') tzfile('/usr/share/zoneinfo/US/Pacific') >>> tz.gettz('US/Pacific').utcoffset(datetime.datetime.utcnow()) datetime.timedelta(-1, 61200) >>> tz.gettz('Europe/Paris') tzfile('/usr/share/zoneinfo/Europe/Paris') >>> tz.gettz('Europe/Paris').utcoffset(datetime.datetime.utcnow()) datetime.timedelta(0, 7200) You can see the UTC offsets are timedelta objects, where the first number is days, and the second number is seconds. If you're storing datetimes in a database, it's a good idea to store them all in UTC to eliminate any time zone ambiguity. Even if the database can recognize time zones, it's still a good practice. To convert a non-UTC datetime object to UTC, it must be made time zone aware. If you try to convert a naive datetime to UTC, you'll get a ValueError exception. To make a naive datetime time zone aware, you simply call the replace() method with the correct tzinfo. Once a datetime object has a tzinfo, then UTC conversion can be performed by calling the astimezone() method with tz.tzutc(). >>> pst = tz.gettz('US/Pacific') >>> dt = datetime.datetime(2010, 9, 25, 10, 36) >>> dt.tzinfo >>> dt.astimezone(tz.tzutc()) Traceback (most recent call last): File "/usr/lib/python2.6/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "<doctest __main__[22]>", line 1, in <module> dt.astimezone(tz.tzutc()) ValueError: astimezone() cannot be applied to a naive datetime >>> dt.replace(tzinfo=pst) datetime.datetime(2010, 9, 25, 10, 36, tzinfo=tzfile('/usr/share/ zoneinfo/US/Pacific')) >>> dt.replace(tzinfo=pst).astimezone(tz.tzutc()) datetime.datetime(2010, 9, 25, 17, 36, tzinfo=tzutc()) How it works... The tzutc and tzfile objects are both subclasses of tzinfo. As such, they know the correct UTC offset for time zone conversion (which is 0 for tzutc). A tzfile object knows how to read your operating system's zoneinfo files to get the necessary offset data. The replace() method of a datetime object does what its name implies—it replaces attributes. Once a datetime has a tzinfo, the astimezone() method will be able to convert the time using the UTC offsets, and then replace the current tzinfo with the new tzinfo. Note that both replace() and astimezone() return new datetime objects. They do not modify the current object. There's more... You can pass a tzinfos keyword argument into the dateutil parser to detect otherwise unrecognized time zones. >>> parser.parse('Wednesday, Aug 4, 2010 at 6:30 p.m. (CDT)', fuzzy=True) datetime.datetime(2010, 8, 4, 18, 30) >>> tzinfos = {'CDT': tz.gettz('US/Central')} >>> parser.parse('Wednesday, Aug 4, 2010 at 6:30 p.m. (CDT)', fuzzy=True, tzinfos=tzinfos) datetime.datetime(2010, 8, 4, 18, 30, tzinfo=tzfile('/usr/share/ zoneinfo/US/Central')) In the first instance, we get a naive datetime since the time zone is not recognized. However, when we pass in the tzinfos mapping, we get a time zone aware datetime. Local time zone If you want to lookup your local time zone, you can call tz.tzlocal(), which will use whatever your operating system thinks is the local time zone. In Ubuntu Linux, this is usually specified in the /etc/timezone file. Custom offsets You can create your own tzinfo object with a custom UTC offset using the tzoffset object. A custom offset of one hour can be created as follows: >>> tz.tzoffset('custom', 3600) tzoffset('custom', 3600) You must provide a name as the first argument, and the offset time in seconds as the second argument. Tagging temporal expressions with Timex The NLTK project has a little known contrib repository that contains, among other things, a module called timex.py that can tag temporal expressions. A temporal expression is just one or more time words, such as "this week", or "next month". These are ambiguous expressions that are relative to some other point in time, like when the text was written. The timex module provides a way to annotate text so these expressions can be extracted for further analysis. More on TIMEX can be found at http://timex2.mitre.org/ Getting ready The timex.py module is part of the nltk_contrib package, which is separate from the current version of NLTK. This means you need to install it yourself, or use the timex.py module. You can also download timex.py directly from http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/timex.py If you want to install the entire nltk_contrib package, you can check out the source at http://nltk.googlecode.com/svn/trunk/ and do sudo python setup.py install from within the nltk_contrib folder. If you do this, you'll need to do from nltk_contrib import timex instead of just import timex as done in the following How to do it... section. For this recipe, you have to download the timex.py module into the same folder as the rest of the code, so that import timex does not cause an ImportError. You'll also need to get the egenix-mx-base package installed. This is a C extension library for Python, so if you have all the correct Python development headers installed, you should be able to do sudo pip install egenix-mx-base or sudo easy_install egenix-mxbase. If you're running Ubuntu Linux, you can instead do sudo apt-get install pythonegenix-mxdatetime. If none of those work, you can go to http://www.egenix.com/products/python/mxBase/ to download the package and find installation instructions. How to do it... Using timex is very simple: pass a string into the timex.tag() function and get back an annotated string. The annotations will be XML TIMEX tags surrounding each temporal expression. >>> import timex >>> timex.tag("Let's go sometime this week") "Let's go sometime <TIMEX2>this week</TIMEX2>" >>> timex.tag("Tomorrow I'm going to the park.") "<TIMEX2>Tomorrow</TIMEX2> I'm going to the park." How it works... The implementation of timex.py is essentially over 300 lines of conditional regular expression matches. When one of the known expressions match, it creates a RelativeDateTime object (from the mx.DateTime module). This RelativeDateTime is then converted back to a string with surrounding TIMEX tags and replaces the original matched string in the text. There's more... timex is smart enough not to tag expressions that have already been tagged, so it's ok to pass TIMEX tagged text into the tag() function. >>> timex.tag("Let's go sometime <TIMEX2>this week</TIMEX2>") "Let's go sometime <TIMEX2>this week</TIMEX2>"
Read more
  • 0
  • 0
  • 10218

article-image-python-graphics-combining-raster-and-vector-pictures
Packt
23 Nov 2010
12 min read
Save for later

Python Graphics: Combining Raster and Vector Pictures

Packt
23 Nov 2010
12 min read
  Python 2.6 Graphics Cookbook Over 100 great recipes for creating and animating graphics using Python Create captivating graphics with ease and bring them to life using Python Apply effects to your graphics using powerful Python methods Develop vector as well as raster graphics and combine them to create wonders in the animation world Create interactive GUIs to make your creation of graphics simpler Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to accomplish the task of creation and animation of graphics as efficiently as possible Because we are not altering and manipulating the actual properties of the images we do not need the Python Imaging Library (PIL) in this chapter. We need to work exclusively with GIF format images because that is what Tkinter deals with. We will also see how to use "The GIMP" as a tool to prepare images suitable for animation. Simple animation of a GIF beach ball We want to animate a raster image, derived from a photograph. To keep things simple and clear we are just going to move a photographic image (in GIF format) of a beach ball across a black background. Getting ready We need a suitable GIF image of an object that we want to animate. An example of one, named beachball.gif has been provided. How to do it... Copy a .gif fle from somewhere and paste it into a directory where you want to keep your work-in-progress pictures. Ensure that the path in our computer's fle system leads to the image to be used. In the example below, the instruction, ball = PhotoImage(file = "constr/pics2/beachball.gif") says that the image to be used will be found in a directory (folder) called pics2, which is a sub-folder of another folder called constr. Then execute the following code. # photoimage_animation_1.py #>>>>>>>>>>>>>>>>>>>>>>>> from Tkinter import * root = Tk() cycle_period = 100 cw = 300 # canvas width ch = 200 # canvas height canvas_1 = Canvas(root, width=cw, height=ch, bg="black") canvas_1.grid(row=0, column=1) posn_x = 10 posn_y = 10 shift_x = 2 shift_y = 1 ball = PhotoImage(file = "/constr/pics2/beachball.gif") for i in range(1,100): # end the program after 100 position shifts. posn_x += shift_x posn_y += shift_y canvas_1.create_image(posn_x,posn_y,anchor=NW, image=ball) canvas_1.update() # This refreshes the drawing on the canvas. canvas_1.after(cycle_period) # This makes execution pause for 100 milliseconds. canvas_1.delete(ALL) # This erases everything on the canvas. root.mainloop() How it Works The image of the beach ball is shifted across a canvas. The photo type images always occupy a rectangular area of screen. The size of this box, called the bounding box, is the size of the image. We have used a black background so the black corners on the image of our beach ball cannot be seen. The vector walking creature We make a pair of walking legs using the vector graphics. We want to use these legs together with pieces of raster images and see how far we can go in making appealing animations. We import Tkinter, math, and time modules. The math is needed to provide the trigonometry that sustains the geometric relations that move the parts of the leg in relation to each other. Getting ready We will be using Tkinter and time modules to animate the movement of lines and circles. You will see some trigonometry in the code. If you do not like mathematics you can just cut and paste the code without the need to understand exactly how the maths works. However, if you are a friend of mathematics it is fun to watch sine, cosine, and tangent working together to make a child smile. How to do it... Execute the program as shown in the previous image. # walking_creature_1.py # >>>>>>>>>>>>>>>> from Tkinter import * import math import time root = Tk() root.title("The thing that Strides") cw = 400 # canvas width ch = 100 # canvas height #GRAVITY = 4 chart_1 = Canvas(root, width=cw, height=ch, background="white") chart_1.grid(row=0, column=0) cycle_period = 100 # time between new positions of the ball (milliseconds). base_x = 20 base_y = 100 hip_h = 40 thy = 20 #=============================================== # Hip positions: Nhip = 2 x Nstep, the number of steps per foot per stride. hip_x = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 60, 60] #15 hip_y = [0, 8, 12, 16, 12, 8, 0, 0, 0, 8, 12, 16, 12, 8, 0] #15 step_x = [0, 10, 20, 30, 40, 50, 60, 60] # 8 = Nhip step_y = [0, 35, 45, 50, 43, 32, 10, 0] # The merging of the separate x and y lists into a single sequence. #================================== # Given a line joining two points xy0 and xy1, the base of an isosceles triangle, # as well as the length of one side, "thy" . This returns the coordinates of # the apex joining the equal-length sides. def kneePosition(x0, y0, x1, y1, thy): theta_1 = math.atan2((y1 - y0), (x1 - x0)) L1 = math.sqrt( (y1 - y0)**2 + (x1 - x0)**2) if L1/2 < thy: # The sign of alpha determines which way the knees bend. alpha = -math.acos(L1/(2*thy)) # Avian #alpha = math.acos(L1/(2*thy)) # Mammalian else: alpha = 0.0 theta_2 = alpha + theta_1 x_knee = x0 + thy * math.cos(theta_2) y_knee = y0 + thy * math.sin(theta_2) return x_knee, y_knee def animdelay(): chart_1.update() # This refreshes the drawing on the canvas. chart_1.after(cycle_period) # This makes execution pause for 200 milliseconds. chart_1.delete(ALL) # This erases *almost* everything on the canvas. # Does not delete the text from inside a function. bx_stay = base_x by_stay = base_y for j in range(0,11): # Number of steps to be taken - arbitrary. astep_x = 60*j bstep_x = astep_x + 30 cstep_x = 60*j + 15 aa = len(step_x) -1 for k in range(0,len(hip_x)-1): # Motion of the hips in a stride of each foot. cx0 = base_x + cstep_x + hip_x[k] cy0 = base_y - hip_h - hip_y[k] cx1 = base_x + cstep_x + hip_x[k+1] cy1 = base_y - hip_h - hip_y[k+1] chart_1.create_line(cx0, cy0 ,cx1 ,cy1) chart_1.create_oval(cx1-10 ,cy1-10 ,cx1+10 ,cy1+10, fill="orange") if k >= 0 and k <= len(step_x)-2: # Trajectory of the right foot. ax0 = base_x + astep_x + step_x[k] ax1 = base_x + astep_x + step_x[k+1] ay0 = base_y - step_y[k] ay1 = base_y - step_y[k+1] ax_stay = ax1 ay_stay = ay1 if k >= len(step_x)-1 and k <= 2*len(step_x)-2: # Trajectory of the left foot. bx0 = base_x + bstep_x + step_x[k-aa] bx1 = base_x + bstep_x + step_x[k-aa+1] by0 = base_y - step_y[k-aa] by1 = base_y - step_y[k-aa+1] bx_stay = bx1 by_stay = by1 aknee_xy = kneePosition(ax_stay, ay_stay, cx1, cy1, thy) chart_1.create_line(ax_stay, ay_stay ,aknee_xy[0], aknee_xy[1], width = 3, fill="orange") chart_1.create_line(cx1, cy1 ,aknee_xy[0], aknee_xy[1], width = 3, fill="orange") chart_1.create_oval(ax_stay-5 ,ay1-5 ,ax1+5 ,ay1+5, fill="green") chart_1.create_oval(bx_stay-5 ,by_stay-5 ,bx_stay+5 ,by_stay+5, fill="blue") bknee_xy = kneePosition(bx_stay, by_stay, cx1, cy1, thy) chart_1.create_line(bx_stay, by_stay ,bknee_xy[0], bknee_xy[1], width = 3, fill="pink") chart_1.create_line(cx1, cy1 ,bknee_xy[0], bknee_xy[1], width = 3, fill="pink") animdelay() root.mainloop() How it works... Without getting bogged down in detail, the strategy in the program consists of defning the motion of a foot while walking one stride. This motion is defned by eight relative positions given by the two lists step_x (horizontal) and step_y (vertical). The motion of the hips is given by a separate pair of x- and y-positions hip_x and hip_y. Trigonometry is used to work out the position of the knee on the assumption that the thigh and lower leg are the same length. The calculation is based on the sine rule taught in high school. Yes, we do learn useful things at school! The time-animation regulation instructions are assembled together as a function animdelay(). There's more In Python math module, two arc-tangent functions are available for calculating angles given the lengths of two adjacent sides. atan2(y,x) is the best because it takes care of the crazy things a tangent does on its way around a circle - tangent ficks from minus infnity to plus infnity as it passes through 90 degrees and any multiples thereof. A mathematical knee is quite happy to bend forward or backward in satisfying its equations. We make the sign of the angle negative for a backward-bending bird knee and positive for a forward bending mammalian knee. More Info Section 1 This animated walking hips-and-legs is used in the recipes that follow this to make a bird walk in the desert, a diplomat in palace grounds, and a spider in a forest. Bird with shoes walking in the Karroo We now coordinate the movement of four GIF images and the striding legs to make an Apteryx (a fightless bird like the kiwi) that walks. Getting ready We need the following GIF images: A background picture of a suitable landscape A bird body without legs A pair of garish-colored shoes to make the viewer smile The walking avian legs of the previous recipe The images used are karroo.gif, apteryx1.gif, and shoe1.gif. Note that the images of the bird and the shoe have transparent backgrounds which means there is no rectangular background to be seen surrounding the bird or the shoe. In the recipe following this one, we will see the simplest way to achieve the necessary transparency. How to do it... Execute the program shown in the usual way. # walking_birdy_1.py # >>>>>>>>>>>>>>>> from Tkinter import * import math import time root = Tk() root.title("A Walking birdy gif and shoes images") cw = 800 # canvas width ch = 200 # canvas height #GRAVITY = 4 chart_1 = Canvas(root, width=cw, height=ch, background="white") chart_1.grid(row=0, column=0) cycle_period = 80 # time between new positions of the ball (milliseconds). im_backdrop = "/constr/pics1/karoo.gif" im_bird = "/constr/pics1/apteryx1.gif" im_shoe = "/constr/pics1/shoe1.gif" birdy =PhotoImage(file= im_bird) shoey =PhotoImage(file= im_shoe) backdrop = PhotoImage(file= im_backdrop) chart_1.create_image(0 ,0 ,anchor=NW, image=backdrop) base_x = 20 base_y = 190 hip_h = 70 thy = 60 #========================================== # Hip positions: Nhip = 2 x Nstep, the number of steps per foot per stride. hip_x = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 60, 60] #15 hip_y = [0, 8, 12, 16, 12, 8, 0, 0, 0, 8, 12, 16, 12, 8, 0] #15 step_x = [0, 10, 20, 30, 40, 50, 60, 60] # 8 = Nhip step_y = [0, 35, 45, 50, 43, 32, 10, 0] #============================================= # Given a line joining two points xy0 and xy1, the base of an isosceles triangle, # as well as the length of one side, "thy" this returns the coordinates of # the apex joining the equal-length sides. def kneePosition(x0, y0, x1, y1, thy): theta_1 = math.atan2(-(y1 - y0), (x1 - x0)) L1 = math.sqrt( (y1 - y0)**2 + (x1 - x0)**2) alpha = math.atan2(hip_h,L1) theta_2 = -(theta_1 - alpha) x_knee = x0 + thy * math.cos(theta_2) y_knee = y0 + thy * math.sin(theta_2) return x_knee, y_knee def animdelay(): chart_1.update() # Refresh the drawing on the canvas. chart_1.after(cycle_period) # Pause execution pause for X millise-conds. chart_1.delete("walking") # Erases everything on the canvas. bx_stay = base_x by_stay = base_y for j in range(0,13): # Number of steps to be taken - arbitrary. astep_x = 60*j bstep_x = astep_x + 30 cstep_x = 60*j + 15 aa = len(step_x) -1 for k in range(0,len(hip_x)-1): # Motion of the hips in a stride of each foot. cx0 = base_x + cstep_x + hip_x[k] cy0 = base_y - hip_h - hip_y[k] cx1 = base_x + cstep_x + hip_x[k+1] cy1 = base_y - hip_h - hip_y[k+1] #chart_1.create_image(cx1-55 ,cy1+20 ,anchor=SW, image=birdy, tag="walking") if k >= 0 and k <= len(step_x)-2: # Trajectory of the right foot. ax0 = base_x + astep_x + step_x[k] ax1 = base_x + astep_x + step_x[k+1] ay0 = base_y - 10 - step_y[k] ay1 = base_y - 10 -step_y[k+1] ax_stay = ax1 ay_stay = ay1 if k >= len(step_x)-1 and k <= 2*len(step_x)-2: # Trajectory of the left foot. bx0 = base_x + bstep_x + step_x[k-aa] bx1 = base_x + bstep_x + step_x[k-aa+1] by0 = base_y - 10 - step_y[k-aa] by1 = base_y - 10 - step_y[k-aa+1] bx_stay = bx1 by_stay = by1 chart_1.create_image(ax_stay-5 ,ay_stay + 10 ,anchor=SW, im-age=shoey, tag="walking") chart_1.create_image(bx_stay-5 ,by_stay + 10 ,anchor=SW, im-age=shoey, tag="walking") aknee_xy = kneePosition(ax_stay, ay_stay, cx1, cy1, thy) chart_1.create_line(ax_stay, ay_stay-15 ,aknee_xy[0], aknee_xy[1], width = 5, fill="orange", tag="walking") chart_1.create_line(cx1, cy1 ,aknee_xy[0], aknee_xy[1], width = 5, fill="orange", tag="walking") bknee_xy = kneePosition(bx_stay, by_stay, cx1, cy1, thy) chart_1.create_line(bx_stay, by_stay-15 ,bknee_xy[0], bknee_xy[1], width = 5, fill="pink", tag="walking") chart_1.create_line(cx1, cy1 ,bknee_xy[0], bknee_xy[1], width = 5, fill="pink", tag="walking") chart_1.create_image(cx1-55 ,cy1+20 ,anchor=SW, image=birdy, tag="walking") animdelay() root.mainloop() How it works... The same remarks concerning the trigonometry made in the previous recipe apply here. What we see here now is the ease with which vector objects and raster images can be combined once suitable GIF images have been prepared. There's more... For teachers and their students who want to make lessons on a computer, these techniques offer all kinds of possibilities like history tours and re-enactments, geography tours, and, science experiments. Get the students to do projects telling stories. Animated year books?
Read more
  • 0
  • 0
  • 5142

article-image-python-text-processing-nltk-20-creating-custom-corpora
Packt
18 Nov 2010
12 min read
Save for later

Python text processing with NLTK 2.0: creating custom corpora

Packt
18 Nov 2010
12 min read
In this article, we'll cover how to use corpus readers and create custom corpora. At the same time, you'll learn how to use the existing corpus data that comes with NLTK. We'll also cover creating custom corpus readers, which can be used when your corpus is not in a file format that NLTK already recognizes, or if your corpus is not in files at all, but instead is located in a database such as MongoDB. Setting up a custom corpus A corpus is a collection of text documents, and corpora is the plural of corpus. So a custom corpus is really just a bunch of text files in a directory, often alongside many other directories of text files. Getting ready You should already have the NLTK data package installed, following the instructions at http://www.nltk.org/data. We'll assume that the data is installed to C:nltk_data on Windows, and /usr/share/nltk_data on Linux, Unix, or Mac OS X. How to do it... NLTK defines a list of data directories, or paths, in nltk.data.path. Our custom corpora must be within one of these paths so it can be found by NLTK. So as not to conflict with the official data package, we'll create a custom nltk_data directory in our home directory. Here's some Python code to create this directory and verify that it is in the list of known paths specified by nltk.data.path: >>> import os, os.path >>> path = os.path.expanduser('~/nltk_data') >>> if not os.path.exists(path): ... os.mkdir(path) >>> os.path.exists(path) True >>> import nltk.data >>> path in nltk.data.path True If the last line, path in nltk.data.path, is True, then you should now have a nltk_ data directory in your home directory. The path should be %UserProfile%nltk_data on Windows, or ~/nltk_data on Unix, Linux, or Mac OS X. For simplicity, I'll refer to the directory as ~/nltk_data. If the last line does not return True, try creating the nltk_data directory manually in your home directory, then verify that the absolute path is in nltk.data.path. It's essential to ensure that this directory exists and is in nltk.data.path before continuing. Once you have your nltk_data directory, the convention is that corpora reside in a corpora subdirectory. Create this corpora directory within the nltk_data directory, so that the path is ~/nltk_data/corpora. Finally, we'll create a subdirectory in corpora to hold our custom corpus. Let's call it cookbook, giving us the full path of ~/nltk_data/corpora/cookbook. Now we can create a simple word list file and make sure it loads. The source code for this article can be downloaded here. Consider a word list file called mywords.txt. Put this file into ~/nltk_data/corpora/cookbook/. Now we can use nltk.data.load() to load the file. >>> import nltk.data >>> nltk.data.load('corpora/cookbook/mywords.txt', format='raw') 'nltkn' We need to specify format='raw' since nltk.data.load() doesn't know how to interpret .txt files. As we'll see, it does know how to interpret a number of other file formats. How it works... The nltk.data.load() function recognizes a number of formats, such as 'raw', 'pickle', and 'yaml'. If no format is specified, then it tries to guess the format based on the file's extension. In the previous case, we have a .txt file, which is not a recognized extension, so we have to specify the 'raw' format. But if we used a file that ended in .yaml, then we would not need to specify the format. Filenames passed in to nltk.data.load() can be absolute or relative paths. Relative paths must be relative to one of the paths specified in nltk.data.path. The file is found using nltk.data.find(path), which searches all known paths combined with the relative path. Absolute paths do not require a search, and are used as is. There's more... For most corpora access, you won't actually need to use nltk.data.load, as that will be handled by the CorpusReader classes covered in the following recipes. But it's a good function to be familiar with for loading .pickle files and .yaml files, plus it introduces the idea of putting all of your data files into a path known by NLTK. Loading a YAML file If you put the synonyms.yaml file into ~/nltk_data/corpora/cookbook (next to mywords.txt), you can use nltk.data. load() to load it without specifying a format. >>> import nltk.data >>> nltk.data.load('corpora/cookbook/synonyms.yaml') {'bday': 'birthday'} This assumes that PyYAML is installed. If not, you can find download and installation instructions at http://pyyaml.org/wiki/PyYAML. See also In the next recipes, we'll cover various corpus readers, and then in the Lazy corpus loading recipe, we'll use the LazyCorpusLoader, which expects corpus data to be in a corpora subdirectory of one of the paths specified by nltk.data.path. Creating a word list corpus The WordListCorpusReader is one of the simplest CorpusReader classes. It provides access to a file containing a list of words, one word per line. Getting ready We need to start by creating a word list file. This could be a single column CSV file, or just a normal text file with one word per line. Let's create a file named wordlist that looks like this: nltk corpus corpora wordnet How to do it... Now we can instantiate a WordListCorpusReader that will produce a list of words from our file. It takes two arguments: the directory path containing the files, and a list of filenames. If you open the Python console in the same directory as the files, then '.' can be used as the directory path. Otherwise, you must use a directory path such as: 'nltk_data/corpora/ cookbook'. >>> from nltk.corpus.reader import WordListCorpusReader >>> reader = WordListCorpusReader('.', ['wordlist']) >>> reader.words() ['nltk', 'corpus', 'corpora', 'wordnet'] >>> reader.fileids() ['wordlist'] How it works... WordListCorpusReader inherits from CorpusReader, which is a common base class for all corpus readers. CorpusReader does all the work of identifying which files to read, while WordListCorpus reads the files and tokenizes each line to produce a list of words. Here's an inheritance diagram: When you call the words() function, it calls nltk.tokenize.line_tokenize() on the raw file data, which you can access using the raw() function. >>> reader.raw() 'nltkncorpusncorporanwordnetn' >>> from nltk.tokenize import line_tokenize >>> line_tokenize(reader.raw()) ['nltk', 'corpus', 'corpora', 'wordnet'] There's more... The stopwords corpus is a good example of a multi-file WordListCorpusReader. Names corpus Another word list corpus that comes with NLTK is the names corpus. It contains two files: female.txt and male.txt, each containing a list of a few thousand common first names organized by gender. >>> from nltk.corpus import names >>> names.fileids() ['female.txt', 'male.txt'] >>> len(names.words('female.txt')) 5001 >>> len(names.words('male.txt')) 2943 English words NLTK also comes with a large list of English words. There's one file with 850 basic words, and another list with over 200,000 known English words. >>> from nltk.corpus import words >>> words.fileids() ['en', 'en-basic'] >>> len(words.words('en-basic')) 850 >>> len(words.words('en')) 234936 Creating a part-of-speech tagged word corpus Part-of-speech tagging is the process of identifying the part-of-speech tag for a word. Most of the time, a tagger must first be trained on a training corpus. Let us take a look at how to create and use a training corpus of part-of-speech tagged words. Getting ready The simplest format for a tagged corpus is of the form "word/tag". Following is an excerpt from the brown corpus: The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/ jj ./. Each word has a tag denoting its part-of-speech. For example, nn refers to a noun, while a tag that starts with vb is a verb. How to do it... If you were to put the previous excerpt into a file called brown.pos, you could then create a TaggedCorpusReader and do the following: >>> from nltk.corpus.reader import TaggedCorpusReader >>> reader = TaggedCorpusReader('.', r'.*.pos') >>> reader.words() ['The', 'expense', 'and', 'time', 'involved', 'are', ...] >>> reader.tagged_words() [('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), …] >>> reader.sents() [['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']] >>> reader.tagged_sents() [[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]] >>> reader.paras() [[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]] >>> reader.tagged_paras() [[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]] How it works... This time, instead of naming the file explicitly, we use a regular expression, r'.*.pos', to match all files whose name ends with .pos. We could have done the same thing as we did with the WordListCorpusReader, and pass ['brown.pos'] as the second argument, but this way you can see how to include multiple files in a corpus without naming each one explicitly. TaggedCorpusReader provides a number of methods for extracting text from a corpus. First, you can get a list of all words, or a list of tagged tokens. A tagged token is simply a tuple of (word, tag). Next, you can get a list of every sentence, and also every tagged sentence, where the sentence is itself a list of words or tagged tokens. Finally, you can get a list of paragraphs, where each paragraph is a list of sentences, and each sentence is a list of words or tagged tokens. Here's an inheritance diagram listing all the major methods: There's more... The functions demonstrated in the previous diagram all depend on tokenizers for splitting the text. TaggedCorpusReader tries to have good defaults, but you can customize them by passing in your own tokenizers at initialization time. Customizing the word tokenizer The default word tokenizer is an instance of nltk.tokenize.WhitespaceTokenizer. If you want to use a different tokenizer, you can pass that in as word_tokenizer. >>> from nltk.tokenize import SpaceTokenizer >>> reader = TaggedCorpusReader('.', r'.*.pos', word_ tokenizer=SpaceTokenizer()) >>> reader.words() ['The', 'expense', 'and', 'time', 'involved', 'are', ...] Customizing the sentence tokenizer The default sentence tokenizer is an instance of nltk.tokenize.RegexpTokenize with 'n' to identify the gaps. It assumes that each sentence is on a line all by itself, and individual sentences do not have line breaks. To customize this, you can pass in your own tokenizer as sent_tokenizer. >>> from nltk.tokenize import LineTokenizer >>> reader = TaggedCorpusReader('.', r'.*.pos', sent_ tokenizer=LineTokenizer()) >>> reader.sents() [['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']] Customizing the paragraph block reader Paragraphs are assumed to be split by blank lines. This is done with the default para_ block_reader, which is nltk.corpus.reader.util.read_blankline_block. There are a number of other block reader functions in nltk.corpus.reader.util, whose purpose is to read blocks of text from a stream. Their usage will be covered in more detail in the later recipe, Creating a custom corpus view, where we'll create a custom corpus reader. Customizing the tag separator If you don't want to use '/' as the word/tag separator, you can pass an alternative string to TaggedCorpusReader for sep. The default is sep='/', but if you want to split words and tags with '|', such as 'word|tag', then you should pass in sep='|'. Simplifying tags with a tag mapping function If you'd like to somehow transform the part-of-speech tags, you can pass in a tag_mapping_ function at initialization, then call one of the tagged_* functions with simplify_ tags=True. Here's an example where we lowercase each tag: >>> reader = TaggedCorpusReader('.', r'.*.pos', tag_mapping_ function=lambda t: t.lower()) >>> reader.tagged_words(simplify_tags=True) [('The', 'at-tl'), ('expense', 'nn'), ('and', 'cc'), …] Calling tagged_words() without simplify_tags=True would produce the same result as if you did not pass in a tag_mapping_function. There are also a number of tag simplification functions defined in nltk.tag.simplify. These can be useful for reducing the number of different part-of-speech tags. >>> from nltk.tag import simplify >>> reader = TaggedCorpusReader('.', r'.*.pos', tag_mapping_ function=simplify.simplify_brown_tag) >>> reader.tagged_words(simplify_tags=True) [('The', 'DET'), ('expense', 'N'), ('and', 'CNJ'), ...] >>> reader = TaggedCorpusReader('.', r'.*.pos', tag_mapping_ function=simplify.simplify_tag) >>> reader.tagged_words(simplify_tags=True) [('The', 'A'), ('expense', 'N'), ('and', 'C'), ...]
Read more
  • 0
  • 0
  • 21228

article-image-python-text-processing-nltk-storing-frequency-distributions-redis
Packt
09 Nov 2010
9 min read
Save for later

Python Text Processing with NLTK: Storing Frequency Distributions in Redis

Packt
09 Nov 2010
9 min read
Storing a frequency distribution in Redis` Redis is a data structure server that is one of the more popular NoSQL databases. Among other things, it provides a network accessible database for storing dictionaries (also known as hash maps). Building a FreqDist interface to a Redis hash map will allow us to create a persistent FreqDist that is accessible to multiple local and remote processes at the same time. Most Redis operations are atomic, so it's even possible to have multiple processes write to the FreqDist concurrently. Getting ready For this and subsequent recipes, we need to install both Redis and redis-py. A quick start install guide for Redis is available at http://code.google.com/p/redis/wiki/ QuickStart. To use hash maps, you should install at least version 2.0.0 (the latest version as of this writing). The Redis Python driver redis-py can be installed using pip install redis or easy_ install redis. Ensure you install at least version 2.0.0 to use hash maps. The redispy homepage is at http://github.com/andymccurdy/redis-py/. Once both are installed and a redis-server process is running, you're ready to go. Let's assume redis-server is running on localhost on port 6379 (the default host and port). How to do it... The FreqDist class extends the built-in dict class, which makes a FreqDist an enhanced dictionary. The FreqDist class provides two additional key methods: inc() and N(). The inc() method takes a single sample argument for the key, along with an optional count keyword argument that defaults to 1, and increments the value at sample by count. N() returns the number of sample outcomes, which is the sum of all the values in the frequency distribution. We can create an API-compatible class on top of Redis by extending a RedisHashMap (that will be explained in the next section), then implementing the inc() and N() methods. Since the FreqDist only stores integers, we also override a few other methods to ensure values are always integers. This RedisHashFreqDist (defined in redisprob.py) uses the hincrby command for the inc() method to increment the sample value by count, and sums all the values in the hash map for the N() method. from rediscollections import RedisHashMap class RedisHashFreqDist(RedisHashMap): def inc(self, sample, count=1): self._r.hincrby(self._name, sample, count) def N(self): return int(sum(self.values())) def __getitem__(self, key): return int(RedisHashMap.__getitem__(self, key) or 0) def values(self): return [int(v) for v in RedisHashMap.values(self)] def items(self): return [(k, int(v)) for (k, v) in RedisHashMap.items(self)] We can use this class just like a FreqDist. To instantiate it, we must pass a Redis connection and the name of our hash map. The name should be a unique reference to this particular FreqDist so that it doesn't clash with any other keys in Redis. >>> from redis import Redis >>> from redisprob import RedisHashFreqDist >>> r = Redis() >>> rhfd = RedisHashFreqDist(r, 'test') >>> len(rhfd) 0 >>> rhfd.inc('foo') >>> rhfd['foo'] 1 >>> rhfd.items() >>> len(rhfd) 1 The name of the hash map and the sample keys will be encoded to replace whitespace and & characters with _. This is because the Redis protocol uses these characters for communication. It's best if the name and keys don't include whitespace to begin with. How it works... Most of the work is done in the RedisHashMap class, found in rediscollections.py, which extends collections.MutableMapping, then overrides all methods that require Redis-specific commands. Here's an outline of each method that uses a specific Redis command: __len__(): Uses the hlen command to get the number of elements in the hash map __contains__(): Uses the hexists command to check if an element exists in the hash map __getitem__(): Uses the hget command to get a value from the hash map __setitem__(): Uses the hset command to set a value in the hash map __delitem__(): Uses the hdel command to remove a value from the hash map keys(): Uses the hkeys command to get all the keys in the hash map values(): Uses the hvals command to get all the values in the hash map items(): Uses the hgetall command to get a dictionary containing all the keys and values in the hash map clear(): Uses the delete command to remove the entire hash map from Redis Extending collections.MutableMapping provides a number of other dict compatible methods based on the previous methods, such as update() and setdefault(), so we don't have to implement them ourselves. The initialization used for the RedisHashFreqDist is actually implemented here, and requires a Redis connection and a name for the hash map. The connection and name are both stored internally to use with all the subsequent commands. As mentioned before, whitespace is replaced by underscore in the name and all keys, for compatibility with the Redis network protocol. import collections, re white = r'[s&]+' def encode_key(key): return re.sub(white, '_', key.strip()) class RedisHashMap(collections.MutableMapping): def __init__(self, r, name): self._r = r self._name = encode_key(name) def __iter__(self): return iter(self.items()) def __len__(self): return self._r.hlen(self._name) def __contains__(self, key): return self._r.hexists(self._name, encode_key(key)) def __getitem__(self, key): return self._r.hget(self._name, encode_key(key)) def __setitem__(self, key, val): self._r.hset(self._name, encode_key(key), val) def __delitem__(self, key): self._r.hdel(self._name, encode_key(key)) def keys(self): return self._r.hkeys(self._name) def values(self): return self._r.hvals(self._name) def items(self): return self._r.hgetall(self._name).items() def get(self, key, default=0): return self[key] or default def iteritems(self): return iter(self) def clear(self): self._r.delete(self._name) There's more... The RedisHashMap can be used by itself as a persistent key-value dictionary. However, while the hash map can support a large number of keys and arbitrary string values, its storage structure is more optimal for integer values and smaller numbers of keys. However, don't let that stop you from taking full advantage of Redis. It's very fast (for a network server) and does its best to efficiently encode whatever data you throw at it. While Redis is quite fast for a network database, it will be significantly slower than the in-memory FreqDist. There's no way around this, but while you sacrifice speed, you gain persistence and the ability to do concurrent processing. See also In the next recipe, we'll create a conditional frequency distribution based on the Redis frequency distribution created here. Storing a conditional frequency distribution in Redis The nltk.probability.ConditionalFreqDist class is a container for FreqDist instances, with one FreqDist per condition. It is used to count frequencies that are dependent on another condition, such as another word or a class label. Here, we'll create an API-compatible class on top of Redis using the RedisHashFreqDist from the previous recipe. Getting ready As in the previous recipe, you'll need to have Redis and redis-py installed with an instance of redis-server running. How to do it... We define a RedisConditionalHashFreqDist class in redisprob.py that extends nltk.probability.ConditionalFreqDist and overrides the __contains__() and __getitem__() methods. We then override __getitem__() so we can create an instance of RedisHashFreqDist instead of a FreqDist, and override __contains__() so we can call encode_key() from the rediscollections module before checking if the RedisHashFreqDist exists. from nltk.probability import ConditionalFreqDist from rediscollections import encode_key class RedisConditionalHashFreqDist(ConditionalFreqDist): def __init__(self, r, name, cond_samples=None): self._r = r self._name = name ConditionalFreqDist.__init__(self, cond_samples) # initialize self._fdists for all matching keys for key in self._r.keys(encode_key('%s:*' % name)): condition = key.split(':')[1] self[condition] # calls self.__getitem__(condition) def __contains__(self, condition): return encode_key(condition) in self._fdists def __getitem__(self, condition): if condition not in self._fdists: key = '%s:%s' % (self._name, condition) self._fdists[condition] = RedisHashFreqDist(self._r, key) return self._fdists[condition] def clear(self): for fdist in self._fdists.values(): fdist.clear() An instance of this class can be created by passing in a Redis connection and a base name. After that, it works just like a ConditionalFreqDist. >>> from redis import Redis >>> from redisprob import RedisConditionalHashFreqDist >>> r = Redis() >>> rchfd = RedisConditionalHashFreqDist(r, 'condhash') >>> rchfd.N() 0 >>> rchfd.conditions() [] >>> rchfd['cond1'].inc('foo') >>> rchfd.N() 1 >>> rchfd['cond1']['foo'] 1 >>> rchfd.conditions() ['cond1'] >>> rchfd.clear() How it works... The RedisConditionalHashFreqDist uses name prefixes to reference RedisHashFreqDist instances. The name passed in to the RedisConditionalHashFreqDist is a base name that is combined with each condition to create a unique name for each RedisHashFreqDist. For example, if the base name of the RedisConditionalHashFreqDist is 'condhash', and the condition is 'cond1', then the final name for the RedisHashFreqDist is 'condhash:cond1'. This naming pattern is used at initialization to find all the existing hash maps using the keys command. By searching for all keys matching 'condhash:*', we can identify all the existing conditions and create an instance of RedisHashFreqDist for each. Combining strings with colons is a common naming convention for Redis keys as a way to define namespaces. In our case, each RedisConditionalHashFreqDist instance defines a single namespace of hash maps. The ConditionalFreqDist class stores an internal dictionary at self._fdists that is a mapping of condition to FreqDist. The RedisConditionalHashFreqDist class still uses self._fdists, but the values are instances of RedisHashFreqDist instead of FreqDist. self._fdists is created when we call ConditionalFreqDist.__init__(), and values are initialized as necessary in the __getitem__() method. There's more... RedisConditionalHashFreqDist also defines a clear() method. This is a helper method that calls clear() on all the internal RedisHashFreqDist instances. The clear() method is not defined in ConditionalFreqDist. See also The previous recipe covers the RedisHashFreqDist in detail.
Read more
  • 0
  • 0
  • 9688
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at ₹800/month. Cancel anytime
article-image-using-execnet-parallel-and-distributed-processing-nltk
Packt
09 Nov 2010
8 min read
Save for later

Using Execnet for Parallel and Distributed Processing with NLTK

Packt
09 Nov 2010
8 min read
  Python Text Processing with NLTK 2.0 Cookbook Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities. Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond Learn how machines and crawlers interpret and process natural languages Easily work with huge amounts of data and learn how to handle distributed processing Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible     Introduction NLTK is great for in-memory single-processor natural language processing. However, there are times when you have a lot of data to process and want to take advantage of multiple CPUs, multi-core CPUs, and even multiple computers. Or perhaps you want to store frequencies and probabilities in a persistent, shared database so multiple processes can access it simultaneously. For the first case, we'll be using execnet to do parallel and distributed processing with NLTK. For the second case, you'll learn how to use the Redis data structure server/database to store frequency distributions and more. Distributed tagging with execnet Execnet is a distributed execution library for python. It allows you to create gateways and channels for remote code execution. A gateway is a connection from the calling process to a remote environment. The remote environment can be a local subprocess or an SSH connection to a remote node. A channel is created from a gateway and handles communication between the channel creator and the remote code. Since many NLTK processes require 100 percent CPU utilization during computation, execnet is an ideal way to distribute that computation for maximum resource usage. You can create one gateway per CPU core, and it doesn't matter whether the cores are in your local computer or spread across remote machines. In many situations, you only need to have the trained objects and data on a single machine, and can send the objects and data to the remote nodes as needed. Getting ready You'll need to install execnet for this to work. It should be as simple as sudo pip install execnet or sudo easy_install execnet. The current version of execnet, as of this writing, is 1.0.8. The execnet homepage, which has API documentation and examples, is at http://codespeak.net/execnet/. How to do it... We start by importing the required modules, as well as an additional module remote_tag. py that will be explained in the next section. We also need to import pickle so we can serialize the tagger. Execnet does not natively know how to deal with complex objects such as a part-of-speech tagger, so we must dump the tagger to a string using pickle.dumps(). We'll use the default tagger that's used by the nltk.tag.pos_tag() function, but you could load and dump any pre-trained part-of-speech tagger as long as it implements the TaggerI interface. Once we have a serialized tagger, we start execnet by making a gateway with execnet. makegateway(). The default gateway creates a Python subprocess, and we can call the remote_exec() method with the remote_tag module to create a channel. With an open channel, we send over the serialized tagger and then the first tokenized sentence of the treebank corpus. You don't have to do any special serialization of simple types such as lists and tuples, since execnet already knows how to handle serializing the built-in types. Now if we call channel.receive(), we get back a tagged sentence that is equivalent to the first tagged sentence in the treebank corpus, so we know the tagging worked. We end by exiting the gateway, which closes the channel and kills the subprocess. >>> import execnet, remote_tag, nltk.tag, nltk.data >>> from nltk.corpus import treebank >>> import cPickle as pickle >>> tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER)) >>> gw = execnet.makegateway() >>> channel = gw.remote_exec(remote_tag) >>> channel.send(tagger) >>> channel.send(treebank.sents()[0]) >>> tagged_sentence = channel.receive() >>> tagged_sentence == treebank.tagged_sents()[0] True >>> gw.exit() Visually, the communication process looks like this: How it works... The gateway's remote_exec() method takes a single argument that can be one of the following three types: A string of code to execute remotely. The name of a pure function that will be serialized and executed remotely. The name of a pure module whose source will be executed remotely. We use the third option with the remote_tag.py module, which is defined as follows: import cPickle as pickle if __name__ == '__channelexec__': tagger = pickle.loads(channel.receive()) for sentence in channel: channel.send(tagger.tag(sentence)) A pure module is a module that is self-contained. It can only access Python modules that are available where it executes, and does not have access to any variables or states that exist wherever the gateway is initially created. To detect that the module is being executed by execnet, you can look at the __name__ variable. If it's equal to '__channelexec__', then it is being used to create a remote channel. This is similar to doing if __name__ == '__ main__' to check if a module is being executed on the command line. The first thing we do is call channel.receive() to get the serialized tagger, which we load using pickle.loads(). You may notice that channel is not imported anywhere—that's because it is included in the global namespace of the module. Any module that execnet executes remotely has access to the channel variable in order to communicate with the channel` creator. Once we have the tagger, we iteratively tag() each tokenized sentence that we receive from the channel. This allows us to tag as many sentences as the sender wants to send, as iteration will not stop until the channel is closed. What we've essentially created is a compute node for part-of-speech tagging that dedicates 100 percent of its resources to tagging whatever sentences it receives. As long as the channel remains open, the node is available for processing. There's more... This is a simple example that opens a single gateway and channel. But execnet can do a lot more, such as opening multiple channels to increase parallel processing, as well as opening gateways to remote hosts over SSH to do distributed processing. Multiple channels We can create multiple channels, one per gateway, to make the processing more parallel. Each gateway creates a new subprocess (or remote interpreter if using an SSH gateway) and we use one channel per gateway for communication. Once we've created two channels, we can combine them using the MultiChannel class, which allows us to iterate over the channels, and make a receive queue to receive messages from each channel. After creating each channel and sending the tagger, we cycle through the channels to send an even number of sentences to each channel for tagging. Then we collect all the responses from the queue. A call to queue.get() will return a 2-tuple of (channel, message) in case you need to know which channel the message came from. If you don't want to wait forever, you can also pass a timeout keyword argument with the maximum number of seconds you want to wait, as in queue.get(timeout=4). This can be a good way to handle network errors. Once all the tagged sentences have been collected, we can exit the gateways. Here's the code: >>> import itertools >>> gw1 = execnet.makegateway() >>> gw2 = execnet.makegateway() >>> ch1 = gw1.remote_exec(remote_tag) >>> ch1.send(tagger) >>> ch2 = gw2.remote_exec(remote_tag) >>> ch2.send(tagger) >>> mch = execnet.MultiChannel([ch1, ch2]) >>> queue = mch.make_receive_queue() >>> channels = itertools.cycle(mch) >>> for sentence in treebank.sents()[:4]: ... channel = channels.next() ... channel.send(sentence) >>> tagged_sentences = [] >>> for i in range(4): ... channel, tagged_sentence = queue.get() ... tagged_sentences.append(tagged_sentence) >>> len(tagged_sentences) 4 >>> gw1.exit() >>> gw2.exit() Local versus remote gateways The default gateway spec is popen, which creates a Python subprocess on the local machine. This means execnet.makegateway() is equivalent to execnet. makegateway('popen'). If you have passwordless SSH access to a remote machine, then you can create a remote gateway using execnet.makegateway('ssh=remotehost') where remotehost should be the hostname of the machine. A SSH gateway spawns a new Python interpreter for executing the code remotely. As long as the code you're using for remote execution is pure, you only need a Python interpreter on the remote machine. Channels work exactly the same no matter what kind of gateway is used; the only difference will be communication time. This means you can mix and match local subprocesses with remote interpreters to distribute your computations across many machines in a network. There are many more details on gateways in the API documentation at http://codespeak.net/execnet/basics.html.
Read more
  • 0
  • 0
  • 3905

article-image-organizing-clarifying-and-communicating-r-data-analyses
Packt
29 Oct 2010
8 min read
Save for later

Organizing, Clarifying and Communicating the R Data Analyses

Packt
29 Oct 2010
8 min read
  Statistical Analysis with R Take control of your data and produce superior statistical analysis with R. An easy introduction for people who are new to R, with plenty of strong examples for you to work through This book will take you on a journey to learn R as the strategist for an ancient Chinese kingdom! A step by step guide to understand R, its benefits, and how to use it to maximize the impact of your data analysis A practical guide to conduct and communicate your data analysis with R in the most effective manner           Read more about this book       (For more resources on R, see here.) Retracing and refining a complete analysis For demonstration purposes, it will be assumed that a fire attack was chosen as the optimal battle strategy. Throughout this segment, we will retrace the steps that lead us to this decision. Meanwhile, we will make sure to organize and clarify our analyses so they can be easily communicated to others. Suppose we determined our fire attack will take place 225 miles away in Anding, which houses 10,000 Wei soldiers. We will deploy 2,500 soldiers for a period of 7 days and assume that they are able to successfully execute the plans. Let us return to the beginning to develop this strategy with R in a clear and concise manner. Time for action – first steps To begin our analysis, we must first launch R and set our working directory: Launch R. The R console will be displayed. Set your R working directory using the setwd(dir) function. The following code is a hypothetical example. Your working directory should be a relevant location on your own computer. > #set the R working directory using setwd(dir)> setwd("/Users/johnmquick/rBeginnersGuide/") Verify that your working directory has been set to the proper location using the getwd() command : > #verify the location of your working directory> getwd()[1] "/Users/johnmquick/rBeginnersGuide/" What just happened? We prepared R to begin our analysis by launching the soft ware and setting our working directory. At this point, you should be very comfortable completing these steps. Time for action – data setup Next, we need to import our battle data into R and isolate the portion pertaining to past fire attacks: Copy the battleHistory.csv file into your R working directory. This file contains data from 120 previous battles between the Shu and Wei forces. Read the contents of battleHistory.csv into an R variable named battleHistory using the read.table(...) command: > #read the contents of battleHistory.csv into an R variable> #battleHistory contains data from 120 previous battlesbetween the Shu and Wei forces> battleHistory <- read.table("battleHistory.csv", TRUE, ",") Create a subset using the subset(data, ...) function and save it to a new variable named subsetFire: > #use the subset(data, ...) function to create a subset ofthe battleHistory dataset that contains data only from battlesin which the fire attack strategy was employed> subsetFire <- subset(battleHistory, battleHistory$Method =="fire") Verify the contents of the new subset. Note that the console should return 30 rows, all of which contain fire in the Method column: > #display the fire attack data subset> subsetFire What just happened? We imported our dataset and then created a subset containing our fire attack data. However, we used a slightly different function, called read.table(...), to import our external data into R. read.table(...) U p to this point, we have always used the read.csv() function to import data into R. However, you should know that there are oft en many ways to accomplish the same objectives in R. For instance, read.table(...) is a generic data import function that can handle a variety of file types. While it accepts several arguments, the following three are required to properly import a CSV file, like the one containing our battle history data: file: t he name of the file to be imported, along with its extension, in quotes header: whether or not the file contains column headings; TRUE for yes, FALSE (default) for no sep: t he character used to separate values in the file, in quotes Using these arguments, we were able to import the data in our battleHistory.csv into R. Since our file contained headings, we used a value of TRUE for the header argument and because it is a comma-separated values file, we used "," for our sep argument: > battleHistory <- read.table("battleHistory.csv", TRUE, ",") This is just one example of how a different technique can be used to achieve a similar outcome in R. We will continue to explore new methods in our upcoming activities. Pop quiz Suppose you wanted to import the following dataset, named newData into R. Which of the following read.table(...) functions would be best to use? 4,55,96,12 read.table("newData", FALSE, ",") read.table("newData", TRUE, ",") read.table("newData.csv", FALSE, ",") read.table("newData.csv", TRUE, ",") Time for action – data exploration To begin our analysis, we will examine the summary statistics and correlations of our data. These will give us an overview of the data and inform our subsequent analyses: Generate a summary of the fire attack subset using summary(object): > #generate a summary of the fire subset> summaryFire <- summary(subsetFire)> #display the summary> summaryFire Before calculating correlations, we will have to convert our nonnumeric data from the Method, SuccessfullyExecuted, and Result columns into numeric form. Re code the Method column using as.numeric(data): > #represent categorical data numerically usingas.numeric(data)> #recode the Method column into Fire = 1> numericMethodFire <- as.numeric(subsetFire$Method) - 1 Recode the SuccessfullyExecuted column using as.numeric(data): > #recode the SuccessfullyExecuted column into N = 0 and Y = 1> numericExecutionFire <-as.numeric(subsetFire$SuccessfullyExecuted) - 1 Recode the Result column using as.numeric(data): > #recode the Result column into Defeat = 0 and Victory = 1> numericResultFire <- as.numeric(subsetFire$Result) - 1 With the Method, SuccessfullyExecuted, and Result columns coded into numeric form, let us now add them back into our fire dataset. Save the data in our recoded variables back into the original dataset: > #save the data in the numeric Method, SuccessfullyExecuted,and Result columns back into the fire attack dataset> subsetFire$Method <- numericMethodFire> subsetFire$SuccessfullyExecuted <- numericExecutionFire> subsetFire$Result <- numericResultFire Display the numeric version of the fire attack subset. Notice that all of the columns now contain numeric data; it will look like the following: Having replaced our original text values in the SuccessfullyExecuted and Result columns with numeric data, we can now calculate all of the correlations in the dataset using the cor(data) function: > #use cor(data) to calculate all of the correlations in thefire attack dataset> cor(subsetFire) Note that the error message and NA values in our correlation output result from the fact that our Method column contains only a single value. This is irrelevant to our analysis and can be ignored. What just happened? Initially, we calculated summary statistics for our fire attack dataset using the summary(object) function. From this information, we can derive the following useful insights about our past battles: The rating of the Shu army's performance in fire attacks has ranged from 10 to 100, with a mean of 45 Fire attack plans have been successfully executed 10 out of 30 times (33%) Fire attacks have resulted in victory 8 out of 30 times (27%) Successfully executed fire attacks have resulted in victory 8 out of 10 times (80%), while unsuccessful attacks have never resulted in victory The number of Shu soldiers engaged in fire attacks has ranged from 100 to 10,000 with a mean of 2,052 The number of Wei soldiers engaged in fire attacks has ranged from 1,500 to 50,000 with a mean of 12,333 The duration of fire attacks has ranged from 1 to 14 days with a mean of 7 Next, we recoded the text values in our dataset's Method, SuccessfullyExecuted, and Result columns into numeric form. Aft er adding the data from these variables back into our our original dataset, we were able to calculate all of its correlations. This allowed us to learn even more about our past battle data: The performance rating of a fire attack has been highly correlated with successful execution of the battle plans (0.92) and the battle's result (0.90), but not strongly correlated with the other variables. The execution of a fire attack has been moderately negatively correlated with the duration of the attack, such that a longer attack leads to a lesser chance of success (-0.46). The numbers of Shu and Wei soldiers engaged are highly correlated with each other (0.74), but not strongly correlated with the other variables. The insights gleaned from our summary statistics and correlations put us in a prime position to begin developing our regression model. Pop quiz Which of the following is a benefit of adding a text variable back into its original dataset aft er it has been recoded into numeric form? Calculation functions can be executed on the recoded variable. Calculation functions can be executed on the other variables in the dataset. Calculation functions can be executed on the entire dataset. There is no benefit.
Read more
  • 0
  • 0
  • 3229

article-image-graphical-capabilities-r
Packt
29 Oct 2010
8 min read
Save for later

Graphical Capabilities of R

Packt
29 Oct 2010
8 min read
Statistical Analysis with R Take control of your data and produce superior statistical analysis with R. An easy introduction for people who are new to R, with plenty of strong examples for you to work through This book will take you on a journey to learn R as the strategist for an ancient Chinese kingdom! A step by step guide to understand R, its benefits, and how to use it to maximize the impact of your data analysis A practical guide to conduct and communicate your data analysis with R in the most effective manner           Read more about this book       (For more resources on R, see here.) Time for action — creating a line chart The ever popular line chart, or line graph, depicts relationships as continuous series of connected data points. Line charts are particularly useful for visualizing specific values and trends over time. Just as a line chart is an extension of a scatterplot in the non-digital realm, a line chart is created using an extended form of the plot(...) function in R. Let us explore how to extend the plot(...) function to create line charts in R: Use the type argument within the plot(...) function to create a line chart that depicts a single relationship between two variables: > #create a line chart that depicts the durations of past fire attacks> #get the data to be used in the chart> lineFireDurationDataX <- c(1:30)> lineFireDurationDataY <- subsetFire$DurationInDays> #customize the chart> lineFireDurationMain <- "Duration of Past Fire Attacks"> lineFireDurationLabX <- "Battle Number"> lineFireDurationLabY <- "Duration in Days"> #use the type argument to connect the data points with a line> lineFireDurationType <- "o"> #use plot(...) to create and display the line chart> plot(x = lineFireDurationDataX, y = lineFireDurationDataY,main = lineFireDurationMain, xlab = lineFireDurationLabX,ylab = lineFireDurationLabY, type = lineFireDurationType) Your chart will be displayed in the graphic window, as follows: What just happened? We expanded our use of the plot(...) function to generate a line chart and encountered a new data notation in the process. Let us review these features. type In the plot(...) function, the type argument determines what kind of line, if any, should be used to connect a chart's data points. The type argument receives one of several character values, all of which are listed as follows: p: only points are plotted; this is the default value when type is undefined l: only lines are drawn, without any points o: both lines and points are drawn, with the lines overlapping the points b: both lines and points are drawn, with the lines broken where they intersect with points c: only lines are drawn, but they are broken where points would occur s: only the lines are drawn in step formation; the initial step begins at zero S: (uppercase) only the lines are drawn in step formation; the final step tails off at the last point h: vertical lines are drawn to represent each point n: no points nor lines are drawn Our chart, which represented the duration of past fire attacks, featured a line that overlapped the plotted points. First, we defined our desired line type in an R variable: > lineFireDurationType <- "o" Then the type argument was placed within our plot(...) function to generate the line chart: > plot(lineFireDurationDataX, lineFireDurationDataY,main = lineFireDurationMain, xlab = lineFireDurationLabX,ylab = lineFireDurationLabY,type = lineFireDurationType) Number-colon-number notation You may have noticed that we specified a vector for the x-axis data in our plot(...) function. > lineFireDurationDataX <- c(1:30) This vector used number-colon-number notation. Essentially, this notation has the effect of enumerating a range of values that lie between the number that precedes the colon and the number that follows it. To do so, it adds one to the beginning value until it reaches a final value that is equal to or less than the number that comes after the colon. For example, the code > 14:21 would yield eight whole numbers, beginning with 14 and ending with 21, as follows: [1] 14 15 16 17 18 19 20 21 Furthermore, the code > 14.2:21 would yield seven values, beginning with 14.2 and ending with 20.2, as follows: [1] 14.2 15.2 16.2 17.2 18.2 19.2 20.2 Number-colon-number notation is a useful way to enumerate a series of values without having to type each one individually. It can be used in any circumstance where a series of values is acceptable input into an R function. Number-colon-number notation can also enumerate values from high to low. For instance, 21:14 would yield a list of values beginning with 21 and ending with 14. Since we do not have exact dates or other identifying information for our 30 past battles, we simply enumerated the numbers 1 through 30 on the x-axis. This had the effect of assigning a generic identification number to each of our past battles, which in turn allowed us to plot the duration of each battle on the y axis. Pop quiz Which of the following is the type argument capable of? Drawing a line to connect or replace the points on a scatterplot. Drawing vertical or step lines. Drawing no points or lines. All of the above. What would the following line of code yield in the R console? > 1:50 A sequence of 50 whole numbers, in order from 1 to 50. A sequence of 50 whole numbers, in order from 50 to 1. A sequence of 50 random numbers, in order from 1 to 50. A sequence of 50 random numbers, in order from 50 to 1. Time for action — creating a box plot A useful way to convey a collection of summary statistics in a dataset is through the use of a box plot. This type of graph depicts a dataset's minimum and maximum, as well as its lower, median, and upper quartiles in a single diagram. Let us look at how box plots are created in R: Use the boxplot(...) function to create a box plot. > #create a box plot that depicts the number of soldiers required to launch a fire attack> #get the data to be used in the plot> boxplotFireShuSoldiersData <- subsetFire$ShuSoldiers> #customize the plot> boxPlotFireShuSoldiersLabelMain <- "Number of Soldiers Required to Launch a Fire Attack"> boxPlotFireShuSoldiersLabelX <- "Fire Attack Method"> boxPlotFireShuSoldiersLabelY <- "Number of Soldiers"> #use boxplot(...) to create and display the box plot> boxplot(x = boxplotFireShuSoldiersData,main = boxPlotFireShuSoldiersLabelMain,xlab = boxPlotFireShuSoldiersLabelX,ylab = boxPlotFireShuSoldiersLabelY) Your plot will be displayed in the graphic window, as shown in the following: Use the boxplot(...) function to create a box plot that compares multiple datasets. > #create a box plot that compares the number of soldiers required across the battle methods> #get the data formula to be used in the plot> boxplotAllMethodsShuSoldiersData <- battleHistory$ShuSoldiers ~ battleHistory$Method> #customize the plot> boxPlotAllMethodsShuSoldiersLabelMain <- "Number of Soldiers Required by Battle Method"> boxPlotAllMethodsShuSoldiersLabelX <- "Battle Method"> boxPlotAllMethodsShuSoldiersLabelY <- "Number of Soldiers"> #use boxplot(...) to create and display the box plot> boxplot(formula = boxplotAllMethodsShuSoldiersData,main = boxPlotAllMethodsShuSoldiersLabelMain,xlab = boxPlotAllMethodsShuSoldiersLabelX,ylab = boxPlotAllMethodsShuSoldiersLabelY) Your plot will be displayed in the graphic window, as shown in the following: What just happened? We just created two box plots using R's boxplot(...) function, one with a single box and one with multiple boxes. boxplot(...) We started by generating a single box plot that was composed of a dataset, main title, and x and y labels. The basic format for a single box plot is as follows: boxplot(x = dataset) The x argument contains the data to be plotted. Technically, only x is required to create a box plot, although you will often include additional arguments. Our boxplot(...) function used the main, xlab, and ylab arguments to display text on the plot, as shown: > boxplot(x = boxplotFireShuSoldiersData,main = boxPlotFireShuSoldiersLabelMain,xlab = boxPlotFireShuSoldiersLabelX,ylab = boxPlotFireShuSoldiersLabelY) Next, we created a multiple box plot that compared the number of Shu soldiers deployed by each battle method. The main, xlab, and ylab arguments remained from our single box plot, however our multiple box plot used the formula argument instead of x. Here, a formula allows us to break a dataset down into separate groups, thus yielding multiple boxes. The basic format for a multiple box plot is as follows: boxplot(formula = dataset ~ group) In our case, we took our entire Shu soldier dataset (battleHistory$ShuSoldiers) and separated it by battle method (battleHistory$Method): > boxplotAllMethodsShuSoldiersData <- battleHistory$ShuSoldiers ~ battleHistory$Method Once incorporated into the boxplot(...) function, this formula resulted in a plot that contained four distinct boxes—ambush, fire, head to head, and surround: > boxplot(formula = boxplotAllMethodsShuSoldiersData,main = boxPlotAllMethodsShuSoldiersLabelMain,xlab = boxPlotAllMethodsShuSoldiersLabelX,ylab = boxPlotAllMethodsShuSoldiersLabelY) Pop quiz Which of the following best describes the result of the following code? > boxplot(x = a) A single box plot of the a dataset. A single box plot of the x dataset. A multiple box plot of the a dataset that is grouped by x. A multiple box plot of the x dataset that is grouped by a. Which of the following best describes the result of the following code? > boxplot(formula = a ~ b) A single box plot of the a dataset. A single box plot of the b dataset. A multiple box plot of the a dataset that is grouped by b. A multiple box plot of the b dataset that is grouped by a.
Read more
  • 0
  • 0
  • 3325

article-image-customizing-graphics-and-creating-bar-chart-and-scatterplot-r
Packt
28 Oct 2010
4 min read
Save for later

Customizing Graphics and Creating a Bar Chart and Scatterplot in R

Packt
28 Oct 2010
4 min read
Statistical Analysis with R Take control of your data and produce superior statistical analysis with R. An easy introduction for people who are new to R, with plenty of strong examples for you to work through This book will take you on a journey to learn R as the strategist for an ancient Chinese kingdom! A step by step guide to understand R, its benefits, and how to use it to maximize the impact of your data analysis A practical guide to conduct and communicate your data analysis with R in the most effective manner       Charts, graphs, and plots in R R features several options for creating charts, graphs, and plots. In this article, we will explore the generation and customization of these visuals, as well as methods for saving and exporting them for use outside of R. The following visuals will be covered in this article: Bar graphs Scatterplots Line charts Box plots Histograms Pie charts Time for action — creating a bar chart A bar chart or bar graph is a common visual that uses rectangles to depict the values of different items. Bar graphs are especially useful when comparing data over time or between diverse groups. Let us create a bar chart in R: Open R and set your working directory: > #set the R working directory > #replace the sample location with one that is relevant to you > setwd("/Users/johnmquick/rBeginnersGuide/") Use the barplot(...) function to create a bar chart: > #create a bar chart that compares the mean durations of the battle methods > #calculate the mean duration of each battle method > meanDurationFire <- mean(subsetFire$DurationInDays) > meanDurationAmbush <- mean(subsetAmbush$DurationInDays) > meanDurationHeadToHead <- mean(subsetHeadToHead$DurationInDays) > meanDurationSurround <- mean(subsetSurround$DurationInDays) > #use a vector to define the chart's bar values > barAllMethodsDurationBars <- c(meanDurationFire, meanDurationAmbush, meanDurationHeadToHead, meanDurationSurround) > #use barplot(...) to create and display the bar chart > barplot(height = barAllMethodsDurationBars) Your chart will be displayed in the graphic window, similar to the following: What just happened? You created your first graphic in R. Let us examine the barplot(...) function that we used to generate our bar chart, along with the new R components that we encountered. barplot(...) We created a bar chart that compared the mean durations of battles between the different combat methods. As it turns out, there is only one required argument in the barplot(...) function. This height argument receives a series of values that specify the length of each bar. Therefore, the barplot(...) function, at its simplest, takes on the following form: barplot(height = heightValues) Accordingly, our bar chart function reflected this same format: > barplot(height = barAllMethodsDurationBars) Vectors We stored the heights of our chart's bars in a vector variable. In R, a vector is a series of data. R's c(...) function can be used to create a vector from one or more data points. For example, the numbers 1, 2, 3, 4, and 5 can be arranged into a vector like so: > #arrange the numbers 1, 2, 3, 4, and 5 into a vector > numberVector <- c(1, 2, 3, 4, 5) Similarly, text data can also be placed into vector form, so long as the values are contained within quotation marks: > #arrange the letters a, b, c, d, and e into a vector > textVector <- c("a", "b", "c", "d", "e") Our vector defined the values for our bars: > #use a vector to define the chart's bar values > barAllMethodsDurationBars <- c(meanDurationFire, meanDurationAmbush, meanDurationHeadToHead, meanDurationSurround) Many function arguments in R require vector input. Hence, it is very common to use and encounter the c(...) function when working in R. Graphic window When you executed your barplot(...) function in the R console, the graphic window opened to display it. The graphic window will have different names across different operating systems, but its purpose and function remain the same. For example, in Mac OS X, the graphic window is named Quartz. For the remainder of this article, all R graphics will be displayed without the graphics window frame, which will allow us to focus on the visuals themselves. Pop quiz When entering text into a vector using the c(...) function, what characters must surround each text value? Quotation marks Parenthesis Asterisks Percent signs   What is the purpose of the R graphic window? To debug graphics functions To execute graphics functions To edit graphics To display graphics
Read more
  • 0
  • 0
  • 17320
article-image-postgresql-9-balancing-hardware-spending
Packt
27 Oct 2010
10 min read
Save for later

PostgreSQL 9: Balancing Hardware Spending

Packt
27 Oct 2010
10 min read
One of the reasons that working with open-source databases such as PostgreSQL can be so effective is that every dollar you save on software licensing can be put toward better hardware instead. The three main components you'll need to balance in your budget are CPUs, memory, and disks, with the disk controller as a related and critical part too. CPUs Currently, available processors are bundling at least two and possibly as many as eight cores into each CPU, making the core count the figure of merit for most database applications. There are two basic decisions you need to make while deciding which CPU solution would best match your database application: Which processor family? Nowadays, this normally boils down to choosing among the various 64 bit product lines from Intel or AMD, although there are some other less popular choices still floating around (Itanium, SPARC, and so on). Do you get more cores or faster cores? These choices are sometimes more tied together than you might think. Currently, Intel has a lead in delivering individual processor cores that are the fastest around, often due to faster transfers between the processor and system RAM. But the processors and related parts are more expensive too. AMD still is competitive at providing more cores per dollar, and their server class designs normally do a good job making the best of the memory available to each core. But if what you want is many more affordable cores instead, that's where AMD is stronger. AMD also has a better history of making its fastest processors available in configurations with many sockets, when you want to put more than two physical CPUs into a single server. The best way to figure out which class of database app you have—more cores or faster cores—is to monitor an existing server using tools such as top. If there's a small number of processes running using a single CPU each, that's the sort of workload where faster cores are better. That tends to happen if you have giant queries running in a batch fashion, for example when large quantities of data need to be sorted to deliver any single report. But if all the CPUs are active with many more concurrent processes instead, then you'd likely benefit better from more cores. That's normally what you'll see in applications with a larger user count, such as databases backing web applications. If you don't have the benefit of a working system to inspect, you can try to guess which type of situation you're in by noting the limitations of the database. PostgreSQL does not allow splitting a single query across more than one core, what's called parallel query by some other databases that support it. That means that if you have any one query or small number of queries that must run as fast as possible, the only way to do that is to prioritize getting faster cores. Another situation where getting a faster core is the better choice is if you need to prioritize data loading or export situations. PostgreSQL's best performing data import method, COPY, can easily become (but isn't always) limited by CPU performance, where that turns into the bottleneck for operations. While it's possible to split input files into pieces and load them in parallel, that's something you'll need to build or acquire yourself, rather than something the server knows how to do for you. Exporting a copy of the database using the pg_dump utility is another example of something that can become CPU limited on some systems. Memory How much to prioritize memory for your application really depends on the size of the working set of data needed to handle the most common operations. Generally, adding more RAM will provide a significant performance boost. There are a few situations where you'd be better served doing something else instead: If your data set is small enough to fit into a smaller amount of RAM, adding more won't help you much. You probably want faster processors instead. When running applications that scan tables much larger than what you can feasibly purchase as RAM, such as in many data warehouse situations, you might be better served by getting faster disks rather than more memory. The normal situation where more memory helps most is when the data you access frequently will fit with the larger amount, but not with the smaller. This happens more often than you might think because of the way database B-tree indexes are stored. Even if you can't fit the entirety of a table or even its index in memory, being able to store a good sized fraction of the index can mean that index-based data lookups will be significantly sped up. Having the most "popular" blocks from the top of the tree structure cached is helpful even if you can't fit all the leaves into memory too. Once you have an application running, you can usually get a much better idea how memory is being used by looking inside the PostgreSQL buffer cache (and potentially inside the operating system one as well) and seeing what data it prefers to keep around. Disks While it's always possible to run into situations where the CPU in your database server is its bottleneck, it's downright likely that you'll run into a disk bottleneck— particularly if you only have a drive or two in the system. A few years ago, the basic two choices in hard drives were the inexpensive ATA (also known as IDE) drives used in desktops versus the more serious SCSI drives aimed at servers. Both technologies have marched forward, and the current choice you're most likely to run into when configuring a database server is whether to use Serial ATA (SATA) or Serial Attached SCSI (SAS). It's possible to find nearly identical drives available in both interfaces, and there are even drive controllers that allow attaching either kind of drive. Combined with a narrowing performance difference between the two, picking between them is harder than ever. The broad parameters of each technology are straightforward to compare. Here's the state of things as this is being written: SAS disks: The maximum available RPM is higher: 10,000 or 15,000 Not as much drive capacity: 73 GB-1 TB are popular sizes Cost per MB is higher SATA disks: Drives typically have a slower RPM: 7200 is standard, some 10,000 designs exist such as the Western Digital VelociRaptor Higher drive capacity: 2 TB available Cost per MB is lower Generally, you'll find individual SAS disks to be faster even than SATA ones with similar specifications. In particular, you're likely to see better seek performance on random I/O due to faster drive mechanics in SAS, and sometimes a faster transfer rate from the disk too. Also, because the SAS drives have supported advanced features such as command queuing for longer, it's more likely your operating system will have matching support to take advantage of them. RAID The Redundant Array of Inexpensive Disks (RAID) approach is the standard way to handle both the performance and reliability limitations of individual disk drives. A RAID array puts many disks, typically of exactly the same configuration, into a set that acts like a single disk—but with either enhanced performance, reliability, or both. In some cases the extra reliability comes from computing what's called parity information for writes to the array. Parity is a form of checksum on the data, which allows reconstructing it even if some of the information is lost. RAID levels that use parity are efficient from a space perspective at writing data in a way that will survive drive failures, but the parity computation overhead can be significant for database applications. The most common basic forms of RAID arrays used are: RAID 0: It is also called as Striping. Multiple disks are used at the same time, spreading reads and writes over each of them in parallel. This can be almost a linear improvement (two disks reading twice as fast as a single one), but a failure on any volume in the set will lose all the data. RAID 1: It is also called as Mirroring. Here more copies of the same data are put onto multiple disks. This can sometimes improve performance—a good RAID 1 mirroring across two disks might handle two reads by sending one to each drive. Reads executed in parallel against both drives can effectively double average seeks per second. But generally, the reason for RAID 1 is redundancy: if a single drive fails, the system will continue operating using the other one. RAID 10 or 1+0: This first takes pairs of disks and mirrors then using RAID 1. Then, the resulting set is striped using RAID 0. The result provides both high performance and the ability to tolerate any single disk failure, without as many ways for speed to suffer in the average and worst case as RAID 5/6. RAID 10 is particularly appropriate for write-heavy environments, where the parity computation overhead of RAID 5/6 can cause disk performance to suffer. Accordingly, it's the preferred RAID level for high-performance database systems. RAID 5: It is also called as Striped with Parity. This approach sits midway between 0 and 1. You stripe data across multiple drives similarly to RAID 0, which improves read performance. But some redundant data is added to a parity drive. If one of the disks in the array is lost, the missing data can be recomputed from the ones left using that parity information. While this is efficient in terms of how little space is wasted relative to the tolerance for disk failures provided, write performance in particular can suffer in RAID 5. RAID 6: Similar to RAID 5, except with more parity information, enabling survival even with two disk failures. It has the same fundamental advantages and disadvantages. RAID 6 is an increasingly common way to cope with the fact that rebuilding a RAID 5 array after a disk loss can take a really long time on modern, high capacity drives. The array has no additional fault tolerance during that period, and seeing a second drive failure before that rebuild finishes is not that unlikely when it takes many hours of intense disk activity to rebuild. Disks manufactured in the same batch are surprisingly likely to fail in groups. To be fair in any disk performance comparison, you need to consider that most systems are going to have a net performance from several disks, such as in a RAID array. Since SATA disks are individually cheaper, you might be able to purchase considerably more of them for the same budget than had you picked SAS instead. If your believe your application will get faster if it is spread over more disks, being able to buy more of them per dollar spent can result in an overall faster system. Note that the upper limit here will often be your server's physical hardware. You only have so many storage bays or controllers ports available, and larger enclosures can cost more both up-front and over their lifetime. It's easy to find situations where smaller numbers of faster drives—which SAS provides—is the better way to go. This is why it's so important to constantly benchmark both hardware and your database application, to get a feel of how well it improves as the disk count increases.
Read more
  • 0
  • 0
  • 5604

article-image-postgresql-9-reliable-controller-and-disk-setup
Packt
27 Oct 2010
10 min read
Save for later

PostgreSQL 9: Reliable Controller and Disk Setup

Packt
27 Oct 2010
10 min read
Accelerate your PostgreSQL system Learn the right techniques to obtain optimal PostgreSQL database performance, from initial design to routine maintenance Discover the techniques used to scale successful database installations Avoid the common pitfalls that can slow your system down Filled with advice about what you should be doing; how to build experimental databases to explore performance topics, and then move what you've learned into a production database environment Covers versions 8.1 through 9.0 PostgreSQL uses a Write-Ahead Log (WAL) to write data in a way that survives a database or hardware crash. This is similar to the log buffer or REDO log found in other databases. The database documentation covers the motivation and implementation of the WAL at http://www.postgresql.org/docs/current/ static/wal.html To quote from that introduction: WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage. This procedure ensures that if your application has received a COMMIT for a transaction, that transaction is on permanent storage, and will not be lost even if there is a crash. This satisfies the durability portion of the ACID (atomicity, consistency, isolation, durability) expectations databases aim to satisfy. Write-back caches The CPUs and memory in your server are quite fast compared to its disk drives. Accordingly, making the rest of the system wait for the disks, particularly when things need to be written out, can drag overall performance down heavily. Systems that wait for the disks to complete their writes before moving into their next task are referred to as having a write-through cache. While the data may be stored temporarily in a memory cache, until it's made it all the way through to the physical disk, any write an application requested isn't considered complete. The normal solution to making that faster is to introduce a different type of write cache between the program doing the writing and disks. A write-back cache is one where data is copied into memory, and then control returns to the application that requested the write. Those writes are then handled asynchronously, at some future time dictated by the design of the write-back cache. It can take minutes before the data actually makes it to disk. When PostgreSQL writes information to the WAL, and sometimes when it writes to the regular database files too, that information must be "flushed to permanent storage" in order for the database's crash corruption defense mechanism to work. So what happens if you have a write-back cache that says the write is complete, but it really isn't? People call these lying drives, and the result can be very bad: If you have a system with a write-back cache, and a system crash causes the contents of that write-back cache to be lost, this can corrupt a PostgreSQL database stored on that drive and make it unusable. You can discover it takes expert intervention to even get the database to start again, and determining what data is damaged will be difficult. Consider the case where you have committed a transaction. Details of that new transaction might be spread across two data blocks on the drive. Now, imagine that one of those made it to disk before the system crashed, but the other didn't. You've now left the database in a corrupted state: one block refers to a transaction that doesn't exist where it's supposed to in the other block. Had at least all of the data blocks related to the WAL been written properly, the database WAL could correct this error after the crash. But the WAL protection only works if it can get honest information about whether information has been written to the disks properly or not, and the "lying" write-back caches do not report that. Sources of write-back caching Servers are filled with write caches you need to be aware of: Operating system write cache. This cache can easily be gigabytes in size. Typically you can flush data out of this cache by forcing a "sync" operation on the block that needs to be stored on disk. On POSIX systems (which includes all UNIX-like ones), this is done with the fsync or fdatasync calls. In some cases, it's possible to write directly in a sync mode, which is effectively a write followed by fsync. The postgresql.conf setting wal_sync_method controls which method is used, and it's possible to disable this altogether to optimize for speed instead of safety. Disk controller write cache. You'll find a write cache on most RAID controller cards, as well as inside external storage such as a SAN. Common sizes right now are 128 MB to 512 MB for cards, but gigabytes are common on a SAN. Typically controllers can be changed to operate in the completely writethrough mode, albeit slowly. But by default, you'll normally find them in write-back mode. Writes that can fit in the controller's cache are stored there, the operating system is told the write is completed, and the card writes the data out at some future time. To keep this write from being lost if power is interrupted, the card must be configured with a battery. That combination is referred to as a battery-backed write cache (BBC or BBWC). Disk drive write cache. All SATA and SAS disks have a write cache on them that on current hardware is 8 MB to 32 MB in size. This cache is always volatile: if power is lost, any data stored in there will be lost and they're always write-back caches if enabled. How can you make sure you're safe given all these write-back caches that might lose your data? There are a few basic precautions to take: Make sure whatever file system you're using properly implements fsync calls, or whatever similar mechanism is used, fully. Monitor your driver controller battery. Some controller cards will monitor their battery health, and automatically switch from write-back to writethough mode when there is no battery or it's not working properly. That's a helpful safety measure, but performance is going to drop hard when this happens. Disable any drive write caches. Most hardware RAID controllers will do this for you, preferring their own battery-backed caches instead. Disk controller monitoring When you have a RAID controller card with a battery-backed cache, you probably expect you'll need to monitor the card to determine when disks fail. But monitoring controller battery health is an equally critical aspect of maintaining a reliable database system when you're using this technology. If the battery fails and you're using it in write-back mode, your writes are not safe. Similarly, if your power fails, you should prefer shutting the database server down after a few minutes of power loss to trying to keep it going. Integrating in power monitoring via a UPS or similar mechanism should be part of your database server configuration, so that a short outage results in an orderly shutdown. Consider the purpose of the controller battery to protect yourself from really unexpected outages, like someone tripping over the power cord. Even if the manufacturer claims the controller battery will last through days of downtime, that's not a good reason to believe you'll actually be safe for an extended outage. You should consider the battery as something you'd prefer to only use for some number of minutes of protection. That may be the reality, particularly in a case where the battery is older and has lost much of its capacity, and some controller batteries don't start out with very much capacity. Be sure to run your own tests rather than blindly believing the manufacturer specifications: your data depends on it. Better RAID controllers will automatically disable write-back mode if their battery stops working normally. If performance suddenly drops on an older server, this is one potential cause. Also don't forget that every UPS has a battery that degrades over time as well. That's all the more reason to arrange an orderly shutdown of your server during a power outage, rather than optimistically presuming you can keep it running until power returns. Disabling drive write caches If your card doesn't disable all the drive write caches, or if you're using a software RAID approach, you'll need to turn the caches off yourself. The best way to do this is to see if it's possible to change the default write cache state using the utilities provided by the drive manufacturer. You should be able to do this through software as well. Here is a sample session from a Linux system checking the write cache, toggling it off, confirming that change took, and then toggling it on again: # hdparm -I /dev/sda | grep "Write cache" * Write cache # sudo hdparm -W 0 /dev/sda /dev/sda: setting drive write-caching to 0 (off) write-caching = 0 (off) # hdparm -I /dev/sda | grep "Write cache" Write cache # hdparm -W 1 /dev/sda /dev/sda: setting drive write-caching to 1 (on) write-caching = 1 (on) Only the -W 0 configuration is completely safe for database use. The PostgreSQL WAL documentation suggests similar commands to use for other operating systems. Performance impact of write-through caching If you don't have a battery-backed write cache, and therefore can't utilize some memory-based cache to accelerate fsync writes, commit performance on your database can be quite bad. The worst-case here is where you have a single client that is issuing a commit after every statement it executes. The reality of how a hard drive works means that individual writes happen once each time the drive spins around. Here are the measurements for the common drive speeds available right now, with computed maximum commit rate:     Rotation speed Rotation time (ms) Max commits/second 5400 11.1 90 7200 8.3 120 10000 6.0 166 15000 4.0 250 It's important to realize how limiting this can be: If you have a common 7200 rpm hard drive, no single client can commit more than 120 transactions/second in any situation where all that's available is a write-back cache. It doesn't matter how many disks you have in a RAID array, or how you configure your software. You must have hardware with a battery, enabling a non-volatile write-back cache, in order to safely exceed this limit. Some PostgreSQL installs use a RAID controller card just for this purpose, to provide a BBWC, in Just a Bunch of Disks (JBOD) mode—where no RAID is being done on the controller at all. Sometimes disks are used directly, and others layer software RAID on top, which can have some advantages compared to hardware RAID. If you have more than one client, you'll get more done per commit. It's normal to see >500 committed transactions per second if you have a larger number of clients all committing regularly, because each flushed disk write will include any queued up commit requests from other clients, too. The other common technique here is to batch commits into larger pieces, perhaps going 1000 records at a time rather than a single one, in order to reduce the average impact of commit latency. Another approach for accelerating systems that don't have a usable write cache is asynchronous commit. Summary In this article we saw how accidentally using volatile write-back caching in disk controllers and drives can easily introduce database corruption. Further resources on this subject: UNIX Monitoring Tool for PostgreSQL [Article] Server Configuration Tuning in PostgreSQL [Article] PostgreSQL 9: Balancing Hardware Spending [Article]
Read more
  • 0
  • 0
  • 2665

article-image-backup-postgresql-9
Packt
25 Oct 2010
11 min read
Save for later

Backup in PostgreSQL 9

Packt
25 Oct 2010
11 min read
Most people admit that backups are essential, though they also devote only a very small amount of time to thinking about the topic. The first recipe is about understanding and controlling crash recovery. We need to understand what happens if the database server crashes, so we can understand when we might need to recover. The next recipe is all about planning. That's really the best place to start before you go charging ahead to do backups. Understanding and controlling crash recovery Crash recovery is the PostgreSQL subsystem that saves us if the server should crash, or fail as a part of a system crash. It's good to understand a little about it and to do what we can to control it in our favor. How to do it... If PostgreSQL crashes there will be a message in the server log with severity-level PANIC. PostgreSQL will immediately restart and attempt to recover using the transaction log or Write Ahead Log (WAL). The WAL consists of a series of files written to the pg_xlog subdirectory of the PostgreSQL data directory. Each change made to the database is recorded first in WAL, hence the name "write-ahead" log. When a transaction commits, the default and safe behavior is to force the WAL records to disk. If PostgreSQL should crash, the WAL will be replayed, which returns the database to the point of the last committed transaction, and thus ensures the durability of any database changes. Note that the database changes themselves aren't written to disk at transaction commit. Those changes are written to disk sometime later by the "background writer" on a well-tuned server. Crash recovery replays the WAL, though from what point does it start to recover? Recovery starts from points in the WAL known as "checkpoints". The duration of crash recovery depends upon the number of changes in the transaction log since the last checkpoint. A checkpoint is a known safe starting point for recovery, since at that time we write all currently outstanding database changes to disk. A checkpoint can become a performance bottleneck on busy database servers because of the number of writes required. There are a number of ways of tuning that, though please also understand the effect on crash recovery that those tuning options may cause. Two parameters control the amount of WAL that can be written before the next checkpoint. The first is checkpoint_segments, which controls the number of 16 MB files that will be written before a checkpoint is triggered. The second is time-based, known as checkpoint_timeout, and is the number of seconds until the next checkpoint. A checkpoint is called whenever either of those two limits is reached. It's tempting to banish checkpoints as much as possible by setting the following parameters: checkpoint_segments = 1000 checkpoint_timeout = 3600 Though if you do you might give some thought to how long the recovery will be if you do and whether you want that. Also, you should make sure that the pg_xlog directory is mounted on disks with enough disk space for at least 3 x 16 MB x checkpoint_segments. Put another way, you need at least 32 GB of disk space for checkpoint_segments = 1000. If wal_keep_segments > 0 then the server can also use up to 16MB x (wal_keep_segments + checkpoint_segments). How it works... Recovery continues until the end of the transaction log. We are writing this continually, so there is no defined end point; it is literally the last correct record. Each WAL record is individually CRC checked, so we know whether a record is complete and valid before trying to process it. Each record contains a pointer to the previous record, so we can tell that the record forms a valid link in the chain of actions recorded in WAL. As a result of that, recovery always ends with some kind of error reading the next WAL record. That is normal. Recovery performance can be very fast, though it does depend upon the actions being recovered. The best way to test recovery performance is to setup a standby replication server. There's more... It's possible for a problem to be caused replaying the transaction log, and for the database server to fail to start. Some people's response to this is to use a utility named pg_resetxlog, which removes the current transaction log files and tidies up after that surgery has taken place. pg_resetxlog destroys data changes and that means data loss. If you do decide to run that utility, make sure you take a backup of the pg_xlog directory first. My advice is to seek immediate assistance rather than do this. You don't know for certain that doing this will fix a problem, though once you've done it, you will have difficulty going backwards. Planning backups This section is all about thinking ahead and planning. If you're reading this section before you take a backup, well done. The key thing to understand is that you should plan your recovery, not your backup. The type of backup you take influences the type of recovery that is possible, so you must give some thought to what you are trying to achieve beforehand. If you want to plan your recovery, then you need to consider the different types of failures that can occur. What type of recovery do you wish to perform? You need to consider the following main aspects: Full/Partial database? Everything or just object definitions only? Point In Time Recovery Restore performance We need to look at the characteristics of the utilities to understand what our backup and recovery options are. It's often beneficial to have multiple types of backup to cover the different types of failure possible. Your main backup options are logical backup—using pg_dump physical backup—file system backup pg_dump comes in two main flavors: pg_dump and pg_dumpall. pg_dump has a -F option to produce backups in various file formats. The file format is very important when it comes to restoring from backup, so you need to pay close attention to that. The following table shows the features available, depending upon the backup technique selected. Table of Backup/Recovery options: SQL dump to an archive file pg_dump -F cSQL dump to a script file pg_dump -F p or pg_dumpallFilesystem backup using pg_start_ backupBackup typeLogicalLogicalPhysicalRecover to point in time?NoNoYesBackup all databases?One at a timeYes (pg_dumpall)YesAll databases backed up at same time?NoNoYesSelective backup?YesYesNo (Note 3)Incremental backup?NoNoPossible (Note 4)Selective restore?YesPossible (Note 1)No (Note 5)DROP TABLE recoveryYes Yes Possible (Note 6) DROP TABLESPACE recovery Possible (Note 2)Possible (Note 6)Possible (Note 6)Compressed backup files?YesYesYesBackup is multiple files?NoNoYesParallel backup possible?NoNoYesParallel restore possible?YesNoYesRestore to later release?YesYesNoStandalone backup?YesYesYes (Note 7)Allows DDL during backupNoNoYes How to do it... If you've generated a script with pg_dump or pg_dumpall and need to restore just a single object, then you're going to need to go deep. You will need to write a Perl script (or similar) to read the file and extract out the parts you want. It's messy and time-consuming, but probably faster than restoring the whole thing to a second server, and then extracting just the parts you need with another pg_dump. See recipe Recovery of a dropped/damaged tablespace. Selective backup with physical backup is possible, though will cause later problems when you try to restore. Selective restore with physical backup isn't possible with currently supplied utilities. See recipe for Standalone hot physical backup How it works... To backup all databases, you may be told you need to use the pg_dumpall utility. I have four reasons why you shouldn't do that, which are as follows: If you use pg_dumpall, then the only output produced is into a script file. Script files can't use the parallel restore feature of pg_restore, so by taking your backup in this way you will be forcing the restore to be slower than it needs to be. pg_dumpall produces dumps of each database, one after another. This means that: pg_dumpall is slower than running multiple pg_dump tasks in parallel, one against each database. The dumps of individual databases are not consistent to a particular point in time. If you start the dump at 04:00 and it ends at 07:00 then we're not sure exactly when the dump relates to—sometime between 0400 and 07:00. Options for pg_dumpall are similar in many ways to pg_dump, though not all of them exist, so some things aren't possible. In summary, pg_dumpall is slower to backup, slow to restore, and gives you less control over the dump. I suggest you don't use it for those reasons. If you have multiple databases, then I suggest you take your backup by doing either. Dump global information for the database server using pg_dumpall -g. Then dump all databases in parallel using a separate pg_dump for each database, taking care to check for errors if they occur. Use the physical database backup technique instead. Hot logical backup of one database Logical backup makes a copy of the data in the database by dumping out the contents of each table. How to do it... The command to do this is simple and as follows: pg_dump -F c > dumpfile or pg_dump -F c –f dumpfile You can also do this through pgAdmin3 as shown in the following screenshot: How it works... pg_dump produces a single output file. The output file can use the split(1) command to separate the file into multiple pieces if required. pg_dump into the custom format is lightly compressed by default. Compression can be removed or made more aggressive. pg_dump runs by executing SQL statements against the database to unload data. When PostgreSQL runs an SQL statement we take a "snapshot" of currently running transactions, which freezes our viewpoint of the database. We can't (yet) share that snapshot across multiple sessions, so we cannot run an exactly consistent pg_dump in parallel in one database, nor across many databases. The time of the snapshot is the only time we can recover to—we can't recover to a time either before or after that time. Note that the snapshot time is the start of the backup, not the end. When pg_dump runs, it holds the very lowest kind of lock on the tables being dumped. Those are designed to prevent DDL from running against the tables while the dump takes place. If a dump is run at the point that other DDL are already running, then the dump will sit and wait. If you want to limit the waiting time you can do that by setting the –-lock-wait-timeout option. pg_dump allows you to make a selective backup of tables. The -t option also allows you to specify views and sequences. There's no way to dump other object types individually using pg_dump. You can use some supplied functions to extract individual snippets of information available at the following website: https://www.postgresql.org/docs/9.0/static/functions-info.html#FUNCTIONS-INFO-CATALOG-TABLE pg_dump works against earlier releases of PostgreSQL, so it can be used to migrate data between releases. pg_dump doesn't generally handle included modules very well. pg_dump isn't aware of additional tables that have been installed as part of an additional package, such as PostGIS or Slony, so it will dump those objects as well. That can cause difficulties if you then try to restore from the backup, as the additional tables may have been created as part of the software installation process in an empty server. There's more... What time was the pg_dump taken? The snapshot for a pg_dump is taken at the beginning of a run. The file modification time will tell you when the dump finished. The dump is consistent at the time of the snapshot, so you may want to know that time. If you are making a script dump, you can do a dump verbose as follows: pg_dump -v which then adds the time to the top of the script. Custom dumps store the start time as well and that can be accessed using the following: pg_restore --schema-only -v dumpfile | head | grep Started -- Started on 2010-06-03 09:05:46 BST See also Note that pg_dump does not dump the roles (such as users/groups) and tablespaces. Those two things are only dumped by pg_dumpall; see the next recipes for more detailed descriptions.
Read more
  • 0
  • 0
  • 3214
article-image-introduction-postgresql-9
Packt
25 Oct 2010
12 min read
Save for later

Introduction to PostgreSQL 9

Packt
25 Oct 2010
12 min read
PostgreSQL 9 Admin Cookbook Over 80 recipes to help you run an efficient PostgreSQL 9.0 database Administer and maintain a healthy database Monitor your database ensuring that it performs as quickly as possible Tips for backup and recovery of your database Introduction PostgreSQL is a feature-rich general purpose database management system. It's a complex piece of software, but every journey begins with the first step. We start with your first connection. Many people fall at the first hurdle, so we try not to skip too swiftly past that. We move on quickly to enabling remote users, and from there to access through GUI administration tools. We also introduce the psql query tool, which is the tool used for loading our sample database. Introducing PostgreSQL 9 PostgreSQL is an advanced SQL database server, available on a wide range of platforms. One of the clearest benefits of PostgreSQL is that it is open source, meaning that you have a very permissive license to install, use, and distribute PostgreSQL without paying anyone fees or royalties. On top of that, PostgreSQL is well-known as a database that stays up for long periods, and requires little or no maintenance in many cases. Overall, PostgreSQL provides a very low total cost of ownership. PostgreSQL is also noted for its huge range of advanced features, developed over the course of more than 20 years continuous development and enhancement. Originally developed by the Database Research group at the University of California, Berkeley, PostgreSQL is now developed and maintained by a huge army of developers and contributors. Many of those contributors have full-time jobs related to PostgreSQL, working as designers, developers, database administrators, and trainers. Some, but not many, of those contributors work for companies that specialize in services for PostgreSQL, such as Hannu and me. No single company owns PostgreSQL, nor are you required, or even encouraged, to register your usage. PostgreSQL has the following main features: Excellent SQL Standards compliance up to SQL 2008 Client-server architecture Highly concurrent design where readers and writers don't block each other Highly configurable and extensible for many types of application Excellent scalability and performance with extensive tuning features What makes PostgreSQL different? The PostgreSQL project focuses on the following objectives: Robust, high-quality software with maintainable, well-commented code Low maintenance administration for both embedded and enterprise use Standards-compliant SQL, interoperability, and compatibility Performance, security, and high availability What surprises many people is that PostgreSQL's feature set is more comparable with Oracle or SQL Server than it is with MySQL. The only connection between MySQL and PostgreSQL is that those two projects are open source; apart from that, the features and philosophies are almost totally different. One of the key features of Oracle since Oracle 7 has been "snapshot isolation", where readers don't block writers, and writers don't block readers. You may be surprised to learn that PostgreSQL was the first database to be designed with this feature, and offers a full and complete implementation. PostgreSQL names this Multi-Version Concurrency Control (MVCC). PostgreSQL is a general-purpose database management system. You defi ne the database that you would like to manage with it. PostgreSQL offers you many ways to work. You can use a "normalized database model", you can utilize extensions such as arrays and record subtypes, or you can use a fully dynamic schema using an extension named hstore . PostgreSQL also allows you to create your own server-side functions in one of a dozen different languages. PostgreSQL is highly extensible, so you can add your own datatypes, operators, index types, and functional languages. For example, you can override different parts of the system using plugins to alter the execution of commands or add a new optimizer. All of these features offer a huge range of implementation options to software architects. There are many ways out of trouble when building applications and maintaining them over long periods of time. In the early days, when PostgreSQL was still a research database, the focus was solely on cool new features. Over the last 15 years, enormous amounts of code have been rewritten and improved, giving us one of the most stable, large, software servers available for operational use. You may also read that PostgreSQL was, or is, slower than My Favorite DBMS, whichever one that is. It's been a personal mission of mine over the last six years to improve server performance and the team have been successful in making the server highly performant and very scalable. That gives PostgreSQL enormous headroom for growth. Who is using PostgreSQL? Prominent users include Apple, BASF, Genentech, IMDB.com, Skype, NTT, Yahoo, and The National Weather Service. PostgreSQL receives well in excess of 1 million downloads per year, according to data submitted to the European Commission, who concluded "...PostgreSQL, is considered by many database users to be a credible alternative... We need to mention one last thing. When PostgreSQL was fi rst developed, it was named Postgres, and so many aspects of the project still refer to the word "postgres". For example, the default database is named postgres, and the software is frequently installed using the postgres userid. As a result, people shorten the name PostgreSQL to simply Postgres, and in many cases people use the two names interchangeably. PostgreSQL is pronounced as "post-grez-q-l". Postgres is pronounced as "post-grez". Some people get confused, and refer to "Postgre", which is hard to say, and likely to confuse people. Two names are enough, so please don't use a third name! Getting PostgreSQL PostgreSQL is 100% open source software. PostgreSQL is freely available to use, alter, or redistribute in any way you choose. PostgreSQL's license is an approved open source license very similar to the BSD (Berkeley Distribution Software) license, though only just different enough that it is now known as TPL (The PostgreSQL License). How to do it... PostgreSQL is already in use by many different application packages, and so you may already find it installed on your servers. Many Linux distributions include PostgreSQL as part of the basic installation, or include it with the installation disk. One thing to be wary of is that the version of PostgreSQL included may not be the latest release. It will typically be the latest major release that was available when that operating system release was published. There is usually no good reason to stick at that level—there is no increased stability implied there and later production versions are just as well-supported by the various Linux distributions. If you don't yet have a copy, or you don't have the latest version, you can download the source code or download binary packages for a wide variety of operating systems from the following URL: http://www.postgresql.org/download/ Installation details vary significantly from platform-to-platform and there aren't any special tricks or recipes to mention. Please, just follow the installation guide, and away you go. We've consciously avoided describing the installation processes here to make sure we don't garble or override the information published to assist you. If you would like to receive e-mail updates of the latest news, then you can subscribe to the PostgreSQL announce mailing list, which contains updates from all the vendors that support PostgreSQL. You'll get a few e-mails each month about new releases of core PostgreSQL and related software, conferences, and user group information. It's worth keeping in touch with developments. For more information about the PostgreSQL announce mailing list, visit the following URL: http://archives.postgresql.org/pgsql-announce/ How it works... Many people ask questions, such as "How can this be free?", "Are you sure I don't have to pay someone?", or "Who gives this stuff away for nothing?" Open source applications such as PostgreSQL work on a community basis, where many contributors perform tasks that make the whole process work. For many of those people, their involvement is professional, rather a hobby, and they can do this because there is generally a great value for both contributors and their employers alike. You might not believe it. You don't have to because It Just Works. There's more... Remember that PostgreSQL is more than just the core software. There is a huge range of websites offering add-ons, extensions, and tools for PostgreSQL. You'll also fi nd an army of bloggers describing useful tricks and discoveries that will help you in your work.   And, there is a range of professional companies able to offer you help when you need it. Connecting to PostgreSQL server How do we access PostgreSQL? Connecting to the database is most people's first experience of PostgreSQL, so we want to make it a good one. So, let's do it, and fix any problems we have along the way. Remember that a connection needs to be made securely, so there may be some hoops for us to jump through to ensure that the data we wish to access is secure. Before we can execute commands against the database, we need to connect to the database server, giving us a session. Sessions are designed to be long-lived, so you connect once, perform many requests, and then eventually disconnect. There is a small overhead during connection. That may become noticeable if you connect/disconnect repeatedly, so you may wish to investigate the use of connection pools. Connection pools allow pre-connected sessions to be served quickly to you when you wish to reconnect. Getting ready First, catch your database. If you don't know where it is, we'll probably have difficulty accessing it. There may be more than one, and you'll need to know the right database to access, and have the authority to connect to it. How to do it... You need to specify the following fi ve parameters to connect to PostgreSQL: host or host address port database name user password (or other means of authentication, if any) To connect, there must be a PostgreSQL server running on host, listening on port number port. On that server, a database named dbname and user must also exist. The host must explicitly allow connections from your client—this is explained in the next recipe, and you must also pass authentication using the method the server specifi es. For example, specifying a password won't work if the server has requested a different form of authentication. Almost all PostgreSQL interfaces use the libpq interface library . When using libpq, most of the connection parameter handling is identical, so we can just discuss that once. If you don't specify the preceding parameters, we look for values set through environment variables, which are as follows: PGHOST or PGHOSTADDR PGPORT (or set to 5432 if this is not set) PGDATABASE PGUSER PGPASSWORD (though this one is defi nitely not recommended) If you specify the first four parameters somehow, but not the password, then we look for a password file. Some PostgreSQL interfaces use the client-server protocol directly, so the way defaults are handled may differ. The information we need to supply won't vary signifi cantly, so please check the exact syntax for that interface. How it works... The PostgreSQL server is a client-server database. The system it runs on is known as the host. We can access the PostgreSQL server remotely through the network. However, we must specify the host, which is a hostname, or a hostaddr , which is an IP address. We can specify a host of "localhost" if we wish to make a TCP/IP connection to the same system. It is often better to use a Unix socket connection, which is attempted if the host begins with a slash (/) and the name is presumed to be a directory name (default is /tmp). On any system, there can be more than one database server. Each database server listens on exactly one "well-known" network port , which cannot be shared between servers on the same system. The default port number for PostgreSQL is 5432, which has been registered with IANA, and is uniquely assigned to PostgreSQL. (You can see it used in the /etc/services file on most *nix servers). The port number can be used to uniquely identify a specific database server if many exist. A database server is also sometimes known as a "database cluster", because the PostgreSQL server allows you to define one or more databases on each server. Each connection request must identify exactly one database identifi ed by its dbname. When you connect, you will only be able to see database objects created within that database. A database user is used to identify the connection. By default, there is no limit on the number of connections for a particular user. In more recent versions of PostgreSQL, users are referred to as login roles, though many clues remind us of the earlier naming, and it still makes sense in many ways. A login role is a role that has been assigned the CONNECT privilege. Each connection will typically be authenticated in some way. This is defined at the server, so is not optional at connection time if the administrator has confi gured the server to require authentication. Once you've connected, each connection can have one active transaction at a time and one fully active statement at any time. The server will have a defined limit on the number of connections it can serve, so a connection request can be refused if the server is oversubscribed. Inspecting your connection information If you want to confirm you've connected to the right place and in the right way, you can execute some or all of the following commands: SELECT inet_server_port(); This shows the port on which the server is listening. SELECT current_database(); Shows the current database. SELECT current_user; This shows the current userid. SELECT inet_server_addr(); Shows the IP address of the server that accepted the connection. A user's password is not accessible using general SQL for obvious reasons. You may also need the following: SELECT version(); See also There are many other snippets of information required to understand connections. Some of those are mentioned in this article. For further details, please consult the PostgreSQL server documentation.
Read more
  • 0
  • 0
  • 2182

article-image-recovery-postgresql-9
Packt
25 Oct 2010
14 min read
Save for later

Recovery in PostgreSQL 9

Packt
25 Oct 2010
14 min read
Recovery of all databases Recovery of a complete database server, including all of its databases, is an important feature. This recipe covers how to do that in the simplest way possible. Getting ready Find a suitable server on which to perform the restore. Before you recover onto a live server, always take another backup. Whatever problem you thought you had could be just about to get worse. How to do it... LOGICAL (from custom dump -F c): Restore of all databases means simply restoring each individual database from each dump you took. Confirm you have the correct backup before you restore: pg_restore --schema-only -v dumpfile | head | grep Started Reload globals from script file as follows: psql -f myglobals.sql Reload all databases. Create the databases using parallel tasks to speed things along. This can be executed remotely without needing to transfer dumpfile between systems. Note that there is a separate dumpfile for each database. pg_restore -d postgres -j 4 dumpfile LOGICAL (from script dump created by pg_dump –F p): As above, though with this command to execute the script. This can be executed remotely without needing to transfer dumpfile between systems. Confirm you have the correct backup before you restore. If the following command returns nothing, then the file is not timestamped, and you'll have to identify it in a different way: head myscriptdump.sql | grep Started Reload globals from script file as follows: psql -f myglobals.sql Reload all scripts like the following: psql -f myscriptdump.sql LOGICAL (from script dump created by pg_dumpall): We need to follow the procedure, which is shown next. Confirm you have the correct backup before you restore. If the following command returns nothing, then the file is not timestamped, and you'll have to identify it in a different way: head myscriptdump.sql | grep Started Find a suitable server, or create a new virtual server. Reload script in full psql -f myscriptdump.sql PHYSICAL: Restore the backup file onto the target server. Extract the backup file into the new data directory. Confirm that you have the correct backup before you restore. $ cat backup_label START WAL LOCATION: 0/12000020 (file 000000010000000000000012) CHECKPOINT LOCATION: 0/12000058 START TIME: 2010-06-03 19:53:23 BST LABEL: standalone Check all file permissions and ownerships are correct and links are valid. That should already be the case if you are using the postgres userid everywhere, which is recommended. Start the server That procedure is so simple. That also helps us understand that we need both a base backup and the appropriate WAL files. If you used other techniques, then we need to step through the tasks to make sure we cover everything required as follows: Shutdown any server running in the data directory. Restore the backup so that any files in the data directory that have matching names are replaced with the version from the backup. (The manual says delete all files and then restore backup—that might be a lot slower than running an rsync between your backup and the destination without the –-update option). Remember that this step can be performed in parallel to speed things up, though it is up to you to script that. Check that all file permissions and ownerships are correct and links are valid. That should already be the case if you are using the postgres userid everywhere, which is recommended. Remove any files in pg_xlog/. Copy in any latest WAL files from a running server, if any. Add in a recovery.conf and set its file permissions correctly also. Start the server. The only part that requires some thought and checking is which parameters you select for the recovery.conf. There's only one that matters here, and that is the restore_command. restore_command tells us how to restore archived WAL files. It needs to be the command that will be executed to bring back WAL files from the archive. If you are forward-thinking, there'll be a README.backup file for you to read to find out how to set the restore_command. If not, then presumably you've got the location of the WAL files you've been saving written down somewhere. Say, for example, that your files are being saved to a directory named /backups/pg/servername/archive, owned by the postgres user. On a remote server named backup1, we would then write this all on one line of the recovery.conf as follows: restore_command = 'scp backup1:/backups/pg/servername/archive/%f %p' How it works... PostgreSQL is designed to require very minimal information to perform a recovery. We try hard to wrap all the details up for you. Logical recovery: Logical recovery executes SQL to re-create the database objects. If performance is an issue, look at the recipe on recovery performance. Physical recovery: Physical recovery re-applies data changes at the block level so tends to be much faster than logical recovery. Physical recovery requires both a base backup and a set of archived WAL files. There is a file named backup_label in the data directory of the base backup. This tells us to retrieve a .backup file from the archive that contains the start and stop WAL locations of the base backup. Recovery then starts to apply changes from the starting WAL location, and must proceed as far as the stop address for the backup to be valid. After recovery completes, the recovery.conf file is renamed to recovery.done to prevent the server from re-entering recovery. The server log records each WAL file restored from the archive, so you can check progress and rate of recovery. You can query the archive to find out the name of the latest archived WAL file to allow you to calculate how many files to go. The restore_command should return 0 if a file has been restored and non-zero for failure cases. Recovery will proceed until there is no next WAL file, so there will eventually be an error recorded in the logs. If you have lost some of the WAL files, or they are damaged, then recovery will stop at that point. No further changes after that will be applied, and you will likely lose those changes; that would be the time to call your support vendor. There's more... You can start and stop the server once recovery has started without any problem. It will not interfere with the recovery. You can connect to the database server while it is recovering and run queries, if that is useful. That is known as Hot Standby mode. Recovery to a point in time If your database suffers a problem at 15:22 p.m. and yet your backup was taken at 04:00 a.m. you're probably hoping there is a way to recover the changes made between those two times. What you need is known as "point-in-time recovery". Regrettably, if you've made a backup with pg_dump at 04:00 a.m. then you won't be able to recover to any other time than 04:00. As a result, the term point-in-time recovery (PITR) has become synonymous with the physical backup and restore technique in PostgreSQL. Getting ready If you have a backup made with pg_dump, then give up all hope of using that as a starting point for a point in time recovery. It's a frequently asked question, but the answer is still "no"; the reason it gets asked is exactly why I'm pleading with you to plan your backups ahead of time. First, you need to decide what the point of time is that to which you would like to recover. If the answer is "as late as possible", then you don't need to do a PITR at all, just recover until end of logs. How to do it... How do you decide to what point to recover? The point where we stop recovery is known as the "recovery target". The most straightforward way is to do this based upon a timestamp. In the recovery.conf, you can add (or uncomment) a line that says the following: recovery_target_time = '2010-06-01 16:59:14.27452+01' or similar. Note that you need to be careful to specify the time zone of the target, so that it matches the time zone of the server that wrote the log. That might differ from the time zone of the current server, so check. After that, you can check progress during a recovery by running queries in Hot Standby mode. How it works... Recovery works by applying individual WAL records. These correspond to individual block changes, so there are many WAL records to each transaction. The final part of any successful transaction is a commit WAL record, though there are abort records as well. Each transaction completion record has a timestamp on it that allows us to decide whether to stop at that point or not. You can also define a recovery target using a transaction id (xid), though finding out which xid to use is somewhat difficult, and you may need to refer to external records if they exist. The recovery target is specified in the recovery.conf and cannot change while the server is running. If you want to change the recovery target, you can shutdown the server, edit the recovery.conf, and then restart the server. Be careful though, if you change the recovery target and recovery is already passed the point, it can lead to errors. If you define a recovery_target_timestamp that has already passed, then recovery will stop almost immediately, though this will be later than the correct stopping point. If you define a recovery_target_xid that has already passed, then recovery will just continue to the end of logs. Restarting recovery from the beginning using a fresh restore of the base backup is always safe. Once a server completes recovery, it will assign a new "timeline". Once a server is fully available, we can write new changes to the database. Those changes might differ from changes we made in a previous "future history" of the database. So we differentiate between alternate futures using different timelines. If we need to go back and run recovery again, we can create a new server history using the original or subsequent timelines. The best way to think about this is that it is exactly like a Sci-Fi novel—you can't change the past but you can return to an earlier time and take a different action instead. But you'll need to be careful not to confuse yourself. There's more... pg_dump cannot be used as a base backup for a PITR. The reason is that a log replay contains the physical changes to data blocks, not logical changes based upon Primary Keys. If you reload a pg_dump the data will likely go back into different data blocks, so the changes wouldn't correctly reference the data. WAL doesn't contain enough information to reconstruct all SQL fully that produced those changes. Later feature additions to PostgreSQL may add the required information to WAL. See also Planned in 9.1 is the ability to pause/resume/stop recovery, and to set recovery targets while the server is up dynamically. This will allow you to use the Hot Standby facility to locate the correct stopping point more easily. You can trick Hot Standby into stopping recovery, which may help. Recovery of a dropped/damaged table You may drop or even damage a table in some way. Tables could be damaged for physical reasons, such as disk corruption, or they could also be damaged by running poorly specified UPDATEs/DELETEs, which update too many rows or overwrite critical data. It's a common request to recover from this situation from a backup. How to do it... The methods differ, depending upon the type of backup you have available. If you have multiple types of backup, you have a choice. LOGICAL (from custom dump -F c): If you've taken a logical backup using pg_dump into a custom file, then you can simply extract the table you want from the dumpfile like the following: pg_restore -t mydroppedtable dumpfile | psql or connect direct to the database using –d. The preceding command tries to re-create the table and then load data into it. Note that pg_restore -t option does not dump out any of the indexes on the table selected. That means we need a slightly more complex procedure than it would first appear, and the procedure needs to vary depending upon whether we are repairing a damaged table or putting back a dropped table. To repair a damaged table we want to replace the data in the table in a single transaction. There isn't a specific option to do this, so we need to do the following: Dump the table to a script file as follows: pg_restore -t mydroppedtable dumpfile > mydroppedtable.sql Edit a script named restore_mydroppedtable.sql with the following code: BEGIN; TRUNCATE mydroppedtable; i mydroppedtable.sql COMMIT; Then, run it using the following: psql -f restore_mydroppedtable.sql If you've dropped a table then you need to: Create a new database in which to work, name it restorework, as follows: CREATE DATABASE restorework; Restore the complete schema to the new database as follows: pg_restore --schema-only -d restorework dumpfile Now, dump just the definitions of the dropped table into a new file, which will contain CREATE TABLE, indexes, other constraints and grants. Note that this database has no data in it, so specifying –-schema-only is optional, as follows: pg_dump -t mydroppedtable --schema-only restorework > mydroppedtable.sql Now, recreate the table on the main database as follows: psql -f mydroppedtable.sql Now, reload just the data into database maindb as follows pg_restore -t mydroppedtable --data-only -d maindb dumpfile If you've got a very large table, then the fourth step can be a problem, because it builds the indexes as well. If you want you can manually edit the script into two pieces, one before the load ("pre-load") and one after the load ("post-load"). There are some ideas for that at the end of the recipe. LOGICAL (from script dump): The easy way to restore a single table from a script is as follows: Find a suitable server, or create a new virtual server. Reload the script in full, as follows: psql -f myscriptdump.sql From the recovered database server, dump the table, its data, and all the definitions of the dropped table into a new file as follows: pg_dump -t mydroppedtable -F c mydatabase > dumpfile Now, recreate the table into the original server and database, using parallel tasks to speed things along. This can be executed remotely without needing to transfer dumpfile between systems. pg_restore -d mydatabase -j 2 dumpfile The only way to extract a single table from a script dump without doing all of the preceding is to write a custom Perl script to read and extract just the parts of the file you want. That can be complicated, because you may need certain SET commands at the top of the file, the table, and data in the middle of the file, and the indexes and constraints on the table are near the end of the file. It's complex; the safe route is the one already mentioned. PHYSICAL: To recover a single table from a physical backup, we need to: Find a suitable server, or create a new virtual server. Recover the database server in full, as described in previous recipes on physical recovery, including all databases and all tables. You may wish to stop at a useful point in time. From the recovered database server, dump the table, its data, and all the definitions of the dropped table into a new file as follows: pg_dump -t mydroppedtable -F c mydatabase > dumpfile Now, recreate the table into the original server and database using parallel tasks to speed things along. This can be executed remotely without needing to transfer dumpfile between systems as follows: pg_restore -d mydatabase -j 2 dumpfile How it works... At present, there's no way to restore a single table from a physical restore in just a single step. See also Splitting a pg_dump into multiple sections, "pre" and "post" was proposed by me for an earlier release of PostgreSQL, though I haven't had time to complete that yet. It's possible to do that using an external utility also; the best script I've seen to split a dump file into two pieces is available at the following website: http://bucardo.org/wiki/split_postgres_dump
Read more
  • 0
  • 0
  • 5403
Modal Close icon
Modal Close icon