Gearman is not only a way to process jobs but also a way to build powerful distributed job processing that helps horizontally scale your application. We will discuss how to use Gearman to solve a variety of problems, such as using the MapReduce methodology to process large amounts of data, building a pipeline of loosely coupled processes that work together to process data using different languages and libraries, and offloading long-running, complex data analysis while providing real-time feedback to a frontend web application. This section will be a hands-on exploration of how you can leverage both synchronous and asynchronous workers to expand the capability and capacity of your applications, as well as introduce you to the concept of persistent queues, why you might consider using them, and how they will affect your application.
In order to understand some of the concepts in this section, you need to know about the lifecycle of a task inside Gearman. Clients request work by submitting a variation of the SUBMIT_JOB message (the variations indicate priority and whether or not it's a background job). The manager sends the client a JOB_CREATED message to acknowledge that the job has been received and stored by the manager.
As part of the process of enqueuing a job, the manager is responsible for generating what is known as a job handle. A job handle identifies a job in the system and is unique to a given job manager. Typically, job handles are in the format H:server_name:incremental_number; this way, a worker or anyone else interacting with that job knows where it originated. However, this format is not mandated and should not be relied upon, as not all job managers format job handles this way. After queuing the job, the manager sends the client a JOB_CREATED message containing the job handle that represents the job inside the manager. Once the manager has accepted the job, the job handle becomes the key used to identify it. For example, if a client submits a background job, it might need that job handle in order to periodically check the status of that job. When work is completed, an error occurs, or another status related to the job needs to be communicated, the client again uses that job handle to associate the WORK_COMPLETE, WORK_EXCEPTION, WORK_DATA, and similar messages with the correct job. Once work is completed, it is removed from the work queue; it is important to note that completed is not a synonym for successful.
Completed versus successful
One of the nuances of message-queuing or job-queuing systems is understanding the lifecycle of an entity in the system. In Gearman, there is a best-effort contract between the manager and the client that the work will be completed, where complete may mean that the outcome was successful, generated an error, or failed outright. If a worker disconnects from the manager in the middle of processing a job, then the expected behavior is that the manager will simply requeue the job for another worker to pick up later. In a non-persistent queue, the manager will store jobs only for as long as it is required to deliver the job, or until it terminates. If you need jobs to persist after a manager terminates then you will want to read the section on persistence engines later on in the book.
Unique identifiers and coalescing
Each job has a unique identifier associated with it (in addition to the job handle). The client can set a job's unique identifier; if it is not provided, one will be generated by the manager. When the manager sees multiple jobs with the same unique identifier, it will effectively treat them all as one job. This allows the manager to reduce the overall workload by coalescing these job requests into a single unit of work, allowing one worker to handle potentially thousands of client requests by processing a single job. When the manager receives the response from the worker, it will forward that data to all clients requesting jobs with the same unique identifier.
To demonstrate this, we will perform the following steps:
Write a client that will submit a single job to a job queue with a common unique identifier.
Write a worker that will process jobs in that job queue.
Run multiple copies of the client, each of which will submit the same job to the manager.
Watch the manager coalesce the requests by delivering a single job to the worker and returning the results to all the clients.
The following client submits a job with a unique identifier set (in this case, it is set to unique_key), which the manager uses to identify jobs that are identical. Some libraries will automatically generate a unique identifier based on the data being submitted, so that jobs with the same data are coalesced for you. The following sketch assumes the gearman-ruby gem and its :uniq task option:
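```ruby
# client_coalesce.rb -- a sketch; the coalesce queue name and the :uniq
# option are assumptions based on the gearman-ruby gem.
require 'rubygems'
require 'gearman'

client  = Gearman::Client.new('localhost:4730')
taskset = Gearman::TaskSet.new(client)

# :uniq sets the job's unique identifier so identical requests are coalesced.
task = Gearman::Task.new('coalesce', 'some data', :uniq => 'unique_key')
task.on_complete { |response| puts "#{Time.now}: #{response}" }

puts "#{Time.now}: submitting job"
taskset.add_task(task)
taskset.wait(30)
```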
The following worker will grab a job from the queue and then sleep for 10 seconds. This gives us enough time to run a few copies of the client to demonstrate that the worker will only get one instance of the job and all the clients will receive the response simultaneously:
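```ruby
# worker_coalesce.rb -- a sketch; the coalesce queue name is an assumption.
require 'rubygems'
require 'gearman'

worker = Gearman::Worker.new('localhost:4730')

worker.add_ability('coalesce') do |data, job|
  start = Time.now
  puts "#{start}: received job, sleeping for 10 seconds"
  sleep 10
  "work started at #{start}"  # returning the start time lets us observe coalescing
end

loop { worker.work }
```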
To demonstrate this, save the worker code into a file, worker_coalesce.rb, and then open a terminal and execute it:
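```
ruby worker_coalesce.rb
```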
The worker will wait for jobs to be submitted by the clients. Next, save the client's code into a file, client_coalesce.rb, and open three different terminals. Navigate each of these terminals to where you saved the client, and run one copy of the client in each terminal (the output shown below is illustrative):
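```
ruby client_coalesce.rb
```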
Client 1:
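```
2013-06-10 10:30:15 -0700: submitting job
2013-06-10 10:30:25 -0700: work started at 2013-06-10 10:30:15 -0700
```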
Client 2:
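```
2013-06-10 10:30:16 -0700: submitting job
2013-06-10 10:30:25 -0700: work started at 2013-06-10 10:30:15 -0700
```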
Client 3:
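```
2013-06-10 10:30:20 -0700: submitting job
2013-06-10 10:30:25 -0700: work started at 2013-06-10 10:30:15 -0700
```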
Here we have run three copies of the client, each submitting the same job at three different times: 15, 16, and 20 seconds past the minute. Looking at the worker's output, however, we see that the worker has only received a single job (as expected) immediately after the first client submitted its job:
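```
2013-06-10 10:30:15 -0700: received job, sleeping for 10 seconds
```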
After the worker sleeps for ten seconds, it will push the results back to the manager, which will in turn pass the results along to any clients who are waiting for that job (here, all three are waiting for the same job). If you look at the terminals for the clients, you will see that all three of them received the same work data at the same time.
Notice that all three clients received the response from the worker to the initial request from the first client. This is because it was the first one in the queue, and so it was pushed out to the worker immediately. We can tell this because the worker sends back the start time in the response that is being printed by the client.
When work is coalesced, there will be only one instance of that job in the queue at any given time. If a foreground job with a given unique identifier is in the queue and some set of clients request work with the same unique identifier, those clients (along with the first client) are grouped together and are all delivered the same results when the work completes execution. If the requests are for background jobs, then subsequent requests will effectively be ignored if there is still a job in the queue (active or not) with the same unique identifier.
This allows the manager to prevent clients from flooding the queue with the same requests for data, and it makes the system more efficient by fanning out the results to multiple clients who are all interested in the same set of results. This also prevents what is sometimes referred to as the thundering herd, where an influx of requests or some other event causes a large number of clients to request the same work.
One of the best benefits of building your system to use Gearman is that it provides you with a structured way to scale your system horizontally. This is the ability to solve problems of larger size by adding more nodes to the system rather than simply increasing the processing power of a single node (referred to as vertical scaling). Even implementing a very basic configuration will provide you with the capability to start small and grow as you need to.
A very simple Gearman infrastructure consists of just one worker, one client, and one manager.
This setup does not require that you have three physical (or virtual) servers available in order to begin to take advantage of this setup. In fact, you can do it with as little as one server. By starting out this way, you can begin to architect your system to take advantage of this new infrastructure without incurring any additional hardware costs.
However, in practice, it is better to have multiple managers to rotate through in a round-robin fashion.
In this scenario, the system is composed of multiple clients, each of which has a list of managers that it can connect to, and each worker is also connected to (and receiving jobs from) the layer of managers.
By arranging your infrastructure this way, if one of the managers fails, then the workers and the clients still have managers they can interact with, so that work doesn't stop being processed. Your system would be operating at a reduced capacity, but organizing it in this fashion not only provides the system with more capacity but also removes a possible point of failure from the system.
In this example we won't build a system with dozens of managers, workers, and clients; we will build a smaller version by starting two managers and then performing the following steps:
Writing a client that will submit jobs to the two managers.
Running the client and watching it submit jobs to both managers.
Terminating one of the managers and watching the client submit jobs to the remaining manager.
Restarting the manager.
Writing a worker to process our jobs.
Running the new worker and watching it process jobs from both managers.
Running multiple managers
Open two terminals, and navigate to where you saved the JAR file for the Gearman server. Run an instance of the manager in each terminal, specifying a different listening port on each of them for both the manager and the web interface (the JAR filename and port flags shown below are assumptions; adjust them for your server version):
Server 1:
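```
java -jar java-gearman-service.jar --port=4730 --http-port=8080 --debug
```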
Server 2:
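```
java -jar java-gearman-service.jar --port=4731 --http-port=8081 --debug
```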
Notice that here we have added the --debug flag on the command line. This is done so that you can see the client's requests being split up between the two servers, and watch all the jobs be submitted to one of them when the other is terminated later on.
Now you have two job managers running on one server. While this obviously won't help alleviate any processing bottlenecks or power-failure issues, it will allow us to demonstrate working with multiple managers.
Writing a client that supports multiple managers
To submit jobs to these managers, we will write a client that will submit jobs indefinitely to two different servers, sleeping for half a second between requests (a sketch assuming the gearman-ruby gem; the sleep queue name is an assumption):
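```ruby
# multiserver_client.rb -- a sketch; assumes the gearman-ruby gem.
require 'rubygems'
require 'gearman'

servers = ['localhost:4730', 'localhost:4731']
client  = Gearman::Client.new(servers)

count = 0
loop do
  count += 1
  taskset = Gearman::TaskSet.new(client)
  # :background submits a SUBMIT_JOB_BG message, so no result is returned.
  taskset.add_task(Gearman::Task.new('sleep', '2', :background => true))
  taskset.wait(10)
  puts "Submitted job #{count}"
  sleep 0.5
end
```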
Submitting jobs to multiple managers
Save this client in a file, multiserver_client.rb, and run it to watch it submit jobs. Output from the client sketch above might look like the following:
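```
Submitted job 1
Submitted job 2
Submitted job 3
```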
For each job submitted, the manager will log that it received a SUBMIT_JOB_BG message and that it responded with a JOB_CREATED message to the client, so for every job submitted you should see a corresponding pair of entries in that server's output.
Handling a manager being terminated
When both managers are running, the single client will be submitting jobs to both of the managers, likely in a fairly round-robin fashion. Now, terminate one of the running servers, and verify that the client continues to write its jobs to the other server. You should see twice the level of output from the still-running server.
Now stop the client and restart the recently-terminated manager. Currently, the Ruby library doesn't recognize when a manager it was previously connected to comes back online; fixing that is outside the scope of this book, so we will simply work around it for now.
Processing jobs from multiple managers
Next, we will write a simple worker that will connect to both servers that we have running (again, a sketch; the sleep queue name matches the client above):
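```ruby
# multiserver_worker.rb -- a sketch; assumes the gearman-ruby gem.
require 'rubygems'
require 'gearman'

servers = ['localhost:4730', 'localhost:4731']
worker  = Gearman::Worker.new(servers)

worker.add_ability('sleep') do |data, job|
  seconds = data.to_i
  puts "Sleeping for #{seconds} second(s)"
  sleep seconds
  ''
end

loop { worker.work }
```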
Store this in a file, multiserver_worker.rb, and run it the same way we have before:
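```
ruby multiserver_worker.rb
```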
Notice that the worker is connected to both of our servers and will accept jobs from either of them; you can verify this by watching the logs of both managers. The worker will process all the jobs that are currently in the queue (only the manager that did not terminate will still have jobs).
Now, start the client again and watch as it sends jobs to both managers. Watching the worker output will show that those jobs are processed by the worker. This is a very simple setup but provides you with the basic building blocks to expand out your manager, client, and worker fleet from only a few to hundreds or even thousands.
Tip
Here you can try running multiple clients, workers, and managers and see what happens when you terminate various components in the system.
MapReduce is a technique that is used to take large quantities of data and farm them out for processing. A somewhat trivial example: given 1 TB of HTTP log data, count the number of hits that come from a given country and report those numbers. For example, suppose you have the following five log entries (simplified and illustrative, showing just the request, status, and country of origin):
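```
GET /index.html 200 USA
GET /about.html 200 Canada
GET /index.html 200 USA
GET /products.html 200 France
GET /index.html 200 Germany
```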
Then the answer to the question would be as follows:
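```
USA: 2
Canada: 1
France: 1
Germany: 1
```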
Clearly this example dataset does not warrant distributing the data processing among multiple machines, but imagine if instead of five rows of log data we had twenty-five billion rows. If your program took half a second on a single computer to process five records, it would take a little short of eighty years to process twenty-five billion records. To solve this, we could break the data into smaller chunks, process those smaller chunks, and recombine the results when finished.
To apply this to a slightly larger dataset, imagine you extrapolated these five records to one hundred records and then split those one hundred records into five groups, each containing twenty records. From those five groups we might compute results like the following (the numbers are illustrative):
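```
Group 1: {"USA"=>7, "Canada"=>3, "France"=>6, "Germany"=>4}
Group 2: {"USA"=>4, "Canada"=>6, "France"=>5, "Germany"=>5}
Group 3: {"USA"=>5, "Canada"=>7, "France"=>3, "Germany"=>5}
Group 4: {"USA"=>6, "Canada"=>4, "France"=>6, "Germany"=>4}
Group 5: {"USA"=>3, "Canada"=>5, "France"=>5, "Germany"=>7}
```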
If we were to combine these data points by using the country name as a key and store them in a map, adding the value to any existing value, we would get the count per country across all one hundred records.
Using Ruby, we can write a simple program to do this, first without using Gearman, and then with it.
To demonstrate this, we will write the following:
A simple library that we can use in our non-distributed program and in our Gearman-enabled programs
An example program that demonstrates using the library
A client that uses the library to split up our data and submit jobs to our manager
A worker that uses the library to process the job requests and return the results
First we will develop a library that we can reuse. This demonstrates that you can reuse existing logic to quickly take advantage of Gearman, and sharing the library ensures the following things:
The program, client, and worker are much simpler so we can see what's going on in them
The behavior between our program, client, and worker is guaranteed to be consistent
The shared library will have two methods, map_data and reduce_data. The map_data method will be responsible for splitting up the data into chunks to be processed, and the reduce_data method will process those chunks of data and return something that can be merged together into an accurate answer. Take the following example (a minimal sketch; the MapReduce module name is an assumption), and save it to a file named functions.rb for later use:
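```ruby
# functions.rb -- shared MapReduce helpers (a minimal sketch).
module MapReduce
  # Split the records into the requested number of roughly even chunks.
  def self.map_data(records, groups)
    records.each_slice((records.size / groups.to_f).ceil).to_a
  end

  # Count the occurrences of each element in a chunk of records.
  def self.reduce_data(records)
    records.each_with_object(Hash.new(0)) do |country, counts|
      counts[country] += 1
    end
  end
end
```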
To use this library, we can write a very simple program that demonstrates the functionality:
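```ruby
# mapreduce.rb -- exercises the shared helpers without Gearman.
# The four country names are illustrative.
require_relative 'functions'

countries = %w[USA Canada France Germany]
records   = (countries * 25).shuffle # one hundred records, twenty-five per country

totals = Hash.new(0)
MapReduce.map_data(records, 5).each do |chunk|
  MapReduce.reduce_data(chunk).each { |country, count| totals[country] += count }
end

p totals
```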
Put the contents of this example into a Ruby source file named mapreduce.rb in the same directory as your functions.rb file, and execute it with the following:
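```
ruby mapreduce.rb
```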
This script will generate a list with one hundred elements in it. Since there are four distinct elements, each will appear 25 times as the following output shows:
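```
{"USA"=>25, "Canada"=>25, "France"=>25, "Germany"=>25}
```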
Following in this vein, we can add in Gearman to extend our example to operate using a client that submits jobs and a single worker that processes them serially to generate the same results. The reason we wrote these methods in a separate module from the driver application was to make them reusable in this fashion.
The following code for the client in this example is responsible for the mapping phase; it will split apart the data and submit jobs for the blocks it needs processed. In this example worker/client setup, we are using JSON as a simple way to serialize/deserialize the data being sent back and forth (a sketch; the count_countries queue name is an assumption):
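```ruby
# mapreduce_client.rb -- a sketch of the mapping client, assuming the gearman-ruby gem.
require 'rubygems'
require 'gearman'
require 'json'
require_relative 'functions'

client  = Gearman::Client.new('localhost:4730')
taskset = Gearman::TaskSet.new(client)

countries = %w[USA Canada France Germany]
records   = (countries * 25).shuffle
totals    = Hash.new(0)

MapReduce.map_data(records, 5).each do |chunk|
  task = Gearman::Task.new('count_countries', JSON.generate(chunk))
  task.on_complete do |response|
    # Merge this block's counts into the running totals.
    JSON.parse(response).each { |country, count| totals[country] += count }
  end
  taskset.add_task(task)
end

taskset.wait(100)
p totals
```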
This client uses a few new concepts that were not used in the introductory examples, that is, task sets and unique identifiers. In the Ruby client, a task set is a group of tasks that are submitted together and can be waited upon collectively. To generate a task set, you construct it by giving it the client that you want to submit the task set with:
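```ruby
taskset = Gearman::TaskSet.new(client)
```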
Then you can create and add tasks to the task set:
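```ruby
task = Gearman::Task.new('count_countries', JSON.generate(chunk))
taskset.add_task(task)
```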
Finally, you tell the task set how long you want to wait for the results:
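```ruby
taskset.wait(100)
```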
This will block the program until the timeout passes or all the tasks in the task set complete (again, complete does not necessarily mean that the worker succeeded at the task, only that it saw it through to completion). In this example, it will wait 100 seconds for all the tasks to complete before giving up on them. This doesn't mean that the jobs won't complete if the client disconnects, just that the client won't see the end results (which may or may not be acceptable).
To complete the distributed MapReduce example, we need to implement the worker that is responsible for performing the actual data processing. The worker will perform the following tasks (a sketch follows the list):
Receive a list of countries serialized as JSON from the manager
Decode that JSON data into a Ruby structure
Perform the reduce operation on the data converting the list of countries into a corresponding hash of counts
Serialize the hash of counts as a JSON string
Return the JSON string to the manager (to be passed on to the client)
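A minimal sketch of such a worker, reusing functions.rb and the count_countries queue name from the client sketch:

```ruby
# mapreduce_worker.rb -- a sketch of the reducing worker, assuming the gearman-ruby gem.
require 'rubygems'
require 'gearman'
require 'json'
require_relative 'functions'

worker = Gearman::Worker.new('localhost:4730')

worker.add_ability('count_countries') do |data, job|
  countries = JSON.parse(data)                 # decode the JSON payload
  counts    = MapReduce.reduce_data(countries) # reduce the list to a hash of counts
  sleep 4                                      # simulate a job that takes a while
  JSON.generate(counts)                        # return the serialized result
end

loop { worker.work }
```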
Notice that we have introduced a slight delay in returning the results by instructing our worker to sleep for four seconds before returning the data. This is here in order to simulate a job that takes a while to process.
To run this example, we will repeat the exercise from the first section. Save the contents of the client to a file called mapreduce_client.rb, and the contents of the worker to a file named mapreduce_worker.rb, in the same directory as the functions.rb file. Then, start the worker first by running the following:
And then start the client by running the following:
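```
ruby mapreduce_client.rb
```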
When you run these scripts, the worker will be waiting to pick up jobs, and then the client will generate five jobs, each with a block containing a list of countries to be counted, and submit them to the manager. These jobs will be picked up by the worker and then processed, one at a time, until they are all complete. As a result, there will be a twenty-second difference between when the jobs are submitted and when they are all completed.
Parallelizing the pipeline
Implementing the solution this way clearly doesn't gain us much performance over the original example. In fact, it is going to be slower (even ignoring the four-second sleep inside each job execution) than the original because there is time involved in serializing and deserializing the data, transmitting the data between the actors, and transmitting the results back. The goal of this exercise is to demonstrate building a system that can increase the number of workers and parallelize the processing of data, which we will see in the following exercise.
To demonstrate the power of parallel processing, we can now run two copies of the worker. Simply open a new shell and execute the worker via ruby mapreduce_worker.rb; this will spin up a second copy of the worker that is ready to process jobs.
Now, run the client a second time and observe the behavior. You will see that the client has completed in twelve seconds instead of twenty. Why not ten? Remember that we submitted five jobs, and each will take four seconds. Five jobs do not get divided evenly between two workers, and so one worker will acquire three jobs instead of two, which will take it an additional four seconds to complete.
Feel free to experiment with the various parameters of the system such as running more workers, increasing the number of records that are being processed, or adjusting the amount of time that the worker sleeps during a job. While this example does not involve processing enormous quantities of data, hopefully you can see how this can be expanded for future growth.
Increasing processing power (vertical scaling) is not the only way to scale a solution. Scaling involves changing the way you think about your data and optimizing your system to take advantage of the resources you have. Building a system that scales means being able to add more capacity as needed in a predictable and manageable way.
One aspect of system optimization is reducing the amount of data that is passed between components of the system. Distributed software can be thought of as one program running on separate systems. In a traditional software program that runs as one process on one system, we look for ways to optimize our program by passing data as references (pointers) instead of passing data as values (copies). In our MapReduce program, we are passing around small amounts of data (a few hundred bytes) between the workers, managers, and clients. Imagine if you tried to take this same approach to processing a multi-terabyte file. It would not make sense to try to transfer hundreds, if not thousands, of megabytes between hosts multiple times.
When working with large objects in your program, you don't want to copy data into and out of functions due to the overhead in CPU time spent copying data as well as the extra memory being consumed by keeping multiple copies of the same data. Because these are distributed pieces of software with no shared memory, and in a lot of cases the data being worked with is so large as to make it impractical to store in memory or send copies around the network, the amount of data being passed between systems should remain as small as possible.
Ideal ways to share data between systems are as follows:
Primary keys so the client can retrieve the data itself in an optimal fashion
URLs to access the data via an API
Paths to shared data storage devices (NFS, SAN, etc.)
File offsets (if you are processing the same file from multiple clients)
Additionally, the mechanism for serializing this data is entirely up to you as a developer. Some methods include (but are not limited to) the following:
JSON
XML
BSON (Binary JSON)
MessagePack
Processing a large data file
Let's look at how we might use the methodologies we have learned to take advantage of Gearman to process a logfile that is multiple terabytes in size and contains billions of records.
To further build upon our MapReduce program, we could write a client whose job is to look at the size of the file and generate requests to process approximately 50 MB blocks of data, then to take the results and merge them together. To solve such a problem, we would need to take an approach that reduces the amount of data being transferred between the clients, managers, and workers (that is, it does not pass around 50 MB chunks of data). As pointed out before, this data not only has to be transferred from the client to the manager and the manager to the worker (thereby transmitting an amount of data equal to two times the block size), but the manager stores all that data. It is unlikely that the manager would be capable of storing terabytes of job data in memory, and trying to do so would cause the system to fail.
One solution would be to store the data in a location that both clients and workers have access to (a network share, or some other solution) and then build a client that takes the following approach (a sketch follows the steps):
The client opens the file to be processed and begins scanning it.
For every group of N lines (where N is some predetermined number):
The client determines the beginning and ending byte offset of that set of lines
The client submits a background job to the manager for a worker to process with three parameters: path to the file, starting offset, and ending offset
Once the file is processed, the client is finished with its work.
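A hypothetical sketch of such a client; the file path, the process_chunk queue name, and the value of N are all illustrative assumptions:

```ruby
# chunk_client.rb -- a hypothetical sketch of the offset-based client.
require 'rubygems'
require 'gearman'
require 'json'

N      = 100_000                  # lines per job, a predetermined number
path   = '/mnt/shared/access.log' # shared storage visible to the workers (assumption)
client = Gearman::Client.new('localhost:4730')

File.open(path) do |file|
  chunk_start = 0
  count = 0
  until file.eof?
    file.readline
    count += 1
    if count == N || file.eof?
      # Send only the path and byte offsets, never the data itself.
      payload = JSON.generate('path' => path, 'start' => chunk_start, 'end' => file.pos)
      taskset = Gearman::TaskSet.new(client)
      taskset.add_task(Gearman::Task.new('process_chunk', payload, :background => true))
      taskset.wait(10)
      chunk_start = file.pos
      count = 0
    end
  end
end
```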
This approach ensures that the data being shared between client, manager, and worker is minimal (a path plus beginning and ending offsets) while breaking up the file into blocks of data that can be processed by the worker. Because each block is on the small side, this solution is able to take advantage of having numerous workers on hand, each of which would take the following approach to processing the data file (sketched after the steps):
Receive a job from the manager with the path, starting offset, and ending offset.
Open the data file seeking the starting offset.
Read a block of data that is (ending offset minus starting offset) bytes in size.
Split up the block of data by line-ending.
Process each line the same way we processed them in the simple MapReduce example.
Submit a job to the manager with the computed data for a different worker to merge with the results from other workers.
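A hypothetical sketch of such a worker; the log-line format and the merge_counts queue are illustrative assumptions:

```ruby
# chunk_worker.rb -- a hypothetical sketch of the offset-based worker.
require 'rubygems'
require 'gearman'
require 'json'
require_relative 'functions'

SERVERS = ['localhost:4730']
worker  = Gearman::Worker.new(SERVERS)

worker.add_ability('process_chunk') do |data, job|
  args  = JSON.parse(data)
  block = File.open(args['path']) do |f|
    f.seek(args['start'])               # jump to the starting offset
    f.read(args['end'] - args['start']) # read only this chunk's bytes
  end

  # Extract the country field from each line (the log format is assumed).
  countries = block.split("\n").map { |line| line.split.last }
  counts    = MapReduce.reduce_data(countries)

  # Submit the partial counts to a hypothetical merge queue for another worker.
  client  = Gearman::Client.new(SERVERS)
  taskset = Gearman::TaskSet.new(client)
  taskset.add_task(Gearman::Task.new('merge_counts', JSON.generate(counts), :background => true))
  taskset.wait(10)
  ''
end

loop { worker.work }
```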
Because we often offload long-running or CPU-intensive work to Gearman (processing a three-terabyte logfile in the way we did before might take a very long time), workers and clients are often operating in a disconnected mode. As a result, the worker's status updates are not being delivered directly to the client. To accommodate both the synchronous and asynchronous use cases, Gearman workers can provide periodic data updates or status updates as they process data.
Here, we will demonstrate these two mechanisms by building the following:
A worker that will periodically send real-time data results to the client
A client that will process real-time data updates
A worker that will emit periodic status updates
A client that will handle status updates
In this example, the worker responds periodically with data updates, each of which is forwarded to any clients that are listening. These updates are designed to inform clients of results, not of the overall status of the work. This is demonstrated in the following sketch (the send_data method name is an assumption; check your client library for its data-update call):
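```ruby
# data_worker.rb -- a sketch of a worker that emits periodic data updates.
require 'rubygems'
require 'gearman'

worker = Gearman::Worker.new('localhost:4730')

worker.add_ability('long_job') do |data, job|
  5.times do |i|
    sleep 2  # stand-in for a slice of real work
    # Emit a WORK_DATA packet with an intermediate result. The method name is
    # an assumption; some versions of gearman-ruby call this send_partial.
    job.send_data("finished step #{i + 1}")
  end
  'all steps complete'
end

loop { worker.work }
```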
These periodic updates are emitted as WORK_DATA packets which, when forwarded to the client, should be handled.
In the Ruby client library, an on_data handler is attached to a Task object to process these data updates, as shown in the following example:
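```ruby
# data_client.rb -- a sketch of a client that handles WORK_DATA updates.
# The long_job queue name matches the worker sketch above.
require 'rubygems'
require 'gearman'

client  = Gearman::Client.new('localhost:4730')
taskset = Gearman::TaskSet.new(client)

task = Gearman::Task.new('long_job', 'payload')
task.on_data     { |data| puts "Update from worker: #{data}" }
task.on_complete { |data| puts "Done: #{data}" }

taskset.add_task(task)
taskset.wait(60)
```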
The other way that workers can communicate with a client is to have the worker send periodic status updates. These are different from data updates in that they contain an indication of the percentage of work complete, not intermediate results. The status update mechanism operates by having the worker periodically send back a status message containing a numerator and a denominator to tell the manager how much of the work has been done. These messages are also stored internally by the manager so they can be used to communicate the progress of both synchronous and asynchronous jobs.
In this example, the worker uses the job.report_status(numerator, denominator) method to inform the manager of its progress. Each time through the work loop, the worker will print a message, sleep for one second, and then send a status update to the manager saying that it is one step further through the cycle (a sketch; the sleep queue name matches the Python client that follows):
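```ruby
# status_worker.rb -- a sketch of a worker that reports its progress.
require 'rubygems'
require 'gearman'

worker = Gearman::Worker.new('localhost:4730')

worker.add_ability('sleep') do |data, job|
  seconds = data.to_i
  (1..seconds).each do |i|
    puts "Completed second #{i} of #{seconds}"
    sleep 1
    job.report_status(i, seconds)  # numerator, denominator -> WORK_STATUS
  end
  ''
end

loop { worker.work }
```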
The primary difference between the WORK_DATA response and the WORK_STATUS response is that the manager is expected to store the results of the WORK_STATUS message sent back by the worker, which allows detached clients to periodically determine the status of their job. The manager stores the numerator and the denominator so that subsequent inquiries by the client using the GET_STATUS message will be answered with the status of the job in question.
Unfortunately, the Ruby library doesn't have a good way to access a job by its job handle. As a result, this example is written using Python; the following example client will submit a background job that sleeps for six hundred seconds, periodically fetching the job's state (a sketch assuming the python-gearman 2.x library):
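```python
# status_client.py -- a sketch assuming the python-gearman 2.x library.
import time
import gearman

client = gearman.GearmanClient(['localhost:4730'])

# Submit a background job to the sleep queue with 600 as its data.
request = client.submit_job('sleep', '600', background=True)

while True:
    # Ask the manager for the stored numerator/denominator via GET_STATUS.
    request = client.get_job_status(request)
    status = request.status
    print('Progress: %s/%s' % (status['numerator'], status['denominator']))
    if status['denominator'] and status['numerator'] >= status['denominator']:
        break
    time.sleep(5)
```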
This Python client will submit a job to the sleep queue with the value of 600 as its data, and then, while the job is not 100% complete, it will periodically fetch the status from the manager, sleeping for five seconds between requests and repeating the loop until the job is complete.
Building a processing pipeline
By this point, you should have a pretty good handle on writing Gearman workers and clients. Using these tools, we can build a pipeline of workers for the purpose of data processing with Gearman. In such a system, each worker provides some subset of the overall data processing and passes the completed data to the next system for further processing, storage, and so on. Clients do not have to be standalone programs; a client is simply any program that submits jobs to a queue. This means that a program could act only as a client and only submit jobs, or it could be responsible for processing jobs as well (that is, act as both a worker and a client in the system). Building a processing pipeline involves moving work through the pipeline; once a worker is done processing some data, it passes the results on to other workers.
In the following architecture, a Rails-based web application submits a job to the resize_image queue. The image-resizing worker, once it has completed resizing an image, submits jobs to the queues process_metadata, generate_photostream, and notify_user to take further action on the newly-resized image. In this way, the image-resizing worker operates both as a worker and as a client.
The following example worker demonstrates this type of architecture by resizing an image and then submitting jobs to other workers in the pipeline to further process it (a sketch; the actual image manipulation is elided):
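```ruby
# resize_worker.rb -- a sketch of a worker that also acts as a client.
require 'rubygems'
require 'gearman'
require 'json'

SERVERS = ['localhost:4730']

worker = Gearman::Worker.new(SERVERS)

worker.add_ability('resize_image') do |data, job|
  image = JSON.parse(data)
  # The actual resizing is elided; imagine ImageMagick or similar being invoked here.
  resized = image['path'].sub(/(\.[^.]+)$/, '_small\1')

  # Act as a client and queue the follow-on jobs for the rest of the pipeline.
  client  = Gearman::Client.new(SERVERS)
  payload = JSON.generate('path' => resized, 'user' => image['user'])
  %w[process_metadata generate_photostream notify_user].each do |queue|
    taskset = Gearman::TaskSet.new(client)
    taskset.add_task(Gearman::Task.new(queue, payload, :background => true))
    taskset.wait(10)
  end

  resized
end

loop { worker.work }
```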
As you can see in the previous example, once the image is processed, it requests that other workers take on the tasks of processing the image metadata, adding the newly resized image into a live photo stream (possibly a page where all the recently uploaded or resized images are displayed), and then notifies the user that the image was successfully resized.
One of the key design goals of Gearman was that it should be language agnostic, allowing you to take advantage of the strengths of various programming languages. Because Gearman is a network protocol, any language can have first-class support for Gearman. There are libraries for most languages and, if one does not exist, writing a library is a fairly trivial exercise because of the well-documented and straightforward network protocol.
When processing messages, the manager does not inspect anything that is not in the header of the message. The header contains information needed to determine how the manager should handle the message including what type of message is being processed, which queue the message belongs to, the priority of the message, and a few other attributes that vary with the message type.
Tip
To best leverage multiple programming languages, your data will need to be passed between the workers and clients in a format that is not language-specific. All of the examples in this book use JSON as an intermediate format due to the wide array of libraries available for parsing and generating JSON data, but that is not a requirement. Choose whatever serialization mechanism works best for you.
Given this, let's look at some ways we could optimize our example application. Initially, in our example, the web application and workers are written in Ruby. Ruby is not necessarily as well-suited to image processing as other languages are, giving us a place where we could easily introduce some optimization. Because our architecture is composed of multiple workers using Gearman as the glue between them, we can easily replace various components in the system to take advantage of a worker that is better optimized for a given type of work.
Some enhancements to the system might include:
Replacing our image-resizing worker with one written in C++ using the OpenCV graphics library.
Running that worker on cloud servers with dedicated GPUs to make image processing even faster (OpenCV CUDA functions can operate as much as 20-30 times faster than CPU-based operations on images).
Gearman managers have two main modes of operation:
Non-persistent, in-memory temporal storage
Write-through in-memory storage with a persistent storage engine
Depending on your needs and the server you are using, you can adjust this behavior. In the daemon we are using, persistence engines are designed to store the actual job data, while the data stored in memory is the minimal set required to perform jobs. When no persistence engine is chosen, the default is an in-memory persistence engine. In this mode of operation, any termination of the manager will result in a loss of jobs, as they are stored only in memory. Some daemons have modes where they operate only in-memory but will persist the contents of their memory to disk if they are shut down cleanly.
This information is important because there are a few different impacts that this has on both performance and expected behavior. Any time you introduce persistence there will be degradation in performance due to the time required to store the data on-disk, and optionally verify that it was written. This performance hit will vary based on a variety of factors such as the frequency that data is persisted (that is, write-through or write-behind strategies), the storage medium being written to, and network conditions. It is left up to the reader to test out the various implementations being evaluated and determine which, if any, persistence technologies to put in place in their production environment.
Persistent versus non-persistent
The cost of performance is reliability. In-memory queues are the highest performing because they don't have to wait for the data to be stored a second time. However, when using an in-memory-only data store, any number of issues can cause you to lose your job queues such as power failure, system errors, or simply incorrectly terminating the process. It is important to remember that in this mode, if the service terminates, the queues are lost as well.
Depending on your situation, the loss of jobs may have no significant impact on the overall functionality of the system. If, for example, you have a system that uses Gearman to process all the files in a given directory every minute, jobs are routinely being recreated, so memory-only queues would provide the highest throughput. If the manager is terminated and loses the job queues, but is brought back online within a few minutes, then only a few minutes of processing would be lost.
However, in other cases and for a variety of reasons, it is critical that a job that is submitted be completed. Perhaps the data being passed to the job is unique and cannot be regenerated (such as a new user sign-up), or the process has a very long window between runs (such as a job that only runs once per week). In these types of use cases, the manager must guarantee to the client that, once the manager has acknowledged that the job has been created, a worker will process the job at some point.
The degree to which that data is guaranteed is up to you—the backing data store could be Redis, PostgreSQL, MySQL, Riak, or any number of other storage engines that the servers can talk to. On one end of the spectrum you might use Redis for fast, reasonably safe write-behind data storage; on the other you might choose a PostgreSQL cluster with guaranteed transactional storage, streaming replication, and hot failover. Each of these configurations comes with its own set of concerns and performance impacts; which one you choose will depend on your environment and your storage needs.
Gearman provides us with a mechanism and the infrastructure required for developing scalable, multiplatform, and parallel systems. By learning to use it in your applications, you can build systems that are capable of easily processing large amounts of data. Additionally, you can use Gearman's asynchronous features to offload data processing from the frontend to backend systems, allowing your user interface to be responsive and flexible. Whether you are moving sign-up processing to the background or implementing data crunchers that process gigabytes of geospatial data, having tools like Gearman in your tool belt can help you solve a number of difficult problems as well as provide you with the tools you need to handle your application's growth.