Optimizing Lighttpd

If our Lighttpd runs on a multi-processor machine, it can take advantage of that by spawning multiple instances of itself. Also, most Lighttpd installations will not have a machine to themselves; therefore, we should measure not only the speed but also the resource usage.

Optimizing Compilers: gcc with the usual settings (-O2) already does quite a good job of creating a fast Lighttpd executable. However, -O3 may nudge the speed up a tiny bit (or slow it down, depending on our system) at the cost of a bigger executable. If there are optimizing compilers for our platform (for example, Intel and Sun Microsystems each have compilers that optimize for their CPUs), they might even give another tiny speed boost.

If we do not want to invest money in commercial compilers, but want to get the most out of gcc, we can use Acovea, an open source project that employs genetic algorithms and trial-and-error to find the best individual gcc settings for our platform. Get it from http://www.coyotegulch.com/products/acovea/.

Finally, optimization should stop where security (or, to a lesser extent, maintainability) is compromised. A slower web server that does what we want is way better than a fast web server obeying the commands of a script kiddie.

Before we blindly optimize, we had better have a way to measure the "speed". A useful measure most administrators will agree on is "served requests per second". http_load is a tool that measures exactly that. We can get it from http://www.acme.com/software/http_load/.

http_load is very simple. Give it a site to request, and it will flood the site with requests, measuring how many are served in a given amount of time. This allows a very simple approach to optimizing Lighttpd: tweak some settings, run http_load with a sufficiently realistic scenario, and see if our Lighttpd handles more or fewer requests than before.

We do not yet know where to spend time optimizing. For this, we need to make use of timing log instrumentation that has been included with Lighttpd 1.5.0 or even use a profiler to see where the most time is spent. However, there are some "big knobs" to turn that can increase performance, where http_load will help us find a good setting.

Installing http_load

http_load can be downloaded as a source .tar file (which was named .tar.gz for me, though it is not gzipped). The version as of this writing is 12Mar2006. Unpack it to /usr/src (or another path, adjusting /usr/src below accordingly) with:

$ cd /usr/src && tar xf /path/to/http_load-12Mar2006.tar.gz
$ cd http_load-12Mar2006

We can optionally add SSL support. We may skip this if we do not need it.

To add SSL support, we need to find out where the SSL libraries and includes are. I assume they are in /usr/lib and /usr/include, respectively, but they may be elsewhere on your system. Additionally, there is an "SSL tree" directory, usually /usr/ssl or /usr/local/ssl, which contains certificates, revocation lists, and so on. Open the Makefile with a text editor and look at lines 11 to 14, which read:

#SSL_TREE = /usr/local/ssl
#SSL_INC = -I$(SSL_TREE)/include
#SSL_LIBS = -L$(SSL_TREE)/lib -lssl -lcrypto

Change them to the following (assuming the given directories are correct):

SSL_TREE = /usr/ssl
SSL_INC = -I/usr/include
SSL_LIBS = -L/usr/lib -lssl -lcrypto

Now compile and install http_load with the following command:

$ make all install

Now we're all set to load-test our Lighttpd.

Running http_load tests

We just need a URL file containing URLs that lead to pages our Lighttpd serves. http_load will then fetch these pages at random, as long as, or as often as, we ask it to. For example, we may have a front page with links to different articles. To get started, we can put the URL of our front page into the URL file, which we will name urls; for example, http://localhost/index.html.

Note that the file contains just URLs, nothing more, nothing less (for example, http_load does not support blank lines). Now we can make our first test run:
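To get started, a minimal urls file might look like this (the paths beyond index.html are placeholders; use pages our Lighttpd actually serves):

```
http://localhost/index.html
http://localhost/articles/article1.html
http://localhost/css/style.css
```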

$ http_load -parallel 10 -seconds 60 urls

This will run for one minute, keeping 10 connections open in parallel. Let's see if our Lighttpd keeps up:

343 fetches, 10 max parallel, 26814 bytes, in 60 seconds
78.1749 mean bytes/connection
5.71667 fetches/sec, 446.9 bytes/sec
msecs/connect: 290.847 mean, 9094 max, 15 min
msecs/first-response: 181.902 mean, 9016 max, 15 min
HTTP response codes:
code 200 - 327
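To compare runs, it helps to extract the key figure from this output programmatically. Here is a minimal sketch (parse_fetches_per_sec is our own helper, not part of http_load):

```python
import re

def parse_fetches_per_sec(output):
    # Pull the "fetches/sec" figure out of http_load's summary output.
    match = re.search(r"([\d.]+) fetches/sec", output)
    return float(match.group(1)) if match else None

# Feed it the summary line from the run above:
sample = "5.71667 fetches/sec, 446.9 bytes/sec"
print(parse_fetches_per_sec(sample))  # -> 5.71667
```

We can run http_load before and after a configuration change and compare the two numbers.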


As we can see, it does. http_load needs one of the two start conditions, one of the two stop conditions, and a URL file to run. We can create the URL file manually or crawl our document root(s) with the following Python script called crawl.py:

# run from document root, pipe into URLs file. For example:
# /path/to/docroot$ crawl.py > urls
import os, re
hostname = "http://localhost/"
for (root, dirs, files) in os.walk("."):
    for name in files:
        filepath = os.path.join(root, name)
        print re.sub(r"^\./", hostname, filepath)


You can download the crawl.py file from http://www.packtpub.com/files/code/2103_Code.zip.

Capture the output into a file to use as URL file. For example, start the script from within our document root with:

$ python crawl.py > urls

This will give us a urls file, which will make http_load try to get all files (given that we have specified enough requests). Then we can start http_load as discussed in the preceding example. http_load takes the following options:



Required start condition (use exactly one):

-rate        Try to start the given number of new connections per second. Use a high value to see how high we can ramp up the load.

-parallel    Keep the given number of connections open at any given moment - which will work unless Lighttpd is so fast that http_load cannot keep up.

Required stop condition (use exactly one):

-seconds     Keep up the load for the given time in seconds.

-fetches     Amass the given number of requests.

Optional arguments:

-verbose     Output statistics every minute (useful for longer test runs).

-proxy       Use the proxy specified by host name and port.

-timeout     Time out every request after the given number of seconds (defaults to 60).

-cipher      (Only if SSL is enabled.) Selects the TLS cipher for https addresses. We can use one of three keywords or a cipher name. The keywords are: fastsec (RC4-MD5), highsec (DES-CBC3-SHA), and paranoid (AES256-SHA). The SSL ciphers manpage has a list of all cipher names.

-jitter      If -rate was specified as the start condition, randomly deviate up to ±10% from the given rate.

-throttle    Simulate access by modem users (33.6 kbps).

-sip         Select a random IP address from the given file (one IP address per line) to use as the source address. See below.

For the -sip option, we will need a list of IP addresses. Here is a useful Python script that will write a number of distinct IP addresses within a given subnet into a file named ips (which we can then route to our loopback device):

#!/usr/bin/env python
# run with: makeips.py 1000 101.202.0.0
# to create 1000 ip entries in the subnet 101.202.*.*
import random, sys
ips = {}
def makeip(subnet, ip=None):
    while ip is None or ips.get(ip):
        ip = ".".join(x != "0" and x or str(int(random.random()*256))
                      for x in subnet.split("."))
    ips[ip] = 1
    return ip
def makeips(amount, subnet):
    maxips = 256 ** sum("0" == x for x in subnet.split("."))
    if maxips < amount:
        print "Can only fit %i ips in the subnet %s." % (maxips, subnet)
        amount = maxips
    ipfile = open("ips", "w")
    for i in xrange(amount):
        ipfile.write(makeip(subnet) + "\n")
if __name__ == "__main__":
    try: makeips(int(sys.argv[1]), sys.argv[2])
    except (IndexError, ValueError):
        print "usage: python makeips.py [amount] [subnet]"

You can download the makeips.py code file at http://www.packtpub.com/files/code/2103_Code.zip.

With this, we can route the subnet to the loopback device and make it look as if different clients are requesting the pages. We can use this to counter the effect of expiry headers on our tests; the only alternative would be to remove mod_expire from the configuration, which is probably not desirable.

Route a whole subnet to our local host or use one we already have. With UNIX-like operating systems (Linux, Solaris, BSD, and Mac OS X), use the route add command. On Windows (with or without cygwin), we can make use of the fact that the whole 127.*.*.* network is looped back to the default IP 127.0.0.1. So running python makeips.py 255 127.0.0.0 will give us a range of IP addresses we can use even if we cannot change our routing tables.
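On Linux, for example, the subnet from the makeips.py comment could be pointed at the loopback device like this (run as root; the subnet is just the example one from above):

```
# classic net-tools:
route add -net 101.202.0.0 netmask 255.255.0.0 dev lo
# or with iproute2:
ip route add local 101.202.0.0/16 dev lo
```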


Before we work on fine tuning our network, we can tweak some configuration settings to increase performance:

Select the best event handler and network backend for the job. Here is the recommendation per system:

System              Event handler       Network backend
Linux 2.6           linux-sysepoll      linux-sendfile
Linux 2.4           linux-rtsig         linux-sendfile
Solaris             solaris-devpoll     solaris-sendfilev
FreeBSD, Mac OS X   freebsd-kqueue      freebsd-sendfile
Other UNIX          poll                writev
Windows             select              write (give up all hope)
  1. Use sendfile64 only if large files (> 2GB) are disabled, which we should do if we do not serve files that big.
  2. If we have dynamic content, we should choose our CGI protocol wisely. FastCGI and SCGI are better choices than plain CGI. If we have the luxury of choosing our CGI language, we might want to try a small and fast scripting language such as Python or even Lua.
  3. Static data is best served from static files. The usual file systems in use today will easily outperform any database/CGI solution. If we have a big page with lots of JavaScript and only a smaller part of the page is dynamic, we should put the JavaScript into a static .js file and link to it.
  4. Use SSL only for sensitive data. Clients do not cache data sent over SSL. So if we have images or other static data that does not compromise the client's information, send it over to plain HTTP.
  5. Remove unused modules from the configuration. This is a win-win option for both speed and memory.
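The event handler and network backend are selected in lighttpd.conf. For example, on a Linux 2.6 system:

```
server.event-handler = "linux-sysepoll"
server.network-backend = "linux-sendfile"
```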

There are some settings that affect speed and stability, and depend on the scenario we deploy Lighttpd in. For example, if we have a huge number of concurrent connections open, we can run out of file descriptors. We can counter that by increasing the number of file descriptors in the kernel and setting server.max-fds higher (default is 1024). If we have a lot of small requests, we might increase server.max-keepalive-requests. On the other hand, if we send out a few big files at any given time, we might want to increase the send buffer (note that this has to be allocated for each request, so it might eat into our memory pretty fast). The following are the three scenarios with settings that should give good performance:

  1. Many small requests (typical for AJAX applications): The defaults are quite good here, although for big applications we might raise server.max-keep-alive-requests to 256 or even higher (experiment to see how many sessions we can keep alive without running into the file handle barrier).
  2. Big requests (for example, YouTube): Increase the send buffer; for example, on Linux set the kernel configuration to:
    net.ipv4.tcp_wmem = 4096 65536 524288
    net.core.wmem_max = 1048576

    On BSD set net.inet.tcp.sendspace = 8192 (or even higher, but remember it eats a lot of the kernel RAM per open connection).

  3. Big file uploads: Do the same for the read buffer: under Linux do the same as in scenario 2, but replace "wmem" with "rmem"; under BSD set net.inet.tcp.recvspace = 8192.
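The Lighttpd-side limits mentioned above are plain configuration settings; for example (the values are starting points to test with http_load, not universal recommendations):

```
server.max-fds = 2048
server.max-keep-alive-requests = 256
```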

With any server that handles many requests per second, a huge pile of file descriptors is a good thing to have. Note that other applications are also using up file descriptors, for example, the CGI backends (if we have more than one).

These are just common scenarios and some tips to work with them. In any case, run http_load and look at the result. If the throughput is higher, and/or the latency lower, good! If not, roll back the change and try something else.


Specific optimizations

Until now, our methods and tools to measure performance have been quite blunt - we can see how fast our Lighttpd is with a specific optimization, but we do not know where to start. Amdahl's law (see http://en.wikipedia.org/wiki/Amdahl%27s_law) implies that if we optimize a portion of the code that takes up a fraction X of the runtime, the resulting overall speedup is limited by X. The downside of this is that optimizing code that never gets called is a good way to throw away our time. The upside is that if we know which portions of the code take up most of the time, we know where to optimize.

A crude way of finding out where Lighttpd spends its time (at least between reading and writing) is timing logging. As of Lighttpd 1.5.0, there is a new configuration option: debug.log-timing. When enabled, it inserts timing information into the log files. For each request, the start time plus three intervals are logged: the time the request waited in the read queue, the time used for reading the request, and the time used for writing the response, in the following format:

write-start: #.#### read-queue-wait: #### ms read-time: #### ms
write-time: #### ms

This timing can be helpful if we want to know whether we should spend our time on optimizing the read cycle or the write cycle. As a rule of thumb, if we have a big read-queue-wait time, we may have too many requests, so increasing file handles or maybe even load-balancing on multiple systems might help. If there is a long read time, look out for uploads or big forms, and try to select a better event handler. If the write time is long, see if we can improve the network backend; or, if Lighttpd serves a dynamic page, see if we can improve the web application: perhaps we can use a different CGI backend, introduce caching, or use mod_magnet for very small tasks.
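Enabling the timing log takes a single line in the configuration (Lighttpd 1.5.0 or later):

```
debug.log-timing = "enable"
```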

Example: Caching with mod_magnet

Suppose we have a PHP script that runs through a database, fetches a set of records, and creates an HTML page. So far so slow; our database, PHP interpreter, and CGI interface are taxed on every request. Further, suppose that we do not really need millisecond up-to-date data. We could then regenerate the page at most every five minutes and serve a cached copy in between.

First, we change our PHP script to write the HTML output into a file instead of standard output and send an X-Lighttpd-Sendfile header (enable this in the CGI backend configuration - refer to Appendix B). This has two benefits: Lighttpd can send out the file directly with no speed penalty, and we have the cached file. Make sure our Lighttpd is built with Lua support. Now, we can add the following configuration:

server.modules = ( ..., "mod_magnet", ...)
magnet.attract-physical-path-to = ("/application/" => "app.lua")

Our app.lua can then check whether the file is in the cache and less than five minutes old and, if not, call out to PHP. The following code does exactly that:

-- app.lua: cache a PHP application for 5 minutes
php_path = "/app.php"
cache_dir = lighty.env["physical.doc-root"] .. "/cache/"
cache_time = 300 -- 300 seconds = 5 minutes
path = lighty.env["physical.rel-path"]
s = lighty.stat(cache_dir .. path)
if s == nil then -- not in cache, call out to PHP
    lighty.env["request.uri"] = php_path
    return lighty.RESTART_REQUEST
end
if s["st_mtime"] + cache_time < os.time() then -- too old, call out to PHP
    lighty.env["request.uri"] = php_path
    return lighty.RESTART_REQUEST
end
lighty.header["Content-Type"] = "text/html"
lighty.content = { { filename = cache_dir .. path } }
return 200

This may look like a special case, but there are many web applications out there that do not use any caching at the application level. Plus, integrating the cache into the web server pays off in every case: when the page is cached, it is served almost as fast as a static file.

On the other hand, if the page is not in the cache or is too old, the X-Lighttpd-Sendfile header trick at least reduces the number of file handles needed for the transaction and improves the throughput by shifting the work from our Lighttpd process to the operating system.

Measuring system load

From a holistic viewpoint (or if we plan to invest in hardware), we might be interested in which resource is limiting the performance of our Lighttpd. Most UNIX-like systems have a command, vmstat, which shows a small table of system load parameters:

procs ---------memory---------- --swap-- ---io--- --system- ----cpu---
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 302064 0 0 0 0 0 0 0 0 0 0 100 0

In this case, the system is sitting idle. The following fields are of particular interest:

Section / Field     Description

memory / swpd       The amount of swap space used. Ideally, this should be zero. If our Lighttpd gets swapped out, performance will degrade dramatically.

memory / free       The amount of free memory. If this gets close to zero, watch out for swapping.

io / bi, io / bo    The number of blocks received and sent, respectively. High values here are nothing bad by themselves.

cpu / us, cpu / sy  The user mode and kernel mode CPU times, as a percentage of the available CPU time. Add them; if the sum is close to 100, we need more CPU capacity.

cpu / id            The percentage of time the CPU sits idle. If it is greater than zero, our CPU load is not too unhealthy.

cpu / wa            The percentage of time the CPU had to wait for I/O. This is of special interest to us, as it shows whether we are in need of more threads, a different IO backend, and so on.

Under Microsoft Windows operating systems, the task manager shows CPU load, memory/swap file usage, and network performance. If one of these maxes out, we have a candidate for improvement. The basic idea is the same as with the UNIX-like systems.

Profiling with gprof

To see where Lighttpd is spending its time in more detail, the use of a profiler is recommended. gcc comes with a profiling tool called gprof. We first need to tell gcc to prepare a Lighttpd version for profiling, then put it under load with http_load, stop Lighttpd, and run gprof to get a list of functions sorted by the time spent, which we can then interpret to see what to optimize. Now, let's see each step in more detail.

We can create a gprof-ready Lighttpd by specifying a flag for the C compiler. This is done with the following commands before calling configure:

$ export CFLAGS=-pg
$ export LDFLAGS=-pg

Then proceed to create a Lighttpd build as usual. We might also want to install this Lighttpd in a location different from our production build, as the profiling code will slow down our Lighttpd slightly, and may also fill our file system with profiling data while running. So use the configure --prefix argument to specify a different location, for example, configure --prefix=/opt/lighttpd-gprof.

The build might fail with an error of "undefined references to _mcount". In this case, edit the libtool shell script created by configure. Search for a line compiler_flags= and add -pg so that the line says compiler_flags=-pg. Now, make clean all should build our Lighttpd with profiling support.

Given that our build succeeded, we can now execute our profiling Lighttpd and test load to get the profiling data.


Load testing our profiling build

Our profiling build can be run exactly as usual. For directions on load testing using http_load, see the earlier example. For this example, we use a 100-byte HTML file and set http_load to fetch it 10,000 times with 10 parallel connections. After running the test and stopping our Lighttpd, we should find a new file with the name gmon.out (given a reasonably recent version of gprof). We can now run gprof to get some statistics. gprof needs at least two parameters: the path to the Lighttpd executable and the path to the gmon.out file our profiling run just created. For example:

$ gprof /opt/lighttpd-gprof/sbin/lighttpd gmon.out

This will show a lot of text, including two interesting tables: the flat profile and the call graph. The flat profile is a table of functions with a percentage of the full runtime, the cumulative runtime in seconds, the internal runtime (self) in seconds, the number of calls, the internal and cumulative runtimes per call, and finally the function name.

The cumulative runtime of a function is the time from when the first line of the function is executed until the execution returns from the function, whereas the internal runtime is the cumulative runtime minus the sum of the cumulative runtimes of all functions called from the function.
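The relationship can be spelled out with a toy calculation (the numbers are invented, not taken from the profile below):

```python
def internal_time(cumulative, callee_cumulative_times):
    # Internal (self) runtime = a function's cumulative runtime minus the
    # cumulative runtimes of everything it calls.
    return cumulative - sum(callee_cumulative_times)

# A function ran 0.50s in total and spent 0.28s + 0.12s inside its callees:
print(round(internal_time(0.50, [0.28, 0.12]), 2))  # -> 0.1
```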

The flat profile is very helpful to determine where our time is best spent optimizing. For us, the internal runtime and the number of calls show how much an optimization might affect total performance. The list is ordered by the percentage of the total time. So the higher on the list a function is, the more time we can shave off by optimizing it. The following is our example flat profile:

% cumulative self self total
time seconds seconds calls ms/call ms/call name
12.79 0.50 0.50 10000 0.05 0.08 http_request_parse
7.42 0.79 0.29 150179 0.00 0.00 array_get_index
5.63 1.01 0.22 370497 0.00 0.00 buffer_caseless...
4.35 1.18 0.17 13806 0.01 0.02 connection_hand...
3.84 1.33 0.15 20128 0.01 0.02 connection_reset
3.84 1.48 0.15 20599 0.01 0.16 connection_stat...
3.32 1.61 0.13 300039 0.00 0.00 buffer_append_s...
3.07 1.73 0.12 10000 0.01 0.06 http_response_p...
2.56 1.83 0.10 10000 0.01 0.05 http_response_w...
2.56 1.93 0.10 10000 0.01 0.01 network_write_c...
2.30 2.02 0.09 370095 0.00 0.00 buffer_prepare_...
2.30 2.11 0.09 40018 0.00 0.00 LI_ltostr
1.66 2.18 0.07 350705 0.00 0.00 buffer_prepare_...
1.66 2.24 0.07 120001 0.00 0.00 buffer_is_equal
1.66 2.31 0.07 633052 0.00 0.00 buffer_reset
1.53 2.37 0.06 310064 0.00 0.00 buffer_copy_str...
1.53 2.43 0.06 80855 0.00 0.00 chunkqueue_remo...
1.53 2.49 0.06 10000 0.01 0.01 connection_close
1.53 2.55 0.06 __divdi3
1.53 2.61 0.06 70019 0.00 0.00 array_insert_un...
1.28 2.66 0.05 120000 0.00 0.00 buffer_append_s...
1.28 2.71 0.05 20000 0.00 0.00 hashme
1.28 2.76 0.05 10000 0.01 0.02 stat_cache_get_...
1.28 2.81 0.05 etag_mutate
1.02 2.85 0.04 60384 0.00 0.00 array_reset
1.02 2.89 0.04 40018 0.00 0.00 buffer_append_long
1.02 2.93 0.04 30256 0.00 0.00 config_setup_co...
1.02 2.97 0.04 10238 0.00 0.02 connection_accept
1.02 3.01 0.04 10000 0.00 0.02 network_write_c...
1.02 3.05 0.04 10000 0.00 0.01 request_check_h...
... ... ... ... ... ... ...

The long function names were cut off in the name of readability, as were all functions below one percent of the runtime. As we can see, the http_request_parse function takes the biggest chunk of runtime. This should be so, given that we are sending out the same short file over and over again, which should be cached from the second request onwards. Note that the http_request_parse function would be top priority on our list should we want to optimize the code, because it has the biggest internal runtime and also gets a decent number of calls (one per request).

An even more detailed report is the call graph. It shows a table for every function with a list of callers before and a list of callees after it. For each entry, the time (complete and internal), and the number of calls (from the parent function and total) is shown. We can use this to find out why a function is called so often, and trace the code paths. However, without a decent visualization, we can easily get lost in the mountains of data. Here is an example call tree (again, function names are cut off for brevity):


index	% time	self	children	called		name
[1] 94.7 0.01 3.69 main [1]
0.00 1.90 288/288 network_server_h... [3]
0.08 1.67 10599/20599 connection_state... [2]
0.00 0.02 1/1 connections_free [60]
0.01 0.01 599/599 connection_hand... [61]
0.00 0.00 1/1 config_read [90]
0.00 0.00 1/1 server_free [95]
0.00 0.00 2/3 log_error_write [104]
0.00 0.00 1/1 plugins_load [107]
0.00 0.00 4/50159 array_get_element [15]
0.00 0.00 1/1 log_error_open [109]
0.00 0.00 1/1 network_init [110]
0.00 0.00 292/292 stat_cache_tri... [111]
0.00 0.00 1/1 plugins_free [115]
0.00 0.00 1/1 config_set_def... [117]
0.00 0.00 1/1 server_init [122]
0.00 0.00 1/10001 fdevent_unregister [42]
0.00 0.00 1/1 network_close [124]
0.00 0.00 1/1 network_regist... [127]
0.00 0.00 1/26794 fdevent_event_del [64]
0.00 0.00 10599/20599 plugins_call_h... [133]
0.00 0.00 887/887 fdevent_event_... [150]
0.00 0.00 887/887 fdevent_event_... [149]
0.00 0.00 887/887 fdevent_event_... [148]
0.00 0.00 887/887 fdevent_get_ha... [152]
0.00 0.00 887/887 fdevent_get_co... [151]
0.00 0.00 639/639 fdevent_poll [156]
0.00 0.00 292/292 plugins_call_h... [160]
0.00 0.00 1/1 plugins_call_init [188]
0.00 0.00 1/1 plugins_call_s... [189]
0.00 0.00 1/1 fdevent_init [180]
0.00 0.00 1/1 stat_cache_init [192]
0.00 0.00 1/10001 fdevent_fcntl_set [134]
0.00 0.00 1/1 log_error_close [186]
0.07 1.58 10000/20599 network_server_h... [3]
0.08 1.67 10599/20599 main [1]
[2] 86.9 0.15 3.25 20599 connection_state_mac... [2]
0.02 1.07 10000/10000 connection_handl... [4]
0.50 0.28 10000/10000 http_request_parse [5]
0.12 0.50 10000/10000 http_response_pr... [6]
0.16 0.06 13207/13806 connection_hand... [14]
0.02 0.18 10000/10000 connection_hand... [19]
0.07 0.10 10000/20128 connection_reset [10]
... ... ... ... ... ...

Although we now have specific numbers on where our Lighttpd uses its time, and even which functions get called from where, we still do not know how to optimize the settings to reduce the runtime.

Alas, unless we want to optimize directly in the source code (which, thanks to Lighttpd being available in source form under the revised BSD license, we can), there is no easy way apart from trial and error to find out which setting creates which effect. At least we can see where the effect comes into play.


Knowing what to optimize beats knowing how to optimize. Therefore, load testing and collecting usage statistics is paramount to improving throughput and minimizing latency. Probably, the most important thing about optimization is to know when to stop.

At the moment, there is no easier way than trial and error to find out what makes our Lighttpd work faster. On the other hand, optimizing for performance may conflict with other goals such as security and maintainability.

Logging the timing of the request or response phases can give us a broad overview where to optimize first. Knowing which system resources limit our Lighttpd's performance can also give us a hint on what to do. If we need a more detailed picture of where our Lighttpd spends its time, profiling is our course of action.
