Application Performance

by Shantanu Kumar | November 2013 | Open Source

This article by Shantanu Kumar, the author of Clojure High Performance Programming, discusses high-level performance concerns. As opposed to performance analysis and optimization at the level of individual components, it takes a holistic approach at the application level. Higher-level concerns, such as serving a certain threshold of users in a day or handling an identified quantum of load through a multilayered system, require us to think about how the components fit together and how the load is designed to flow through the application.


Data sizing

The cost of abstractions in terms of data size plays an important role in performance. For example, whether a data element can fit into a processor cache line depends directly upon its size. On a Linux system, we can find out the cache line size and other parameters by inspecting the files under /sys/devices/system/cpu/cpu0/cache/.
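For instance, the following minimal sketch reads the cache line size from that location (the exact path is an assumption for a typical Linux system and may differ across kernels and hardware):

(require '[clojure.string :as str])

;; read the L1 cache line size (in bytes); the path below is an
;; assumption and may not exist on every system
(-> "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"
    slurp
    str/trim
    Long/parseLong)
;; => 64 on most current x86-64 processors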

Another concern we generally find with data sizing is how much data we hold in the heap at a time, because this directly affects garbage collection (GC), which in turn affects the application's performance. While processing data, we often do not really need all the data we hold on to. Consider the example of generating a summary report of sold items over a period of several months. Once the summary for a subperiod (a month) is computed, we no longer need that month's item details, so it is better to drop the unwanted data as we add the summaries. This is shown in the following example:

(defn summarize [daily-data] ; daily-data is a map
  (let [s (items-summary (:items daily-data))]
    (-> daily-data
        (select-keys [:digest :invoices]) ; keep only the required key/val pairs
        (assoc :summary s))))

;; now inside the report generation code
(->> (fetch-items period-from period-to :interval-day)
     (map summarize)
     generate-report)

Had we not used select-keys in the preceding summarize function, it would have returned a map containing the summary data along with all the other existing keys in the map. Such transformations are often combined with lazy sequences, and for this scheme to work, it is important not to hold on to the head of the lazy sequence.
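The following sketch, reusing the hypothetical functions from the preceding example, contrasts holding the head with letting it go:

;; anti-pattern: the var holds the head of the lazy sequence, so every
;; realized element stays reachable for the lifetime of the var
(def all-items (fetch-items period-from period-to :interval-day))
(generate-report (map summarize all-items))

;; better: no reference to the head outlives the pipeline, so elements
;; already consumed by generate-report become eligible for GC
(->> (fetch-items period-from period-to :interval-day)
     (map summarize)
     generate-report)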

Reduced serialization

An I/O channel is a common source of latency. The perils of over-serialization cannot be overstated. Whenever we read data from or write data to a data source over an I/O channel, all of that data needs to be prepared, encoded, serialized, de-serialized, and parsed before being worked on. The less data involved at each of these steps, the lower the overhead. Where there is no I/O involved, such as in in-process communication, it generally makes no sense to serialize at all.
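As a toy illustration, pr-str and read-string stand in below for any encode/decode pair, and consume is a hypothetical in-process consumer:

(def payload {:id 42 :items [1 2 3]})

;; wasteful: encoding to a string and parsing it back within the same process
(consume (read-string (pr-str payload))) ; consume is a hypothetical fn

;; better: hand the immutable data structure to the consumer directly
(consume payload)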

A common example of over-serialization is encountered while working with SQL databases. Often, there are general-purpose SQL query functions that fetch all columns of a table or a relation, and they are called by the various functions that implement the business logic. Fetching data that we do not need is wasteful and detrimental to performance, for the same reason discussed in the preceding paragraph. While it may seem like more work to write one SQL statement and one database query function per use case, it pays off with better performance. Code that uses NoSQL databases is also subject to this anti-pattern; we have to take care to fetch only what we need, even though it may lead to additional code.
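As a minimal sketch using clojure.java.jdbc (the emp table, its columns, and the db connection spec are assumptions for illustration):

(require '[clojure.java.jdbc :as jdbc])

;; wasteful: fetches every column even though the caller needs only two
(jdbc/query db ["SELECT * FROM emp WHERE country=?" 1])

;; better: a use case specific query that fetches only what is needed
(jdbc/query db ["SELECT empno, ename FROM emp WHERE country=?" 1])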

There is a pitfall to be aware of when reducing serialization. Often, some information needs to be inferred in the absence of the serialized data. In such cases, where some of the serialization is dropped so that we can infer other information, we must compare the cost of inference against the serialization overhead. The comparison need not necessarily be done per operation, but rather on the whole; we can then consider the resources we can allocate in order to achieve the desired capacities for the various parts of our systems.

Chunking to reduce memory pressure

What happens when we slurp a text file regardless of its size? The contents of the entire file will sit in the JVM heap. If the file is larger than the JVM heap capacity, the JVM will terminate by throwing OutOfMemoryError. If the file is large but not large enough to force an OOM error, it leaves relatively little JVM heap space for the application's other operations. A similar situation arises whenever we carry out any operation disregarding the JVM heap capacity. Fortunately, this can be fixed by reading data in chunks and processing each chunk before reading further.
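For example, the following sketch processes a large text file in chunks of lines; process-chunk is a hypothetical function that consumes one chunk, and the chunk size of 10,000 lines is an assumption to be tuned:

(require '[clojure.java.io :as io])

;; read 10,000 lines at a time instead of slurping the whole file;
;; line-seq is lazy, so only the current chunk must fit in memory
(with-open [rdr (io/reader "/path/to/big-file.txt")]
  (doseq [chunk (partition-all 10000 (line-seq rdr))]
    (process-chunk chunk))) ; hypothetical fn consuming one chunk of lines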

Sizing for file/network operations

Let us take the example of a data ingestion process where a semi-automated job uploads large comma-separated values (CSV) files via the File Transfer Protocol (FTP) to a file server, and another automated job, written in Clojure, runs periodically to detect the arrival of files via the Network File System (NFS). After detecting a new file, the Clojure program processes the file, updates the result in a database, and archives the file. The program detects and processes several files concurrently. The size of the CSV files is not known in advance, but the format is predefined.

As per the preceding description, one potential problem is that, since multiple files may be processed concurrently, we have to decide how to distribute the JVM heap among the concurrent file-processing jobs. Another issue is that the operating system imposes a limit on how many files can be open at a time; on Unix-like systems, the ulimit command can be used to raise that limit. We cannot arbitrarily slurp the CSV file contents; we must limit each job to a certain amount of memory and also limit the number of jobs that can run concurrently. At the same time, we cannot read only a very small number of rows from a file at a time, because the per-read overhead would hurt performance.

(def ^:const K 1024)

;; create the buffered reader using a custom 128K buffer size
(-> filename
    java.io.FileInputStream.
    java.io.InputStreamReader.
    (java.io.BufferedReader. (* K 128)))

Fortunately, we can specify the buffer size when reading from a file or even from a network stream, so as to tune memory usage and performance as appropriate. In the preceding code example, we explicitly set the reader's buffer size to achieve exactly that.

Sizing for JDBC query results

JDBC, Java's interface standard for SQL databases (technically not an acronym), supports a fetch size for retrieving query results via JDBC drivers. The default fetch size depends on the driver: most drivers keep a low default value so as to avoid high memory usage and to enable internal performance optimizations. A notable exception to this norm is the MySQL JDBC driver, which completely fetches and stores all rows in memory by default.

(require '[clojure.java.jdbc :as jdbc])

;; using prepare-statement directly
;; (we rarely use it directly; shown here just for demonstration)
(with-open [stmt (jdbc/prepare-statement conn sql
                                         :fetch-size 1000
                                         :max-rows 9000)]
  (vec (resultset-seq (.executeQuery stmt))))

;; using query
(jdbc/query db [{:fetch-size 1000}
                "SELECT empno FROM emp WHERE country=?" 1])

When using the Clojure Contrib library java.jdbc (https://github.com/clojure/java.jdbc as of Version 0.3.0), the fetch size can be set while preparing a statement as shown in the preceding example.

The fetch size does not guarantee a proportional improvement in latency; however, it can safely be used for memory sizing.

We must test any latency changes caused by the fetch size at different loads and for the various use cases of the particular database and JDBC driver. Besides :fetch-size, we can also pass the :max-rows argument to limit the maximum number of rows returned by a query. However, this implies that the extra rows will be truncated from the result, not that the database will internally limit the number of rows it realizes.

Resource pooling

There are several types of resources on the JVM that are rather expensive to initialize. Examples are HTTP connections, execution threads, JDBC connections, and so on. The Java API recognizes such resources and has built-in support for creating a pool of some of those resources so that the consumer code borrows a resource from a pool when required and at the end of the job simply returns it to the pool. Java's thread pools and JDBC data sources are prominent examples. The idea is to preserve the initialized objects for reuse. Even when Java does not support pooling of a resource type directly, you can always create a pool abstraction around custom expensive resources.
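As a minimal sketch of Java's built-in thread pooling used from Clojure (the pool size and the task are placeholders):

(import '(java.util.concurrent Executors ExecutorService TimeUnit))

;; a fixed pool of four threads, initialized once and reused for every task
(def ^ExecutorService pool (Executors/newFixedThreadPool 4))

;; borrow a pooled thread by submitting work, instead of creating
;; a new thread per task
(.submit pool ^Runnable #(println "ran on" (.getName (Thread/currentThread))))

;; release the pooled threads when the application shuts down
(.shutdown pool)
(.awaitTermination pool 10 TimeUnit/SECONDS)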

The pooling technique is common in I/O activities, but it can be equally applicable to non-I/O purposes where the initialization cost is high.

Summary

Designing an application for performance should be based on the use cases and the patterns of anticipated system load and behavior. Measuring performance is extremely important for guiding optimization along the way. Fortunately, there are several well-known optimization patterns to tap into, such as data sizing, reduced serialization, chunking, and resource pooling. In this article, we analyzed application-level performance through these patterns.


About the Author


Shantanu Kumar

Shantanu Kumar is a software developer living in Bangalore, India, with his wife. He started learning programming in 1991, using BASIC on MS-DOS when he was at school. There, he developed a keen interest in x86 hardware and assembly language, and he dabbled in it for a good while. Later, he programmed professionally in various business domains and technologies while working with the Indian Air Force and several IT companies.

In recent years, Shantanu has worked on high performance and distributed systems. Having used Java for a long time, he discovered Clojure in early 2009 and has been a fan ever since. Clojure's pragmatism and fine-grained orthogonality continue to amaze him, and he believes he is a better developer because of it.

When not busy with programming or reading up on technical subjects, he enjoys reading non-fiction, riding his bike, and occasionally just lazing in his free time. Shantanu is an active participant in the Bangalore Clojure users group and develops several open source Clojure projects on GitHub.
