If you have an interest in the computing field of databases, business intelligence, or in-memory technology, you'll most probably have come across the term SAP HANA several times over the last couple of years. Over the course of the next 300 pages or so, we'll be taking a journey together through the SAP HANA development landscape, looking at most of the functions SAP HANA makes available to us as developers, so that we can give our users a new outlook on their data and their daily workload.
The book is split into four main sections:
Chapters 1 and 2: Introduction and installation
Chapters 3, 4, 5, and 6: Database development in SAP HANA
Chapters 7, 8, 9, and 10: Ancillary functions useful to most developments
Chapter 11: The XS Engine for website development
In this chapter we'll take a fairly high-level look at SAP HANA and the technology that underlies the system. We'll start with a couple of comparisons to more traditional database systems, and then examine each of the advances in technology that have made SAP HANA's performance possible.
But what really is SAP HANA? The simplest answer to this question is that SAP HANA is a relational database system, just like Oracle, SQL Server, or MySQL. That's it? Really? Not very exciting, when you think about it.
Of course, if that was the whole answer, then this book would be very short, so what else is there to tell?
Let's start with a very simple example of what SAP HANA can do for its users.
SAP markets its HANA database as being really, really fast. But how fast is it really, when there is a large amount of data in the system? The author decided to do a couple of tests to find out. Rather than creating some "fake" data, it was decided that using publicly available information would be better. This means that anyone else wanting to do some performance tests can repeat the experiment.
The US government, as part of its "open" data initiative of making data available for analysis, released all the data concerning commercial flights inside the USA, starting in 1988 and going right through to the present day. You can find this at http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time.
After a couple of days of downloading the data (49 GB of CSV files), tidying it up a bit, and then loading it into SAP HANA and MySQL, it was possible to do a handful of comparisons on the two different database systems. To be fair, it has to be said that the MySQL machine has only 8 GB of memory, but it has an SSD, while the SAP HANA machine has 128 GB, but "real" disks.
As an indication of SAP HANA's data loading performance, each data file of around 1.5 GB and around five million lines, loaded using SAP HANA's native IMPORT FROM CSV FILE instruction, took approximately 45 seconds on average. That works out to around 100,000 lines loaded every second.
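For reference, such an import statement looks something like the following. The file path, and the THREADS and BATCH tuning options, are illustrative only; the exact values will depend on your own files and hardware:

IMPORT FROM CSV FILE '/data/ontime/1988_1.csv' INTO FLIGHT.ONTIME WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY ',' THREADS 4 BATCH 10000;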
The first thing to note is the sheer data volume—we're talking about over 140 million lines of data, which is a reasonable dataset size. Many companies have this much data in their orders or stock movement tables, so doing a couple of stress tests on this dataset lets us extrapolate the performance we can expect when we use SAP HANA for large applications.
Without going through every data field available, we have some interesting information, such as the origin and destination airports, as well as the flight delay at departure and on arrival. With these fields, we can perform some interesting queries.
We'll start by taking a look at something relatively simple: count the number of flights leaving Los Angeles airport, per year. The SQL statement to achieve this is as follows; the syntax is the same in any relational database. Don't worry, we won't be spending 300 pages writing SQL code.
SELECT YEAR, COUNT(*) FROM FLIGHT.ONTIME WHERE ORIGIN = 'LAX' GROUP BY YEAR
First up, let's see how MySQL fares with this query (age before beauty, as they say):

Two minutes for MySQL. Not bad at all. We could probably optimize this, at least a little, by creating an index on the ORIGIN and YEAR columns, though.
Now, how does SAP HANA fare with the same query on the same dataset?

Well, the results of the SQL are the same—which proves no cheating is going on. The time taken to resolve the query on SAP HANA is slightly under 800 milliseconds, around 160 times faster than MySQL. Impressive stuff.
How about something a little more taxing? Let's count the number of flights per day of the week, and then sort them by the number of results.
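The statement itself isn't shown in the screenshots that follow, but assuming the column is named DAYOFWEEK, it would look something like this:

SELECT DAYOFWEEK, COUNT(*) FROM FLIGHT.ONTIME GROUP BY DAYOFWEEK ORDER BY COUNT(*) DESC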
To see how MySQL fares with this query, check the following screenshot:

Not too shabby, I suppose. MySQL works through the full 140 million-line dataset in less than three minutes. How much time does SAP HANA take for the same query?

Once again, we can see that SAP HANA returns the result in a couple of seconds, where MySQL needs nearly three minutes to do the same.
If we were to summarize these results, we could say that with SAP HANA:
The machine, working more efficiently, is busy for less time, and can therefore do more work
With a more efficient database such as SAP HANA, your servers will be able to do even more work, getting necessary information into your users' hands faster, enabling them to spend more time analyzing data, rather than waiting for it.
We've just glanced at the simplest usage of SAP HANA: a "standard" (if rather fast) database. SAP HANA can be used to replace legacy databases in more or less any situation: the system provides connectors for several standard database-access protocols (notably ODBC and JDBC), so just by unplugging your MySQL, PostgreSQL, or SQL Server database and replacing it with SAP HANA, you'll get immediate speedups.
We've seen that SAP HANA is very fast, but how exactly is this performance achieved? In this section we'll take a look at some of the characteristics of a SAP HANA server that make the results we have just seen possible.
The first thing that SAP HANA does to speed up SQL queries is to break them down into manageable chunks. The dataset we used in the previous examples contained 140 million lines, and the server the SAP HANA database was running on had 12 processor cores. To speed up the queries, SAP HANA split the data into 12 chunks (of just under 12 million lines each) and ran 12 subqueries, all at the same time. Once the results from each subquery came in, SAP HANA combined them to provide the overall result.
The advantages of working in this way are immediate: today's computers, especially servers, have several CPUs, each of which has many CPU cores. By putting as many cores to work as possible, SAP HANA maximizes the power which the server can provide to every incoming query.
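SAP HANA manages this splitting automatically, but for very large tables you can also control the division of data explicitly by partitioning the table yourself. Here is a minimal sketch, using a hypothetical copy of our flight table with abbreviated columns:

CREATE COLUMN TABLE FLIGHT.ONTIME_SPLIT (FLIGHTDATE INTEGER, ORIGIN NVARCHAR(3), DESTINATION NVARCHAR(3), FLIGHTTIME INTEGER) PARTITION BY ROUNDROBIN PARTITIONS 12;

Each of the 12 partitions can then be scanned by a separate core (or even by a separate host, in a scale-out installation).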
Most traditional databases claim to work in a similar manner—they run a handful of processes, and so are able to serve several queries at the same time. But there's a critical difference: no single query gets the full power of the server. In our example, the only time the server would be fully used is when 12 queries arrive at exactly the same time (and when a query takes only 800 milliseconds to serve, as in SAP HANA, this would be relatively rare).
If we imagine a server as an image, it might look like the following image for a query on a legacy database:

We can see that the server runs the query on one of the CPU cores, and the other 11 sit idle waiting for work.
The same query running on SAP HANA might be shown something like this:

The query is running on all 12 CPU cores, making the most possible use of the server's power—and returning results around 12 times faster.
If parallel execution of queries were the only innovation, then on our example server, even in the worst case (all 12 queries arriving at exactly the same moment), SAP HANA would return results faster for 11 out of every 12 queries: each query runs in a twelfth of the time but is handled in turn, so only the very last one finishes no sooner than before. In all other scenarios, SAP HANA would systematically return query results faster than the legacy database.
Of course, parallel execution isn't the only trick up SAP HANA's sleeve, there's also….
In any given database, no matter how sophisticated, simple, fast, or slow it is, when it is boiled down, you end up with files of information on the computer's hard disk. The speed at which the database program can get at, interpret, and generally unravel these files directly impacts the database's perceived speed.
Let's picture a table in a database system. This table is a simplified (just for comprehension's sake) version of the table used in the previous section, concerning the US flight records.
Date | DayOfWeek | Origin | Destination | FlightTime | DepDelay | ArrDelay
---|---|---|---|---|---|---
19880101 | 5 | LAX | JFK | 305 | 10 | 15
19880102 | 6 | LAX | JFK | 301 | 3 | 4
19880103 | 7 | LAX | JFK | 311 | -4 | 7
19880104 | 1 | LAX | JFK | 282 | 37 | 15
19880105 | 2 | LAX | JFK | 300 | 0 | 0
19880106 | 3 | LAX | JFK | 294 | 4 | -2
19880107 | 4 | LAX | JFK | 284 | 16 | 0
In just about every database system used today (with some exceptions, mostly research projects at universities and a couple of relatively small open-source projects), data is stored in the files basically as you would expect, and might look something like this on disk:
198801015LAXJFK3051015198801026LAXJFK30134198801037LAXJFK311-47198801041LAXJFK2823715198801052LAXJFK30000198801063LAXJFK2944-2198801074LAXJFK284160
The fields have been laid out in the file as they appear in the table, starting with the first row, first field, then second field, and so on to the end of the first row. The operation is then repeated for the other rows in the table. While this is fairly unreadable to us, to a computer program, interpreting this file structure is no trouble at all.
Storing data in this way has one huge advantage for a database server: when you need to add a new line to a table, you just open the file, skip to the end, and write out your data. Adding data to the file is extremely fast.
Inversely, however, reading data can be relatively slow. If we ask the database for the time of the flight from LAX to JFK on the 6th of January 1988 (294 minutes), it has to read the data from the file, parsing it until it gets to the Date field that contains 19880106, then skip to the Origin field and check that it contains LAX, then to the Destination field, checking that it contains JFK, and finally to the FlightTime field to return the result.
Imagine the work necessary when there are millions of lines in the table.
Databases use indexes to speed up these searches. We might create an index on the FlightDate field, which would contain the address of the first character of the data for a given flight date. In the first line of the table, the FlightDate field starts at position 1 in the file. For the second line, it is position 23, and for the third line, position 43. Our index file might therefore look like this: 19880101-1, 19880102-23, 19880103-43, and so on.
This will speed up our data reads, especially if we create an index for FlightDate, Origin, and Destination in the case of the example query we've been looking at. The database system will open the index file (which is a lot simpler than the real data file), find the starting character of the needed data, then open the real data file, skip right to the correct place in the file, and read the data.
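In standard SQL (which SAP HANA also accepts), creating such a composite index might look like the following; the index name here is just illustrative:

CREATE INDEX IDX_ONTIME_SEARCH ON FLIGHT.ONTIME (FLIGHTDATE, ORIGIN, DESTINATION);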
Of course, this increase in speed when reading data has one large drawback: we now have two files to maintain when we insert data. The data file must be updated, and then any index files need to be updated too. The more indexes we create to improve reading performance, the more we degrade the writing performance of our database. This, traditionally, has been a big problem in databases used both for transaction creation (so-called OLTP, or online transaction processing) and reporting (OLAP, or online analytical processing). To put things simply, writing data and reading data are just not easily optimized at the same time on the same server.
If a query comes in for which there is no index, then the whole data file has to be read anyway, leading to the database administrator's worst nightmare: a "full table scan". This does just what its name suggests; it reads the whole data file (since no index exists, there might be another line for 19880106 right at the end of the file, but we won't know that until we've read the whole file).
As you've probably figured out by now, this storage system is not usually how SAP HANA does things. As well as being able to use row-based tables, SAP HANA usually stores data in a column-oriented manner.
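In SAP HANA, the storage type is chosen per table when it is created. The table names and column lists in this quick sketch are abbreviated examples:

CREATE ROW TABLE FLIGHT.ONTIME_R (FLIGHTDATE INTEGER, ORIGIN NVARCHAR(3));
CREATE COLUMN TABLE FLIGHT.ONTIME_C (FLIGHTDATE INTEGER, ORIGIN NVARCHAR(3));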
Consider a data table such as the following one existing in the database:
Row1-Field1 | Row1-Field2 | Row1-Field3
Row2-Field1 | Row2-Field2 | Row2-Field3
Row3-Field1 | Row3-Field2 | Row3-Field3
Rather than storing the data as:
Row1-Field1, Row1-Field2, Row1-Field3, Row2-Field1, Row2-Field2, and so on.
A column-oriented database stores data as:
Row1-Field1, Row2-Field1, Row3-Field1, Row1-Field2, Row2-Field2, and so on.
If we take another look at the example table we saw earlier in this chapter, that would give us a storage file something like this:
Note
This layout is shown for example purposes. It holds true for some column-oriented databases, but SAP HANA does not store its data files exactly like the following storage file; it takes the storage to another level, which will be explained in a moment.
198801011988010219880103198801041988010519880106198801075671234LAXLAXLAXLAXLAXLAXLAXJFKJFKJFKJFKJFKJFKJFK305301311282300294284103-43704161547150-20
What difference does this way of storing data make? When you read the table to do reporting, for example, all the values of a column are stored together, so if we need to find "all the rows for January 6th, 1988" then we don't need to look at the whole file, just the first part, which contains the first column. There's no need to read, skip forward, check, read, skip forward, and so on.
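For example, a query such as the following can be answered by reading the FlightDate column alone, ignoring the other six columns entirely (assuming the column is named FLIGHTDATE, as in our earlier index example):

SELECT COUNT(*) FROM FLIGHT.ONTIME WHERE FLIGHTDATE = 19880106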
SAP HANA takes this one step further, storing each column in a separate file:
Col1:19880101198801021988010319880104198801051988010619880107
Col2:5671234
Col3:LAXLAXLAXLAXLAXLAXLAX
Col4:JFKJFKJFKJFKJFKJFKJFK
Col5:305301311282300294284
Col6:103-4370416
Col7:1547150-20
If SAP HANA needs to answer the query we saw earlier (the time of the flight from LAX to JFK on the 6th of January 1988), then the server can again cut the query into several components:
FlightDate = 19880106
Origin = LAX
Destination = JFK
Information to return = FlightTime
These components can be run in parallel, one CPU handling each of the tests, finding the row numbers which match:
FlightDate = 19880106 – Matching row(s) = 6
Origin = LAX – Matching row(s) = 1, 2, 3, 4, 5, 6, 7
Destination = JFK – Matching row(s) = 1, 2, 3, 4, 5, 6, 7
Finally, all that needs to be done is to find the intersection of the sub-result sets (row 6), and return the final result (which is 294 minutes).
Of course, if reading is optimized, we might wonder whether writing to the database is now much slower. As it happens, by storing each column in a separate file, SAP HANA can provide write speeds close to the optimum we saw earlier in the row-based database system. When a new database row needs to be added to the table, SAP HANA just has to open each individual file, add the appropriate column value to the end of the file, and then close the file. This can even be done on the columns in parallel.
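To make that concrete, a single insert like the following (using the simplified seven-column version of our table) results in seven small append operations, one per column file, which SAP HANA can perform in parallel:

INSERT INTO FLIGHT.ONTIME VALUES (19880108, 5, 'LAX', 'JFK', 290, 2, -5);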
As we can see, storing data in columns can greatly speed up queries on the database, but column storage allows SAP HANA one more trick which would be difficult with a row-storage database and that is…
If we take another look at our example table (admittedly, the example is rather extreme), the ORIGIN and DESTINATION columns each contain seven values, and they're all the same. It's rather a shame to store all seven copies; it would be much more space-friendly to store each unique value once, along with a list of the rows it appears in, something like this:
Col1:19880101198801021988010319880104198801051988010619880107
Col2:5671234
Col3:LAX(7)
Col4:JFK(7)
Col5:305301311282300294284
Col6:103-4370416
Col7:1547150-20
The database can see that in column 3, there are seven occurrences of the value LAX. The space used by this column has just been divided by seven. Imagine the space savings on a database table of 200 million sales order lines, all of which are in USD. The savings in disk space, and therefore in the time it takes for the disk to find data, are potentially huge.
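If you're curious, SAP HANA lets you inspect the compression actually chosen for each column through the M_CS_COLUMNS monitoring view (view and column names here are as documented for recent SAP HANA revisions):

SELECT COLUMN_NAME, COMPRESSION_TYPE, MEMORY_SIZE_IN_TOTAL FROM M_CS_COLUMNS WHERE TABLE_NAME = 'ONTIME';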
It has to be noted that compression has an impact on performance, especially when data is added to the database. Column files need to be uncompressed, new data added, and then recompressed to maintain both compression quality and optimum read speed. In order not to penalize users when data is written to the database, SAP HANA keeps two separate versions of a column file. The main storage contains the compressed table contents, and the delta storage contains data freshly added to the database and which hasn't been compressed yet. When the system is idle, or when the delta storage represents a certain percentage of the main storage, the table is recompressed in the background.
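This recompression is known as a delta merge, and while SAP HANA normally schedules it automatically, it can also be triggered manually, which can be useful after a large data load:

MERGE DELTA OF FLIGHT.ONTIME;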
Of course, we could ask why data compression is such a big deal when disks are so cheap nowadays, and that's where SAP HANA's (probably) strongest card comes into play.
SAP HANA tries as much as possible not to serve data from hard disks. They're just too slow. Whenever a table is requested in a query, SAP HANA loads the whole table into memory, and from then on serves the query results directly from memory. Reading from memory has been, for many years, a few orders of magnitude faster than reading data from hard disks (SSDs have clawed back some of that gap, but they're still much, much slower than main memory).
Keeping all the data in memory means that accessing it is at least 150 times faster.
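Incidentally, this loading doesn't have to wait for the first query: SAP HANA provides a statement to explicitly load a table's columns into memory, for instance after a system restart:

LOAD FLIGHT.ONTIME ALL;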
Of course, this means that SAP HANA can't really be run on "just any old server". How many machines have you seen with, say, 128 GB of RAM? 256 GB? 1 TB? Not many, and that's why SAP HANA isn't sold as a piece of software you can "just install".
SAP only sells HANA as an appliance. You cannot purchase SAP HANA as a DVD set from SAP. You have to purchase an appliance, which is a certified server configuration from one of SAP's hardware partners, such as Dell, Cisco, or IBM. The hardware partner builds the machine to SAP's specifications, has it certified by SAP as conforming to the requirements of the SAP HANA database, and then installs SAP HANA on it. You can purchase the SAP HANA license from SAP, but the software and the hardware are actually sold together as a bundle: an appliance.
This method allows SAP much greater control over the SAP HANA ecosystem, guaranteeing a certain level of performance from the database, and keeping support calls to a reasonable level.
Since the servers are all configured to a certain level, SAP can more easily tell customers what performance they are likely to achieve from their SAP HANA installation.
Full information about SAP partners' available configurations for SAP HANA appliances is available from the sap.com website: go to http://www.sap.com/solutions/technology/in-memory-computing-platform/hana/partners/index.epx and click on the SAP HANA available hardware configurations link.
This document requires a SAP Service Marketplace login to download, but here is a sample screenshot:

You now have a general idea of what makes SAP HANA different from traditional database systems. SAP HANA is a complete system: a massively parallel, column-oriented, in-memory database appliance, offering several technological advances over other database systems in order to give superior performance in a number of use cases.
There are several other functions offered by SAP HANA, such as natively handled complex data models and an integrated development environment. These make the system a complete, integrated option, allowing businesses to rethink their database management, giving users faster access to their data, and ultimately allowing them to spend their time analyzing the data, not waiting for it.
In the next chapter, we'll be taking a closer look at the SAP HANA development environment, the SAP HANA Studio. This is an Eclipse-based tool giving access to all aspects of the SAP HANA appliance.