The History of Data Storage – From the Caves to the Cloud
Data: a critical, life-changing, and fundamental asset that supports humanity’s existence and evolution. For thousands of years (yes, thousands!), data storage solutions have evolved and supported humans by allowing us to “remember” and share knowledge in easy, maintainable, and searchable ways. Data turns into information, which in turn becomes knowledge. The ability to learn from the past and plan for the future is highly influenced by how we manage data in our systems today.
Software engineers are the catalysts of this process: our responsibility is to define and deliver solutions to people’s problems through software engineering – solutions that mostly revolve around data manipulation at a large or small scale. Having understood the importance of persistence in software engineering, you’re ready to bring your solutions’ persistence to the next level.
In this chapter, we will explore the modern era, where databases have become the backbone of our applications and the entire planet. We will cover the following topics:
- Why do databases exist? The history of databases
- Characteristics of Java persistence frameworks
- The cloud’s effect on stateful solutions
- Exploring the trade-offs of distributed database systems – a look into the CAP theorem and beyond
This first chapter provides you with an understanding of the past and current states of data storage technologies, before moving on to more advanced topics. This will give you a better foundation to work from. You will learn how data storage technologies responded to the market’s cloud-shift mentality. Finally, you will become familiar with practices such as Domain-Driven Design (DDD), which perfectly ties in with good persistence development practices, and the challenges faced by distributed data systems that await us in a distributed world, such as the CAP theorem.
Why do databases exist?
A comprehensive understanding of databases is impossible without delving into humanity’s history. The desire to preserve knowledge throughout time has made writing one of the most enduring technologies, and looking back, it was first used in temples and caves, which can be recognized as the first non-computational databases of humankind.
Today, the industry emphasizes accurate and well-recorded information. In fact, as ever more people gain access to technology and join the global network of information, research suggests that the amount of data in the world doubles roughly every two years.
The history of modern databases began in 1960, when Charles Bachman designed the first database for computers, the Integrated Data Store (IDS), a predecessor to IBM’s Information Management System (IMS).
A decade later, in 1970, one of the most significant events in the history of databases occurred: E. F. Codd published his paper A Relational Model of Data for Large Shared Data Banks, giving rise to the term relational database.
Finally, as the next and probably most recent breakthrough in terms of data storage, came NoSQL, which refers to any non-relational database. Some say NoSQL stands for Non-SQL, while others say it stands for Not Only SQL.
NoSQL databases power some of the most popular online applications. Here are a few:
- Google: Google uses NoSQL Bigtable for Google Mail, Google Maps, Google Earth, and Google Finance
- Netflix: Netflix likes the high availability of the NoSQL database and uses a combination of SimpleDB, HBase, and Cassandra
- Uber: Uber uses Riak, a distributed NoSQL database with a flexible key-value store model
- LinkedIn: LinkedIn built its own NoSQL database called Espresso, which is a document-oriented database
The challenges of handling data
The evolution of database systems has been marked by key milestones over the decades. In the early days, when storage was expensive, the challenge was finding ways to reduce information waste. Saving even one million dollars’ worth of storage was a significant achievement.
Did you know?
At the dawn of the database era, a megabyte used to cost around 5 million dollars!
Today, cost per megabyte is no longer the challenge, as storage now runs at around $0.001/MB. As time passed and storage became cheaper, the methods for reducing duplicate data started to negatively impact application response times: normalization reduced duplication, but the multiple join queries it required over massive amounts of data did not help performance.
It’s no surprise that challenges to this model would eventually emerge. As noted by the esteemed and respected authors of the book Fundamentals of Software Architecture (https://www.amazon.com/dp/1492043451/), definitive solutions don’t exist; instead, we are presented with many solutions where each is accompanied by its own set of benefits and drawbacks.
Obviously, the same applies to databases.
There is no one-size-fits-all solution when it comes to data storage solutions.
In the 2000s, new storage solutions, such as NoSQL databases, began to gain popularity and architects had more options to choose from. This doesn’t mean that SQL stopped being relevant, but rather that architects must now navigate the complexities of choosing the right paradigm for each problem.
As the database landscape went through these phases, the application landscape changed as well. Discussions moved toward the motivations and challenges of adopting a microservices architecture style, bringing us back to the multiple persistence strategies available. Traditionally, architectures included relational database solutions with one or two instances (given their high cost). Now, as new storage solutions mature, architectural solutions start to include persistence based on NoSQL databases, scaling up to multiple running instances. The possibility of storing data in multiple ways, throughout the different services that compose a single broader solution, creates fertile ground for polyglot persistence.
Polyglot persistence is the idea that computer applications can use different database types to take advantage of the fact that various engine systems are better equipped to handle different problems. Complex applications often involve different types of problems, so choosing the right tool for each job can be more productive than trying to solve all aspects of the problem using a single solution.
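To make the idea concrete, here is a minimal, runnable sketch in plain Java. The two stores are in-memory stand-ins invented for illustration – not real database clients – each modeling the access pattern its paradigm handles best:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Polyglot persistence sketch: one application, two storage paradigms.
public class PolyglotSketch {

    // Key-value stand-in: sessions only ever need fast lookups by key.
    static class SessionStore {
        private final Map<String, String> kv = new HashMap<>();
        void put(String sessionId, String userId) { kv.put(sessionId, userId); }
        Optional<String> userFor(String sessionId) {
            return Optional.ofNullable(kv.get(sessionId));
        }
    }

    // Document stand-in: products are whole documents, queried by field.
    static class ProductCatalog {
        private final List<Map<String, Object>> docs = new ArrayList<>();
        void insert(Map<String, Object> doc) { docs.add(doc); }
        List<Map<String, Object>> byCategory(String category) {
            return docs.stream()
                       .filter(d -> category.equals(d.get("category")))
                       .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        SessionStore sessions = new SessionStore();
        sessions.put("s-42", "user-1");

        ProductCatalog catalog = new ProductCatalog();
        catalog.insert(Map.of("sku", "kb-1", "category", "keyboards"));
        catalog.insert(Map.of("sku", "ms-1", "category", "mice"));

        System.out.println(sessions.userFor("s-42").orElse("none")); // user-1
        System.out.println(catalog.byCategory("keyboards").size());  // 1
    }
}
```

In a real polyglot solution, each stand-in would be replaced by a client for the engine best suited to that bounded context (for example, a key-value store for sessions and a document store for the catalog).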
When analyzing solutions today, we developers and architects are confronted with the complexity of choice. How do we handle data when the scenario involves multiple data types? To be clear, we’re talking about mixing and matching hundreds of possible solutions. The best path is to prepare by learning about persistence fundamentals, best practices, and paradigms. Finally, no matter how much we desire a fast, scalable, highly available, precise, and consistent solution, we now know that, according to the CAP theorem (a concept discussed later in this chapter), that may be impossible.
Next, we’ll narrow down our focus specifically to persistence within the context of Java applications.
Characteristics of Java persistence frameworks
Let’s start with the differences between the Java language and the multiple databases available. Java, an Object-Oriented Programming (OOP) language, naturally offers features such as inheritance, encapsulation, and types, which support the creation of well-designed code. Unfortunately, not all of these features are supported by database systems.
As a consequence, when integrating both language and database paradigms, some of their unique advantages might get lost. This complexity becomes clear when we observe that in all data manipulation between in-memory objects and the database schema, there should be some data mapping and conversion. It is critical to either define a preferred approach or provide an isolation layer. In Java, the most systematic way to integrate both worlds is through the usage of frameworks. Frameworks come in various types and categories shaped by their communication levels and the provided API dynamics. In Figure 1.1, observe the key aspects of both concepts:
Figure 1.1 – Considerations about the different characteristics of a Java persistence framework
- Communication levels: Define how close the code is to either the database or the OOP paradigm. The code can be designed to resemble one of the two domains more closely. To clarify, consider two common approaches for integrating a Java app with a database – using a database driver directly or relying on the mapper pattern:
- Directly adopting a driver (e.g., a JDBC driver) means working closer to the database domain space. A database driver that is easy to work with is usually data-oriented. The downside is the extra boilerplate code needed to map and convert all manipulated data between the database model and the Java domain objects.
- The mapper pattern takes the opposite approach, mapping a database structure to Java objects. In the context of mapping frameworks such as Hibernate and Panache, the primary objective is to align more closely with the OOP paradigm rather than with the database. While offering the benefit of reduced boilerplate code, the trade-off is coexisting with a constant object-relational impedance mismatch and its consequent performance impacts. This topic will be covered in more detail in further chapters.
- API abstraction levels: To abstract some level of translation between Java and the database during data manipulation and other database interactions, developers rely on a given Java API. To clarify the abstraction level of an API, you can ask, for example, “How many different database types does a given database API support?” When using SQL as a standard for relational database integration, developers can use a single API and integrate it with all relational database flavors. There are two types of APIs:
- A specific API may offer more accurate updates from the vendor, but it also means that any solution that relies on that API will need to be changed if you ever want to switch to a different database (e.g., Morphia or Neo4j-OGM – OGM stands for Object Graph Mapper)
- An agnostic API is more flexible and can be used with many different types of databases, but it can be more challenging to manage updates or particular behaviors for each one
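To make the two communication levels concrete, here is a runnable sketch in plain Java. A `Map<String, Object>` stands in for a JDBC `ResultSet` row so no database is required; `Person`, `mapRow`, and `mapRecord` are invented names, and the reflective mapper is only a rough approximation of the work frameworks such as Hibernate do for you:

```java
import java.lang.reflect.RecordComponent;
import java.util.Map;

// Contrast of the two communication levels. A Map<String, Object> stands
// in for a JDBC ResultSet row; real driver code would call
// rs.getLong("id"), rs.getString("name"), and so on.
public class MappingSketch {

    record Person(long id, String name) {}

    // Driver-style: explicit, per-entity mapping. Simple and predictable,
    // but this boilerplate must be repeated for every entity.
    static Person mapRow(Map<String, Object> row) {
        return new Person((Long) row.get("id"), (String) row.get("name"));
    }

    // Mapper-style: one generic routine, closer to the OOP domain. Very
    // roughly the kind of mapping an ORM framework automates.
    static <T extends Record> T mapRecord(Class<T> type, Map<String, Object> row)
            throws ReflectiveOperationException {
        RecordComponent[] components = type.getRecordComponents();
        Object[] args = new Object[components.length];
        Class<?>[] types = new Class<?>[components.length];
        for (int i = 0; i < components.length; i++) {
            args[i] = row.get(components[i].getName());   // match by column name
            types[i] = components[i].getType();
        }
        return type.getDeclaredConstructor(types).newInstance(args);
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Map<String, Object> row = Map.of("id", 1L, "name", "Ada");
        System.out.println(mapRow(row));                  // Person[id=1, name=Ada]
        System.out.println(mapRecord(Person.class, row)); // Person[id=1, name=Ada]
    }
}
```

The explicit `mapRow` keeps you close to the database domain; the generic `mapRecord` hides the column-to-field translation, which is convenient but is also where the object-relational impedance mismatch (and its performance cost) starts to accumulate.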
Code design – DDD versus data-oriented
In the renowned book Clean Code, the author, known as Uncle Bob, states that OOP languages have the benefit of hiding data in order to expose behavior. In the same line of thought, we see DDD, which proposes the usage of a ubiquitous language throughout the domain’s code and related communication. Such a proposal can be implemented through OOP concepts. In Data-Oriented Programming, Yehonathan Sharvit suggests simplifying complexity by giving relevance to data and treating it as a “first-class citizen.”
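The two styles can be contrasted in a short, runnable sketch (the `Account` names and amounts are invented for illustration):

```java
// Side-by-side sketch of the two code-design styles.
public class StylesSketch {

    // DDD/OOP style: data is hidden; behavior is exposed through the
    // ubiquitous language of the domain ("debit", not "setBalance").
    static class Account {
        private long balanceInCents;
        Account(long initial) { this.balanceInCents = initial; }
        void debit(long cents) {
            if (cents > balanceInCents) throw new IllegalStateException("insufficient funds");
            balanceInCents -= cents;
        }
        long balance() { return balanceInCents; }
    }

    // Data-oriented style: data is immutable, transparent, and first-class;
    // behavior lives in plain functions that return new values.
    record AccountData(long balanceInCents) {}

    static AccountData debit(AccountData account, long cents) {
        if (cents > account.balanceInCents()) throw new IllegalStateException("insufficient funds");
        return new AccountData(account.balanceInCents() - cents);
    }

    public static void main(String[] args) {
        Account oo = new Account(10_00);
        oo.debit(3_00);                                        // mutates hidden state
        AccountData data = debit(new AccountData(10_00), 3_00); // returns a new value
        System.out.println(oo.balance() + " " + data.balanceInCents()); // 700 700
    }
}
```

Neither style is “right”; the OOP version protects its invariants behind behavior, while the data-oriented version makes state trivially easy to map, serialize, and persist.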
Luckily, there are several frameworks to assist us in the challenges of delivering performant persistence layers. Although we understand that more options bring back the paradox of choice, there’s no need to worry – this book is a helpful resource that software engineers can use to learn how to evaluate multiple perspectives within software architecture, especially the details within the data storage integration and data manipulation space.
So far, we have explored the diverse methods that we humans have devised to address a fundamental issue: efficiently storing data in a manner that ensures longevity and serves as a knowledge base to support our evolution. As technology has advanced, multiple persistence strategies have been made available to software architects and developers, including relational and unstructured approaches such as NoSQL. The variety of persistence options has resulted in new challenges in software design; after all, retrieving, storing, and making data available also went through innovation at the application layer. Persistence frameworks, since then and still today, provide architects with different strategies, enabling designs where development is closely associated with the underlying database technology or is more dynamic and agnostic.
Our next stop on this database historical journey is the cloud era. Let’s explore how cloud offerings have impacted applications and the ways and locations where data can now be stored.
The cloud’s effect on stateful solutions
When it comes to databases, professionals need to have an operational perspective in addition to an infrastructure and software architecture perspective. There are several factors to consider regarding a solution’s architecture and the required compliance, such as networking, security, cloud backup, and upgrades.
Fortunately, we can use the help of cloud services. The cloud, as a technology-related concept, has been defined by the National Institute of Standards and Technology (NIST) as a model that enables the consumption, on-demand and via a network, of a shared set of computing resources that are rapidly made available.
You might have heard a joke in tech communities that says that “the cloud is just somebody else’s computer.” However, we believe there’s more to the cloud than that; we prefer to look at the cloud as follows:
The main goal of adopting cloud services is to outsource non-core business functions to somebody else. This way, we can focus on our core competencies.
As you read through the book, you’ll notice several acronyms are used. In this chapter, we mostly refer to the following cloud service offering types: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
Even though you might feel like cloud services could finally be the solution to numerous technical problems you’ve gone through, remember that delegated responsibilities and tasks can also go very differently from what you expected – for example, services crashing or costs skyrocketing. Since we’re discussing the action of “delegating a problem to somebody else,” here are three types of cloud services (three ways to “delegate”) and their respective target audiences:
- IaaS: Infrastructure is not your problem. The target audience is people who work on the operation side, such as SREs.
- PaaS: The infrastructure and operation are not your problems. The main target audience is software engineers.
- SaaS: The infrastructure, operation, and software are not your problem. In this case, the target audience is the end user, who doesn’t necessarily know how to code.
As we previously pointed out in this chapter, every solution’s trade-offs must be considered. Picking the PaaS cloud offering as an example: this model offers a higher level of abstraction in exchange for a bit of a higher price tag.
What about cloud offerings for data storage, then? As pointed out by Dan Moore in the book 97 Things Every Cloud Engineer Should Know (https://www.amazon.com/dp/1492076732), databases can also be consumed as managed cloud services. With a managed database service, someone else (a vendor) provides a service that abstracts most of (and in some cases, all of) the database infrastructure and management tasks.
Cloud services can be helpful when we need to explore various architectural persistence solutions and delegate complexity. They have been widely adopted and proven to be effective in serving this purpose.
With the adoption of cloud offerings and microservices architecture, distributed solutions are becoming more prevalent. Architects then have to handle new challenges related to data integrity and unexpected data inconsistencies in applications that must meet such requirements.
Exploring the trade-offs of distributed database systems – a look into the CAP theorem and beyond
If the perfect Distributed Database System (DDBS) were to be described, it would certainly be a database that was highly scalable, provided perfectly consistent data, and didn’t require too much attention in regard to management (tasks such as backup, migrations, and managing the network). Unfortunately, the CAP theorem, formulated by Eric Brewer, states that that’s not possible.
To date, there is no database solution that can provide the ideal combination of features such as total data consistency, high availability, and scalability all together.
For details, check Eric Brewer, Towards Robust Distributed Systems, PODC 2000, DOI: 10.1145/343477.343502 (https://www.researchgate.net/publication/221343719_Towards_robust_distributed_systems).
The CAP theorem is a way of understanding the trade-offs between different properties of a DDBS. Eric Brewer, at the 2000 Symposium on Principles of Distributed Computing (PODC), conjectured that when creating a DDBS, “you can have at most two of these properties for any shared-data system,” referring to the properties consistency, availability, and tolerance to network partitions.
Figure 1.2 – Representation inspired by Eric Brewer’s keynote presentation Towards Robust Distributed Systems. For more information on Eric Brewer’s work, refer to Brewer, Eric (2000), presentation: https://people.eecs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf.
The three properties covered by the CAP theorem can be described as follows:
- Consistency: The guarantee that every node in a distributed cluster returns the same, most recent, successful write.
- Availability: Every non-failing node returns a response for all read and write requests in a reasonable amount of time.
- Partition tolerance: The system continues to function and uphold its consistency guarantees despite network partitions. In other words, the service is running despite crashes, disk failures, database, software, and OS upgrades, power outages, and other factors.
In other words, the DDBSes we can pick and choose from would only be CA (consistent and highly available), CP (consistent and partition-tolerant), or AP (highly available and partition-tolerant).
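A toy model can make the trade-off tangible. In this sketch (all names are invented; a real system involves quorums, clocks, and retries), a replica cut off by a partition must either refuse to answer – the CP choice – or answer with possibly stale data – the AP choice:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// A toy two-replica model of the CAP trade-off during a network partition.
public class CapSketch {

    enum Mode { CP, AP }

    static class Replica {
        private final Map<String, String> data = new HashMap<>();
        boolean partitioned = false;   // is the replication link down?
        private final Mode mode;
        Replica(Mode mode) { this.mode = mode; }

        // Writes reach this replica only while the link is up.
        void replicate(String key, String value) {
            if (!partitioned) data.put(key, value);
        }

        Optional<String> read(String key) {
            if (partitioned && mode == Mode.CP) {
                // CP choice: give up availability rather than serve stale data.
                return Optional.empty();
            }
            // AP choice: stay available, possibly returning stale data.
            return Optional.ofNullable(data.get(key));
        }
    }

    public static void main(String[] args) {
        Replica cp = new Replica(Mode.CP);
        Replica ap = new Replica(Mode.AP);
        cp.replicate("user:1", "v1");
        ap.replicate("user:1", "v1");

        cp.partitioned = true;
        ap.partitioned = true;
        cp.replicate("user:1", "v2"); // lost: the update never arrives
        ap.replicate("user:1", "v2"); // lost here too

        System.out.println("CP read: " + cp.read("user:1")); // Optional.empty (unavailable)
        System.out.println("AP read: " + ap.read("user:1")); // Optional[v1] (stale but served)
    }
}
```

The same read during the same partition yields no answer from the CP replica and a stale answer from the AP replica, which is exactly the choice the theorem says a partitioned system cannot avoid.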
As stressed in the book Fundamentals of Software Architecture: An Engineering Approach, good software architecture requires dealing with trade-offs. This is yet another trade-off to take into consideration (https://www.amazon.com/Fundamentals-Software-Architecture-Engineering-Approach-ebook/dp/B0849MPK73/).
By considering the CAP theorem, we can apply this new knowledge to back up our decision-making when choosing between SQL and NoSQL. For example, traditional DBMSes thrive by providing the Atomicity, Consistency, Isolation, and Durability (ACID) properties; however, in distributed systems, it may be necessary to give up consistency and isolation in order to achieve higher availability and better performance. This is commonly known as sacrificing consistency for availability.
Two years after CAP was proposed, in 2002, Seth Gilbert and Nancy Lynch of MIT published a formal proof of Brewer’s conjecture. Later, Daniel Abadi, another expert on database system architecture and implementation, researched scalable and distributed systems and added the consistency/latency trade-off to the discussion around the theorem.
In 2012, Prof. Daniel Abadi published a study stating CAP has become “increasingly misunderstood and misapplied, causing significant harm” leading to unnecessarily limited Distributed Database Management System (DDBMS) creation, as CAP only presents limitations in the face of certain types of failures – not during normal operations.
Abadi’s paper Consistency Tradeoffs in Modern Distributed Database System Design proposes a new formulation, PACELC, whose letters spell out the trade-offs it covers. The following question quoted in the paper clarifies the main idea: “If there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?”
According to Abadi, a distributed database can be both highly consistent and highly performant only under certain conditions: even in the absence of partitions, a system must still choose how much latency it is willing to pay for consistency, and different systems tune this trade-off differently.
At this point, the intricacies of building database systems, particularly distributed ones, have been made crystal clear. As professionals tasked with evaluating and selecting DDBSes and designing solutions on top of them, having a fundamental understanding of the concepts discussed in these studies serves as a valuable foundation for informed decision-making.
Any software application relies heavily on its database, so it’s important to give it the attention it deserves. In this chapter, we explored the interesting history of data storage, from its early days to the modern era of cloud computing. Throughout this journey, we witnessed the impacts of data storage evolution on the field of software engineering, and how Java frameworks have also evolved to be able to support polyglot solutions. As experienced software engineers, it is crucial for us to understand the importance of data and solutions that can manage and manipulate it effectively.
Adding to that, we discussed the challenges of relational databases, such as data redundancy and normalization, and how NoSQL databases emerged to handle unstructured data needs. We introduced the CAP theorem and mentioned additional studies, such as PACELC, to explain the challenges of implementing distributed data storage solutions.
As we continue through this book, we’ll delve deeper into the advanced architectural and development practices, challenges, and trade-offs you must know about in order to deliver the optimal persistence layer for each solution you work with from now on. After taking a look at the history, motivation, and relationship between databases and Java, get ready to explore, in the next chapter, the different types of databases and their pros and cons.