Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Practical MongoDB Aggregations
Practical MongoDB Aggregations

Practical MongoDB Aggregations: The official guide to developing optimal aggregation pipelines with MongoDB 7.0

By Paul Done
$71.99
Book Mar 2024 278 pages 1st Edition
eBook
$71.99
Print
$89.99
Subscription
$15.99 Monthly
eBook
$71.99
Print
$89.99
Subscription
$15.99 Monthly

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Buy Now

Product Details


Publication date : Mar 1, 2024
Length 278 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781835884362
Vendor :
MongoDB
Category :
Table of content icon View table of contents Preview book icon Preview Book

Practical MongoDB Aggregations

MongoDB Aggregations Explained

Getting insights from data stored in a database can be challenging, especially when there are millions or even billions of records to process.

In this chapter, you will learn how the MongoDB aggregation framework is designed to make mass data processing, analysis, and reporting intuitive and performant. Even though you may already be familiar with building basic aggregation pipelines, this chapter will lay a solid foundation to help you understand the mindset required for building more powerful, optimized aggregations for the real world.

By the end of this chapter, you will have a grasp of the following:

  • The purpose and design of the MongoDB aggregation framework
  • The MongoDB aggregation language's approach for building aggregation pipelines
  • Relevant use cases for the MongoDB aggregation framework
  • Suggestions for tools to use to run aggregation pipelines and how to get help if you get stuck

What is the MongoDB aggregation framework?

The MongoDB aggregation framework enables you to perform data processing and manipulation on the documents in one or more MongoDB collections. It allows you to perform data transformations and gather summary data using various operators for filtering, grouping, sorting, and reshaping documents. You construct a pipeline consisting of one or more stages, each applying a specific transformation operation on the documents as they pass through the pipeline. One of the common uses of an aggregation pipeline is to calculate sums and averages, similar to using SQL's GROUP BY clause in a relational database but tailored to the MongoDB document-oriented structure.

The MongoDB aggregation framework enables users to send an analytics or data processing workload—written using an aggregation language—to the database to execute the workload against the data it holds. The MongoDB aggregation framework has two parts:

  • An aggregation API provided by the MongoDB driver that you embed in your application. You define an aggregation pipeline in your application's code and send it to the database for processing.
  • The aggregation runtime in the database that receives the pipeline request from the application and executes the pipeline against the persisted data.

Figure 1.1 illustrates these two elements and their relationship:

Figure 1.1: MongoDB aggregation framework

Each driver provides APIs to enable an application to use both the MongoDB Query Language (MQL) and the aggregation framework. In the database, the aggregation runtime reuses the query runtime to efficiently execute the query part of an aggregation workload that typically appears at the start of an aggregation pipeline.

What is the MongoDB aggregation language?

MongoDB's aggregation pipeline language is somewhat of a paradox. It can appear daunting, yet it is straightforward. It can seem verbose, yet it is lean and to the point. It is Turing complete and able to solve any business problem. Conversely, it is a strongly opinionated domain-specific language (DSL); if you attempt to veer away from its core purpose of mass data manipulation, it will try its best to resist you.

Invariably, for beginners, the aggregation framework seems difficult to understand and comes with an initially steep learning curve that you must overcome to become productive. In some programming languages, you only need to master a small set of the language's aspects to be largely effective. With MongoDB aggregations, the initial effort you must invest is slightly greater. However, once mastered, users find it provides an elegant, natural, and efficient solution to breaking down a complex set of data manipulations into a series of simple, easy-to-understand steps.

The MongoDB aggregation pipeline language is focused on data-oriented problem-solving rather than business process problem-solving. It can be regarded as a functional programming language rather than a procedural programming language. Since an aggregation pipeline is an ordered series of statements, called stages, the entire output of one stage forms the entire input of the next stage, with no side effects. This functional nature is why many users regard the aggregation framework as having a steeper learning curve than many languages—not because it is inherently more difficult to understand but because most developers come from a procedural programming background and not a functional one. Most developers also have to learn how to think like a functional programmer to learn the aggregation framework.

The functional characteristics of the aggregation framework ultimately make it especially powerful for processing massive datasets. Users focus more on defining the what in terms of the required outcome and less on the how of specifying the exact logic to apply to achieve each transformation. You provide one specific and clearly advertised purpose for each stage in the pipeline. At runtime, the database engine can then understand the exact intent of each stage. For example, the database engine can obtain clear answers to the questions it asks, such as, "Is this stage for performing a filter or is this stage for grouping on some fields?" With this knowledge, the database engine has the opportunity to optimize the pipeline at runtime. Figure 1.2 shows an example of the database performing a pipeline optimization. It may decide to reorder stages to optimally use an index while ensuring that the output hasn't changed. Alternatively, it may choose to execute some steps in parallel against subsets of the data in different shards, reducing the response time while again ensuring the output hasn't changed.

Figure 1.2: Database performing a pipeline optimization

Last and least in terms of importance is the syntax. So far, MongoDB aggregations have been described here as a programming language. However, what syntax do you use to construct a MongoDB aggregation pipeline? The answer is it depends, and the answer is mostly irrelevant.

This book will highlight pipeline examples using MongoDB Shell and the JavaScript interpreter it runs in. The book will express aggregation pipelines using a JSON-based syntax. However, if you are using one of the many programming language drivers that MongoDB offers, you will be using that language to construct an aggregation pipeline, not JSON. To learn more about MongoDB drivers, see https://docs.mongodb.com/drivers/. An aggregation is specified as an array of objects, regardless of how the programming language may facilitate it. This programmatic rather than textual format has a couple of advantages compared to querying with a string. It has a low vulnerability to injection attacks, and it is highly composable.

What do developers use the aggregation framework for?

The aggregation framework is versatile and used for many different data processing and manipulation tasks. Some typical use cases include the following:

  • Generating business reports, which include roll-ups, sums, and averages
  • Performing real-time analytics to generate insight and actions for end users
  • Presenting real-time business dashboards with an up-to-date summary status
  • Performing data masking to securely obfuscate and redact sensitive data ready to expose to consumers via views
  • Joining data together from different collections on the server side rather than in the client application for improved performance
  • Conducting data science activities such as data discovery and data wrangling
  • Performing mass data analysis at scale (i.e., big data) as a faster and more intuitive alternative to technologies such as Hadoop
  • Executing real-time queries where deeper server-side data post-processing is required than what is available via default MongoDB Query Language
  • Navigating a graph of relationships between records, looking for patterns
  • Performing the transform part of an extract, load, transform (ELT) workload to transform data landed in MongoDB into a more appropriate shape for consuming applications to use
  • Enabling data engineers to report on the quality of data in the database and perform data-cleansing activities
  • Updating a materialized view with the results of the most recent source data changes so that real-time applications don't have to wait for long-running analytics jobs to complete
  • Performing full-text search and fuzzy search on data using MongoDB Atlas Search, see https://www.mongodb.com/atlas/search
  • Exposing MongoDB data to analytics tools that don't natively integrate with MongoDB via SQL, ODBC, or JDBC (using MongoDB BI Connector, see https://www.mongodb.com/docs/bi-connector/current/, or Atlas SQL, https://www.mongodb.com/atlas/sql)
  • Supporting machine learning frameworks for efficient data analysis (e.g., via MongoDB Spark Connector, see https://docs.mongodb.com/spark-connector)

A short history of MongoDB aggregations

MongoDB released the first major version of the database (version 1.0) in early 2009. Back then, users and the predominant company behind the database, MongoDB, Inc. (then called 10gen), were still establishing the sort of use cases the database would excel at and where the critical gaps were. Within half a year of this first major release, the engineering team at MongoDB identified an essential requirement to generate materialized views on demand. Users needed this capability to maintain counts, sums, and averages for their real-time client applications to query. By the end of 2009, in time for the following major release (1.2), the database engineers introduced a quick tactical solution to address this gap. This solution involved embedding a JavaScript engine in the database and allowing client applications to submit and execute server-side logic using a simple map-reduce-style API. Although from a functional perspective, the MongoDB map-reduce capability provided a solution to the typical data processing requirements of users, it came with some drawbacks:

  • The database used an inherently slow JavaScript engine to execute the user's code.
  • Users had to provide two sets of JavaScript logic: a map (or matching) function and a reduce (or grouping) function. Both were unintuitive to develop and lacked a solid data-oriented bias.
  • At runtime, the database could not determine the specific intent of an arbitrary piece of logic. The database engine had no opportunity to identify and apply optimizations. It couldn't easily target indexes or reorder logic for more efficient processing. The database had to be conservative, executing the workload with minimal concurrency and employing locks at various times to prevent race conditions.
  • If returning the response to the client application, rather than sending the output to a collection, the response payload had to be less than 16 MB.

Over the subsequent two years, MongoDB engineers envisioned a better solution as user behavior with the map-reduce capability became more understood. Given the ability to hold large datasets in MongoDB, users increasingly tried to use map-reduce to perform mass data processing. They were hitting the same map-reduce limitations. Users desired a more targeted capability leveraging a data-oriented DSL. The engineers saw how to deliver a framework enabling developers to define data manipulation steps with valuable composability characteristics. Each step would have a clearly advertised intent, allowing the database engine to apply optimizations at runtime. The engineers could also design a framework that would execute natively in the database and not require a JavaScript engine. In mid-2012, the database introduced the aggregation framework solution in the 2.2 version of MongoDB, which provided a far more powerful, efficient, scalable, and easy-to-use replacement to map-reduce.

Within its first year, the aggregation framework rapidly became the go-to tool for processing large volumes of data in MongoDB. Now, over a decade on, it is as if the aggregation framework has always been part of MongoDB. It feels like part of the database's core DNA. The old map-reduce capability in MongoDB is deprecated and offers no value nowadays. A MongoDB aggregation pipeline is always the correct answer for processing data in the database!

Aggregation capabilities in MongoDB server releases

The following is a summary of the evolution of the aggregation framework in terms of significant capabilities added in each major release of MongoDB from when the framework debuted in MongoDB 2.2:

  • MongoDB 2.2 (August 2012): Marked the initial release of the MongoDB aggregation framework
  • MongoDB 2.4 (March 2013): Focused predominantly on aggregation performance improvements, especially for sorting data, but also included a new string concatenation operator
  • MongoDB 2.6 (April 2014): Enabled unlimited-size result sets to be generated, explain plans to be viewed, the ability to spill aggregations to disk for large sorting operations, the ability to output aggregation results to a new collection, and the ability to redact data flagged as sensitive
  • MongoDB 3.0 (March 2015): Added nothing significant to aggregations apart from some new date-to-string operators
  • MongoDB 3.2 (December 2015): Incorporated many sharded cluster optimizations, added the ability to join data between collections, introduced the ability to sample data, and added many new arithmetic and array operators
  • MongoDB 3.4 (November 2016): Enabled graph relationships in data to be traversed, provided new bucketing and facet capabilities, and added many new array and string operators
  • MongoDB 3.6 (November 2017): Added the ability to convert arrays into objects and vice versa, introduced extensive date string conversion operators, and added the ability to remove a field conditionally
  • MongoDB 4.0 (July 2018): Included new number to conversion operators and the ability to trim strings
  • MongoDB 4.2 (August 2019): Introduced the ability to merge aggregation results into existing collections, added new set and unset stages to address the verbosity and rigidity of project stages, added support for Atlas Search, and included new trigonometry and regular expression operators
  • MongoDB 4.4 (July 2020): Added the ability to union data from multiple collections and define JavaScript functions and accumulator expressions, plus provided many new operators for string replacements, random number generation, and accessing the first and last elements of an array
  • MongoDB 5.0 (July 2021): Introduced the ability to perform operations across a sliding window of documents and added new date manipulation capabilities
  • MongoDB 6.0 (July 2022): Improved support for aggregations performing joining and graph traversing activities in sharded clusters, and added many new stages and operators for filling in missing records and fields, sorting array elements, and accessing subsets of arrays
  • MongoDB 7.0 (August 2023): Introduced a system variable to enable a pipeline to determine the identity of the calling user and their roles as well as providing new median and percentile operators

Getting going

You probably have a preferred tool for prototyping aggregation pipelines, having already explored the MongoDB aggregation framework before reaching for this book. However, suppose you are looking for alternatives. In that case, in the following section, you will find suggestions to get a MongoDB database and client tool up and running, ready to execute the example aggregations presented in this book.

Setting up your environment

To develop aggregation pipelines effectively, and to try the examples in Part 2: Aggregations by Example, you will need:

  • A MongoDB database, version 4.2 or greater, that is network accessible from your workstation
  • A MongoDB client tool running on your workstation to submit aggregation pipeline execution requests and view the results

Note

In Part 2: Aggregations by Example, most example aggregation pipelines are compatible with MongoDB version 4.2 and above. However, some examples utilize aggregation features introduced after version 4.2. For these, the book specifies the minimum MongoDB version required.

Database

The MongoDB database deployment for you to connect to can be a single server, a replica set, or a sharded cluster. You can run this deployment locally on your workstation, remotely on-premises, or in the cloud. You will need the MongoDB URL to connect to the database and, if authentication is enabled, the credentials required for full read and write access.

If you don't have access to a MongoDB database, the two most accessible options for running a database are as follows:

  1. Provision a free-tier MongoDB cluster (see https://www.mongodb.com/docs/atlas/tutorial/deploy-free-tier-cluster/) in MongoDB Atlas, which is a MongoDB cloud-based database as a service (once it's deployed, in the Atlas console, there is a button you can click to copy the URL of the cluster)
  2. Install and run a single MongoDB server (see https://docs.mongodb.com/guides/server/install/) locally on your workstation

Note

Aggregation pipelines in Chapter 13, Full-Text Search Examples, use Atlas Search. Consequently, you must use Atlas for your database deployment if you want to run the few Atlas Search-based examples.

Client tool

There are various options for the client tool, some of which are:

All examples in this book present code that is easy to copy and paste into MongoDB Shell, i.e., mongosh, to execute. All subsequent instructions in this book assume you are using the shell. However, you will find it straightforward to use one of the mentioned GUI tools instead, to execute the code examples.

MongoDB Shell with Atlas database

Here is how you can connect MongoDB Shell to an Atlas free-tier MongoDB cluster:

mongosh "mongodb+srv://mycluster.a123b.mongodb.net/test" --username myuser

Before running the command, ensure:

MongoDB Shell with local database

Here is the command for starting MongoDB Shell and connecting it to a MongoDB single-server database if you've installed MongoDB locally on your workstation:

mongosh "mongodb://localhost:27017"

MongoDB for VS Code

By using the MongoDB Playground tool in VS Code, you can quickly prototype queries and aggregation pipelines and execute them against a MongoDB database with the results shown in an output tab. Figure 1.3 shows the Playground tool in action:

Figure 1.3: MongoDB Playground tool in Microsoft Visual Studio Code

MongoDB Compass GUI

MongoDB Compass provides an Aggregation Pipeline Builder tool to assist users in prototyping and debugging aggregation pipelines and exporting them to different programming languages. You can see the aggregation tool in MongoDB Compass in Figure 1.4:

Figure 1.4: MongoDB Compass

Studio 3T GUI

Studio 3T provides an Aggregation Editor tool to help you prototype and debug aggregation pipelines and translate them to different programming languages. You can see the aggregation tool in Studio 3T in Figure 1.5:

Figure 1.5: Studio 3T

Getting further help

This book does not aim to document every possible option and parameter for the stages and operators that can constitute a MongoDB aggregation pipeline. That's what the MongoDB online documentation is for. Specifically, you should consult the following for help on the syntax of aggregation stages and operators:

If you are getting stuck with an aggregation pipeline and want some help, an active online community will almost always have the answer. So, pose your questions here:

You may be asking for just general advice. However, suppose you want to ask for help on a specific aggregation pipeline under development. In that case, you should provide a sample input document, a copy of your current pipeline code (in its JSON syntax format and not a programming language-specific format), and an example of the output that you are trying to achieve. If you provide this extra information, you will have a far greater chance of receiving a timely and optimal response.

Summary

This chapter explored the purpose and composition of the MongoDB aggregation framework and the pipeline language you use to express data processing tasks. The chapter provided insight into how MongoDB engineers designed and implemented the aggregation framework to process data at a large scale, laying the foundations for subsequent chapters where you will learn how to build richer and more optimal aggregations than you may have done before. You also learned how to find help if you get stuck and to set up your environment to run the aggregation pipelines this book provides.

In the next chapter, you will learn about the best way to construct your aggregation pipelines for composability and robustness, which is especially important when your data structures and aggregation pipelines evolve over time, which is the nature of all modern data-centric applications.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Build effective aggregation pipelines for increased productivity and performance
  • Solve common data manipulation and analysis problems with the help of practical examples
  • Learn essential strategies to aggregate time series data in financial datasets and IoT
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Officially endorsed by MongoDB, Inc., Practical MongoDB Aggregations helps you unlock the full potential of the MongoDB aggregation framework, including the latest features of MongoDB 7.0. This book provides practical, easy-to-digest principles and approaches for increasing your effectiveness in developing aggregation pipelines, supported by examples for building pipelines to solve complex data manipulation and analytical tasks. This book is customized for developers, architects, data analysts, data engineers, and data scientists with some familiarity with the aggregation framework. It begins by explaining the framework's architecture and then shows you how to build pipelines optimized for productivity and scale. Given the critical role arrays play in MongoDB's document model, the book delves into best practices for optimally manipulating arrays. The latter part of the book equips you with examples to solve common data processing challenges so you can apply the lessons you've learned to practical situations. By the end of this MongoDB book, you’ll have learned how to utilize the MongoDB aggregation framework to streamline your data analysis and manipulation processes effectively.

What you will learn

Develop dynamic aggregation pipelines tailored to changing business requirements Master essential techniques to optimize aggregation pipelines for rapid data processing Achieve optimal efficiency for applying aggregations to vast datasets with effective sharding strategies Eliminate the performance penalties of processing data externally by filtering, grouping, and calculating aggregated values directly within the database Use pipelines to help you secure your data access and distribution

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Buy Now

Product Details


Publication date : Mar 1, 2024
Length 278 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781835884362
Vendor :
MongoDB
Category :

Table of Contents

20 Chapters
Preface Chevron down icon Chevron up icon
Chapter 1: MongoDB Aggregations Explained Chevron down icon Chevron up icon
Part 1: Guiding Tips and Principles Chevron down icon Chevron up icon
Chapter 2: Optimizing Pipelines for Productivity Chevron down icon Chevron up icon
Chapter 3: Optimizing Pipelines for Performance Chevron down icon Chevron up icon
Chapter 4: Harnessing the Power of Expressions Chevron down icon Chevron up icon
Chapter 5: Optimizing Pipelines for Sharded Clusters Chevron down icon Chevron up icon
Part 2: Aggregations by Example Chevron down icon Chevron up icon
Chapter 6: Foundational Examples: Filtering, Grouping, and Unwinding Chevron down icon Chevron up icon
Chapter 7: Joining Data Examples Chevron down icon Chevron up icon
Chapter 8: Fixing and Generating Data Examples Chevron down icon Chevron up icon
Chapter 9: Trend Analysis Examples Chevron down icon Chevron up icon
Chapter 10: Securing Data Examples Chevron down icon Chevron up icon
Chapter 11: Time-Series Examples Chevron down icon Chevron up icon
Chapter 12: Array Manipulation Examples Chevron down icon Chevron up icon
Chapter 13: Full-Text Search Examples Chevron down icon Chevron up icon
Afterword Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon
Other books you may enjoy Chevron down icon Chevron up icon
Appendix Chevron down icon Chevron up icon

Customer reviews

Filter icon Filter
Top Reviews
Rating distribution
Empty star icon Empty star icon Empty star icon Empty star icon Empty star icon 0
(0 Ratings)
5 star 0%
4 star 0%
3 star 0%
2 star 0%
1 star 0%

Filter reviews by


No reviews found
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.