MongoDB - Indexing and Aggregation Tutorial

Welcome to the fourth chapter of the MongoDB tutorial (part of the MongoDB Developer and Administrator Course). This chapter will explain how to create and manage different types of indexes in MongoDB to execute queries faster.

Let us explore the objectives of this lesson in the next section.

Objectives

After completing this lesson, you will be able to:

  • Explain how to create unique, compound, sparse, text, and geospatial indexes in MongoDB

  • Explain the process of checking the indexes used by MongoDB when retrieving the documents from the database

  • Identify the steps to create, remove, and modify indexes

  • Explain how to manage indexes by listing, modifying, and dropping

  • Identify different kinds of aggregation tools available in MongoDB

  • Explain how to use MapReduce to perform complex aggregation operations in MongoDB.

We will begin with an introduction to Indexing in the next section.

Introduction to Indexing

Indexes are data structures that store a small portion of a collection’s data set in a form that is easy to traverse. MongoDB uses indexes to execute queries efficiently.

Indexes help MongoDB find documents that match the query criteria without performing a collection scan. If a query has an appropriate index, MongoDB uses the index and limits the number of documents it examines.

Indexes store field values ordered by value. This ordering of index entries supports operations such as equality matches and range-based queries. MongoDB can also return sorted results by using the order of the index entries.
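As a rough illustration of why ordered entries help, the plain JavaScript sketch below models an index as a sorted array of value and document id pairs and answers a range query with a binary search. It is a toy model only, not MongoDB's actual index implementation; the function names buildIndex, lowerBound, and rangeScan and the sample documents are assumptions made for this example.

```javascript
// Toy model of an ordered index: [value, docId] pairs sorted by value.
function buildIndex(docs, field) {
  return docs
    .filter((doc) => doc[field] !== undefined)
    .map((doc) => [doc[field], doc._id])
    .sort((a, b) => a[0] - b[0]);
}

// Binary search: first position whose value is >= target.
function lowerBound(index, target) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid][0] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Ids of documents with min <= value <= max, visited in index order.
function rangeScan(index, min, max) {
  const ids = [];
  for (let i = lowerBound(index, min); i < index.length && index[i][0] <= max; i++) {
    ids.push(index[i][1]);
  }
  return ids;
}

// Hypothetical sample data.
const sampleDocs = [
  { _id: 1, soldQty: 50 },
  { _id: 2, soldQty: 10 },
  { _id: 3, soldQty: 30 },
  { _id: 4, soldQty: 70 },
];
const soldQtyIndex = buildIndex(sampleDocs, "soldQty");
console.log(rangeScan(soldQtyIndex, 20, 60));
```

Because the entries are sorted, the scan starts at the first matching entry and stops at the first entry past the range, so only the matching portion of the index is examined.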

The indexes of MongoDB are similar to the indexes in other databases. MongoDB defines indexes at the collection level and supports indexes on any field or subfield of the documents in a collection.

In the next section, we will discuss index types.

Types of Index

MongoDB supports the following index types for querying.

Default _id: Each MongoDB collection contains an index on the default _id (read as underscore id) field. If no value is specified for _id, the language driver or the mongod (read as mongo D) creates an _id field and provides an ObjectId (read as Object ID) value.

Single Field: For a single-field index and sort operation, the sort order of the index keys does not matter. MongoDB can traverse the index in either ascending or descending order.

Compound Index: For multiple fields, MongoDB supports user-defined indexes, such as compound indexes. The sequential order of fields in a compound index is significant in MongoDB.

Multikey Index: To index array data, MongoDB uses multikey indexes. When indexing a field with an array value, MongoDB makes separate index entries for each array element.

Geospatial Index: To query geospatial data, MongoDB uses two types of indexes: 2d indexes (read as two D indexes) and 2dsphere (read as two D sphere) indexes.

Text Indexes: These indexes support searching for string content in a collection.

Hashed Indexes: MongoDB supports hash-based sharding and provides hashed indexes. These indexes store hashes of the field values.

We will discuss the index types in detail later in the lesson. In the next section, we will discuss the index properties.

Properties of Index

Following are the index properties of MongoDB.

Unique Indexes

The unique property of MongoDB indexes ensures that duplicate values for the indexed field are rejected. Apart from this constraint, unique indexes are functionally interchangeable with other MongoDB indexes.

Sparse Indexes

This property ensures that the index contains entries only for documents that have the indexed field. Documents without the indexed field are skipped. A sparse index can be combined with a unique index to reject documents with duplicate field values while ignoring documents without the indexed key.

Time to Live or TTL Indexes

These are special indexes in MongoDB used to automatically delete documents from a collection after a specified duration of time. This is ideal for deleting information, such as machine-generated data, event logs, and session data that needs to be in the database for a shorter duration.

In the next section, we will discuss Single field Index.

Single Field Index

MongoDB supports indexes on any document field in a collection. By default, the _id field in all collections has an index. In addition, applications and users can add indexes to support important queries and operations.

MongoDB supports both single-field and multiple-field indexes, depending on the operations the index needs to support.

db.items.createIndex( { "item" : 1 } )

The command given above is used to create an index on the item field for the items collection.

In the next section, we will discuss how to create single field indexes on embedded documents.

Single Field Index on Embedded Document

You can index top-level fields within a document. Similarly, you can create indexes within embedded document fields.

{ "_id" : 3, "item" : "Book", "available" : true, "soldQty" : 144821, "category" : "NoSQL", "details" : { "ISDN" : "1234", "publisher" : "XYZ Company" }, "onlineSale" : true }

The structure shown above refers to a document stored in a collection. In the document, the details field depicts an embedded document that has two embedded fields— ISDN and publisher.

db.items.createIndex( { "details.ISDN": 1 } )

To create an index on the ISDN field of the embedded document called “details”, run the command shown above. The dotted field name must be enclosed in quotes.

In the next section, we will discuss compound indexes.

Compound Indexes

MongoDB supports compound indexes to query multiple fields. A compound index is a single index that references multiple fields, which are listed in the index specification separated by commas.

db.products.createIndex( { "item": 1, "stock": 1 } )

The command shown above is an example of a compound index on two fields. 

In this index, the documents are first sorted by the item field value and then, within each item field value, they are further sorted by the stock field values.

As another example, consider a compound index on the fields userid and score. The documents are first organized by userid and, within each userid, the scores are organized in descending order. The sort order of fields in a compound index is crucial.

For a compound index, MongoDB limits the fields to a maximum of 31.  

In the next section, we will discuss Index prefixes.

Index Prefixes

Index prefixes are the beginning subsets of the indexed fields. Each prefix starts from the first field of the compound index.

{ "item": 1, "available": 1, "soldQty": 1 }

For example, consider the compound index given above.  

Its index prefixes are { item: 1 } and { item: 1, available: 1 }. MongoDB can use the compound index even if a find query specifies only the index prefix fields. It can use the index for queries on the item field; on the item and available fields; and on the item, available, and soldQty (read as sold quantity) fields.

MongoDB can also use the index for a query on the item and soldQty fields, because item is a prefix. However, the index is not as efficient for this query as an index on only item and soldQty would be. Because item is the first field of the compound index and part of every prefix, a find query must include the item field for the index to be used.
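The prefix rule can be sketched in plain JavaScript. This is a simplified model under the assumption that a query is described only by the list of fields it filters on; MongoDB's query planner considers more than this, and the function names indexPrefixes and leadingFieldsCovered are hypothetical.

```javascript
// List the prefixes of a compound index, given its fields in index order.
function indexPrefixes(fields) {
  return fields.slice(0, -1).map((_, i) => fields.slice(0, i + 1));
}

// Count how many leading index fields a query's fields cover.
// 0 means the query does not include the first field of the index.
function leadingFieldsCovered(indexFields, queryFields) {
  let count = 0;
  for (const field of indexFields) {
    if (!queryFields.includes(field)) break;
    count++;
  }
  return count;
}

const indexFields = ["item", "available", "soldQty"];
console.log(indexPrefixes(indexFields));
console.log(leadingFieldsCovered(indexFields, ["item", "available"])); // both fields form a prefix
console.log(leadingFieldsCovered(indexFields, ["item", "soldQty"])); // only item is usable as a prefix
console.log(leadingFieldsCovered(indexFields, ["available", "soldQty"])); // no prefix at all
```

In this model, a query on item and soldQty covers only the first index field, which is why the compound index helps less than a dedicated index on those two fields would.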

We will discuss Sort Order in the next section.

Sort Order

In MongoDB, you can use the sort operations to manage the sort order. You can retrieve documents based on the sort order in an index.

Following are the characteristics of a sort order:

  • If sorted documents cannot be obtained from an index, the results are sorted in memory.

  • Sort operations executed using an index show better performance than those executed without using an index.

  • Sort operations performed without an index are terminated after they use 32 MB of memory.

  • Indexes store field references in the ascending or descending sort order.

  • Sort order is not important for single-field indexes because MongoDB can traverse the index in either direction.

  • Sort order is important for compound indexes because it helps determine if the index can support a sort operation.
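The last point can be made concrete with a small sketch. It assumes the common rule that a sort on a prefix of the index keys is supported when every direction matches the index key pattern or every direction is its exact inverse; other cases, such as sorts that rely on equality conditions in the query, are not modeled. The function name indexSupportsSort is hypothetical.

```javascript
// indexSpec and sortSpec are arrays of [field, direction] pairs; direction is 1 or -1.
function indexSupportsSort(indexSpec, sortSpec) {
  if (sortSpec.length === 0 || sortSpec.length > indexSpec.length) return false;
  // The sort keys must appear in the same order as the leading index keys.
  const sameFields = sortSpec.every(([field], i) => indexSpec[i][0] === field);
  if (!sameFields) return false;
  // Either all directions match the index, or all are reversed.
  const forward = sortSpec.every(([, dir], i) => dir === indexSpec[i][1]);
  const reverse = sortSpec.every(([, dir], i) => dir === -indexSpec[i][1]);
  return forward || reverse;
}

const userScoreIndex = [["userid", 1], ["score", -1]];
console.log(indexSupportsSort(userScoreIndex, [["userid", 1], ["score", -1]])); // same pattern
console.log(indexSupportsSort(userScoreIndex, [["userid", -1], ["score", 1]])); // exact inverse
console.log(indexSupportsSort(userScoreIndex, [["userid", 1], ["score", 1]])); // mixed directions
```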

In the next section, we will discuss how to ensure that indexes fit in the Random Access Memory or RAM.

Ensure Indexes Fit RAM

To process queries faster, ensure that your indexes fit into your system RAM. This will help the system avoid reading the indexes from the hard disk.

To check the index size, use the db.collection.totalIndexSize() method, which returns the size in bytes. To ensure the indexes fit in your RAM, you must have more RAM available than this size. In addition, you must have RAM available for the rest of the working set.

For multiple collections, check the size of all indexes across all collections. The indexes and the working sets both must fit in the RAM simultaneously.  

In the next section, we will discuss multikey indexes.

Multi-Key Indexes

When indexing a field containing an array value, MongoDB creates separate index entries for each array element. Multikey indexes allow queries to select documents that contain arrays by matching on elements of those arrays.

You can construct multikey indexes for arrays that hold scalar values, such as strings and numbers, as well as for arrays that hold nested documents.

db.coll.createIndex( { <field>: < 1 or -1 > } )

To create a multikey index, you can use the db.collection.createIndex() (read as D-B dot collection dot create Index) method given above.  

If the indexed field contains an array in any document, MongoDB automatically creates a multikey index. You need not specify the multikey type explicitly.
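The idea of one index entry per array element can be sketched as follows. This is a toy model in plain JavaScript, not MongoDB's implementation; it ignores edge cases such as empty arrays, and the names indexKeysFor and isMultikey and the sample documents are assumptions for illustration.

```javascript
// Return the index keys one document contributes for one field.
function indexKeysFor(doc, field) {
  const value = doc[field];
  // An array value yields one key per element; a scalar yields a single key.
  return Array.isArray(value) ? value.slice() : [value];
}

// In this model, an index is multikey as soon as any document holds an array in the field.
function isMultikey(docs, field) {
  return docs.some((doc) => Array.isArray(doc[field]));
}

// Hypothetical sample data.
const books = [
  { _id: 1, tags: ["nosql", "database"] },
  { _id: 2, tags: "mongodb" },
];
console.log(indexKeysFor(books[0], "tags")); // two keys, one per element
console.log(indexKeysFor(books[1], "tags")); // one key
console.log(isMultikey(books, "tags"));
```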

In the next section, we will discuss Compound multikey indexes.

Compound Multi-Key Indexes

In compound multikey indexes, each indexed document can have at most one indexed field with an array value. If more than one indexed field in a document has an array value, you cannot create a compound multikey index on those fields.

{ _id: 1, product_id: [ 1, 2 ], retail_id: [ 100, 200 ], category: "both fields are arrays" }

An example of a document structure is shown above. In this collection, both the product_id (read as product underscore ID) and retail_id (read as retail underscore ID) fields are arrays. Therefore, you cannot create a compound multikey index.

Note that a shard key index and a hashed index cannot be a multikey index.
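The one-array-field rule can be expressed as a simple check. The sketch below is a plain JavaScript model of the restriction described above, not a MongoDB API; the function name fitsCompoundMultikeyIndex is hypothetical.

```javascript
// A document fits a compound multikey index only if at most one
// of the indexed fields holds an array value.
function fitsCompoundMultikeyIndex(doc, indexFields) {
  const arrayFields = indexFields.filter((field) => Array.isArray(doc[field]));
  return arrayFields.length <= 1;
}

const bothArrays = { _id: 1, product_id: [1, 2], retail_id: [100, 200] };
const oneArray = { _id: 2, product_id: [1, 2], retail_id: 100 };

console.log(fitsCompoundMultikeyIndex(bothArrays, ["product_id", "retail_id"])); // rejected
console.log(fitsCompoundMultikeyIndex(oneArray, ["product_id", "retail_id"])); // accepted
```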

In the next section, we will discuss hashed indexes in detail.

Hashed Indexes

Following are the characteristics of hashed indexes.

  • The hashing function collapses embedded documents and computes the hash for the entire field value.

  • Hashed indexes do not support multikey indexes, that is, fields with array values.

  • Hashed indexes support sharding. Using a hashed shard key to shard a collection ensures a more even distribution of data.

  • Hashed indexes support equality queries. However, range queries are not supported.

You cannot create a unique or compound index by taking a field whose type is hashed. However, you can create a hashed and non-hashed index for the same field. MongoDB uses the scalar index for range queries.

db.items.createIndex( { item: "hashed" } )

You can create a hashed index using the operation given above. This will create a hashed index for the items collection on the item field. 

In the next section, we will discuss TTL indexes in detail.

TTL Indexes

TTL indexes automatically delete machine-generated data. You can create a TTL index by combining the db.collection.createIndex() method with the expireAfterSeconds option on a field whose value is either a date or an array that contains date values.

db.eventlog.createIndex( { "lastModifiedDate": 1 }, { expireAfterSeconds: 3600 } )

For example, to create a TTL index on the lastModifiedDate (read as last modified date) field of the eventlog collection, use the operation shown above in the mongo shell.

The TTL background thread runs on both primary and secondary nodes. However, it deletes documents only from the primary node.  

TTL indexes have the following limitations.

  • Compound indexes do not support TTL and ignore the expireAfterSeconds option.

  • The _id field does not support TTL indexes.

  • TTL indexes cannot be created on a capped collection because MongoDB cannot delete documents from a capped collection.

  • You cannot use the createIndex() (read as create index) method to change the value of expireAfterSeconds of an existing index.

You cannot create a TTL index for a field if a non-TTL index already exists for the same field. If you want to change a non-TTL single-field index to a TTL index, first drop the index and recreate the index with the expireAfterSeconds option.
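The expiry rule can be modeled in a few lines of plain JavaScript. The sketch assumes that a document expires once the current time passes the indexed date plus expireAfterSeconds, that the earliest date is used when the field holds an array of dates, and that documents without a date value never expire. The function name isExpired is hypothetical, and the real deletion is done by a background task, so documents may remain for a short time after they expire.

```javascript
// Decide whether a document would be eligible for TTL deletion at time `now`.
function isExpired(doc, field, expireAfterSeconds, now) {
  const value = doc[field];
  const dates = (Array.isArray(value) ? value : [value]).filter((v) => v instanceof Date);
  if (dates.length === 0) return false; // no date value: the document never expires
  // With an array of dates, the earliest date sets the expiration threshold.
  const earliest = Math.min(...dates.map((d) => d.getTime()));
  return now.getTime() >= earliest + expireAfterSeconds * 1000;
}

const now = new Date("2024-01-01T02:00:00Z");
const oldEvent = { lastModifiedDate: new Date("2024-01-01T00:30:00Z") };
const recentEvent = { lastModifiedDate: new Date("2024-01-01T01:30:00Z") };

console.log(isExpired(oldEvent, "lastModifiedDate", 3600, now)); // older than one hour
console.log(isExpired(recentEvent, "lastModifiedDate", 3600, now)); // within one hour
```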

In the next section, we will be discussing creating unique indexes.

Unique Indexes

To create a unique index, use the db.collection.createIndex() method and set the unique option to true.

db.items.createIndex( { "item": 1 }, { unique: true } )

For example, to create a unique index on the item field of the items collection, execute the operation shown above in the mongo shell. By default, unique is false on MongoDB indexes.

If you use the unique constraint on a compound index, MongoDB enforces uniqueness on the combination of the index key values rather than on each individual field.

Unique Index and Missing Field

If the indexed field in a unique index has no value, the index stores a null value for the document. Because of this unique constraint, MongoDB permits only one document without the indexed field.

In case there is more than one document with a valueless or missing indexed field, the index build process will fail and will display a duplicate key error. To filter these null values and avoid error, combine the unique constraint with the sparse index.

In the next section, we will discuss sparse indexes.

Sparse Indexes

Sparse indexes contain entries only for documents that have the indexed field, even if that field contains a null value. A sparse index skips documents that do not contain the indexed field. Non-sparse indexes do not skip these documents and store null values for them.

To create a sparse index, use the db.collection.createIndex() method and set the sparse option to true.

db.addresses.createIndex( { "xmpp_id": 1 }, { sparse: true } )

In the example given above, the operation in the mongo shell creates a sparse index on the xmpp_id field of the addresses collection. If using a sparse index would return an incomplete result set, MongoDB does not use that index unless it is specified in the hint method.

{ x: { $exists: false } }

For example, a query with the condition given above will not use a sparse index on the x field unless it receives an explicit hint.

An index that combines both sparse and unique does not allow the collection to include documents having duplicate field values for a single field. However, it allows multiple documents that omit the key. 
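The interaction between unique and sparse can be sketched with a toy model. It assumes that a non-sparse index stores null for a missing field and that a sparse index adds no entry for such documents. Keys are compared as simple primitive values, which is a simplification, and the function name findDuplicateKey is hypothetical.

```javascript
// Return the first key that would violate a unique index, or undefined if none.
function findDuplicateKey(docs, field, { unique = false, sparse = false } = {}) {
  const seen = new Set();
  for (const doc of docs) {
    const hasField = Object.prototype.hasOwnProperty.call(doc, field);
    if (!hasField && sparse) continue; // sparse: no entry for this document
    const key = hasField ? doc[field] : null; // non-sparse: missing field is stored as null
    if (unique && seen.has(key)) return key; // MongoDB would report a duplicate key error here
    seen.add(key);
  }
  return undefined;
}

// Hypothetical sample data: two documents omit the indexed field.
const addresses = [{ _id: 1, xmpp_id: "a@example.com" }, { _id: 2 }, { _id: 3 }];

console.log(findDuplicateKey(addresses, "xmpp_id", { unique: true })); // null is duplicated
console.log(findDuplicateKey(addresses, "xmpp_id", { unique: true, sparse: true })); // no conflict
```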

In the next section, we will discuss text indexes.

Text Indexes

Text indexes in MongoDB help search for text strings in documents of a collection. You can create a text index on one or more fields containing string values or arrays of strings.

To use a text index, run a query with the $text (read as text) query operator. When you create text indexes for multiple fields, specify the individual fields or use the wildcard specifier ($**).

db.collection.createIndex({subject: "text",content: "text"})

To create text indexes on the subject and content fields, perform the query given above. The text index organizes all strings in the subject and content field, where the field value is either a string or an array of string elements.  

To allow text search for all fields with strings, use the wildcard specifier ($**). This indexes all fields containing string content.

db.collection.createIndex({ "$**": "text" },{ name: "TextIndex" })

The second example given above indexes any string value available in each field of each document in a collection and names the indexes as TextIndex.

 In the next section, we will discuss text search in MongoDB.

Text Search

MongoDB supports various languages for text search. The text indexes drop language-specific stop words, such as “the”, “an”, “a”, and “and” in English, and use simple language-specific suffix stemming. You can also choose to specify a language for text search.

If you specify the language value as "none", then the text index uses simple tokenization with no list of stop words and no stemming.

db.customer_info.createIndex( { "item": "text" }, { default_language: "spanish" } )

In the query given above, you are enabling the text search option for the item field of the customer_info collection with Spanish as the default language. 

If the index language is English, text indexes are case-insensitive for the letters A to Z.

The text index and the $text operator support the following:

  • Two-letter language codes defined in ISO 639-1 (read as I-S-O 6-3-9-1).

  • Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish.

Note that a compound text index cannot include special index types, such as multikey or geospatial index fields.

In the next section, we will discuss index creation.

Index Creation

MongoDB provides several options to create indexes. By default, when indexes are created, all other operations on a database are blocked. 

For example, when indexes on a collection are created, the database becomes unavailable for any read or write operation until the index creation process completes. 

Read and write operations on the database queue until the index build completes. Therefore, for index builds that may take a long time, consider building the index in the background so that MongoDB remains available during the entire operation.

db.items.createIndex( {item:1},{background: true})

db.items.createIndex({category:1}, {sparse: true, background: true})

The commands given above are used for this purpose. By default, background is false for building MongoDB indexes.

We will discuss index creation further in the next section.

Index Creation (contd.)

When MongoDB is creating indexes in the background for a collection, you cannot perform other administrative operations involving that collection.

For example, you cannot run repairDatabase (read as repair database), drop the collection by using db.collection.drop() (read as D-B dot collection dot drop), or run compact.

If you perform any of these operations, you will receive an error.

The index build process in the background uses an incremental approach and is slower than the normal “foreground” index build process. The speed of the index build process depends on the size of the index. If the index size is bigger than the RAM of the system, the process takes more time than the foreground process.

Building indexes can severely impact your database performance if the application includes createIndex() (read as create index) operations and an index is not already available for other operational concerns.

To avoid any performance issues, you can use the getIndexes()(read as get indexes) method to ensure that your application checks for the indexes at the startup.

You can also use an equivalent method for your driver and ensure that the application terminates an operation if the proper indexes do not exist. When building indexes, use separate application code and designated maintenance windows.

We will discuss how to create indexes on replica sets in the next section.

Index Creation on Replica Set

Typically, background index operations on the secondary members of a replica set begin after the index building process completes on the primary.

If the index build process is running in the background on the primary, the same will happen on the secondary nodes as well.

If you want to build large indexes on secondaries, you can build the index by restarting one secondary at a time in a standalone mode.

After the index build is complete, restart as a member of the replica set, allow it to catch up with the other members of the set, and then build the index on the next secondary. When all the secondaries have the new index, step down the primary, restart it as a standalone, and build the index on the former primary.

To ensure that the secondary can catch up with the primary, the time taken to build the index on a secondary must be within the window of the oplog. Index creation on secondary nodes in the “recovering” mode always happens in the foreground so that they can catch up with the primary node.

db.products.createIndex( { item: 1, quantity: -1 } , { name: "inventory" } )

Instead of using the default name, you can specify a name for the index by using the command given above. This creates an index on the item and quantity fields of the products collection and names it inventory.

In the next section, we will discuss how to remove indexes.

Remove Indexes

You can use the following methods to remove indexes.  

db.collection.dropIndex() (read as drop index) method: This removes a specified index from a collection.

db.accounts.dropIndex( { "tax-id": 1 } )

For example, the first operation given above removes an ascending index on the tax-id field in the accounts collection.

db.collection.dropIndexes()

To remove all indexes barring the _id index from a collection, use the second operation provided above.

In the next section, we will discuss how to modify an index.

Modify Indexes

To modify an index, first, drop the index and then recreate it. Perform the following steps to modify an index.

Drop Index: Execute the query given below to return a document showing the operation status.

db.orders.dropIndex({ "cust_id" : 1, "ord_date" : -1, "items" : 1 })

Recreate the Index: Execute the query given below to return a document showing the status of the results.

db.orders.createIndex({ "cust_id" : 1, "ord_date" : -1, "items" : -1 })

Rebuild Indexes

In addition to modifying indexes, you can also rebuild them. To rebuild all indexes of a collection, use the db.collection.reIndex() method. This will drop all indexes including _id and rebuild all indexes in a single operation. The operation takes the form db.items.reIndex(). 

To view the indexing process status, type the db.currentOp() (read as D B dot Current operation) command in the mongo shell. The message field will show the percentage of the build completion.

To abort an ongoing index build process, use the db.killOp() (read as D B dot kill operation) method in the mongo shell. For index builds, the effects of db.killOp() may not be immediate and may occur after most of the index build operation has completed.

Note that a replicated index build on a secondary member of a replica set cannot be aborted.

In the next section, we will discuss Listing Indexes.

Listing Indexes

You can list all indexes of a collection and a database. You can get a list of all indexes of a collection by using the db.collection.getIndexes() method or a similar method for your drivers.

 For example, to view all indexes on the items collection, use the db.items.getIndexes() method.

db.getCollectionNames().forEach(function(collection) {
    var indexes = db[collection].getIndexes();
    print("Indexes for " + collection + ":");
    printjson(indexes);
});

To list all indexes of collections, you can use the operation in the mongo shell as shown above.

In the next section, we will discuss how to measure index usage.

Measure Index Use

Typically, query performance is a good general indicator of index use. MongoDB provides a number of tools to study query operations and observe index use for your database.

The explain() method can be used to print information about query execution. The explain method returns a document that explains the process and indexes used to return a query. This helps to optimize a query.

Using the db.collection.explain() or the cursor.explain() method helps measure index usages.

In the next section, we will discuss how to control index usage.

Control Index Use

In case you want to force MongoDB to use particular indexes for querying documents, then you need to specify the index with the hint() method.

The hint method can be appended to the find() method.

db.items.find( { item: "Book", available: true } ).hint( { item: 1 } )

Consider the example given above. This command queries documents whose item field value is “Book” and whose available field value is true. Here, MongoDB’s query planner is directed to use the index created on the item field.

To view the execution statistics for a specific index, use the explain method in the find command.

db.items.find( { item: "Book", available: true } ).hint( { item: 1 } ).explain("executionStats")

db.items.explain("executionStats").find( { item: "Book", available: true } ).hint( { item: 1 } )

For example, consider the queries given above.

If you want to prevent MongoDB from using any index, pass the $natural (read as natural) operator to the hint() method.

db.items.find( { item: "Book", available: true } ).hint( { $natural: 1 } ).explain("executionStats")

For example, use the query given above.

In the next section, we will discuss index use reporting.

Index Use Reporting

MongoDB provides different metrics to report index use and operation. You can consider these metrics when analyzing index use for your database. These metrics are printed using the following commands.

serverStatus: serverStatus prints the following two metrics.

scanned: Displays the number of documents that MongoDB scans in the index to carry out the operation. If the number of scanned documents is much higher than the number of returned documents, the database has scanned many objects to find the target objects. In such cases, consider creating an index to improve this.

scanAndOrder: A boolean that is true when a query cannot use the order of documents in the index to return sorted results. In that case, MongoDB must sort the documents after it receives them from a cursor. If scanAndOrder is false, MongoDB can use the order of the documents in an index to return the sorted results.

collStats: collStats prints the following two metrics.

totalIndexSize: Returns index size in bytes

indexSizes: Explains the size of the data allocated for an index

dbStats: dbStats has the following two metrics.

dbStats.indexes: Contains a count of the total number of indexes across all collections in the database

dbStats.indexSize: The total size in bytes of all indexes created in this database

In the next section, we will discuss geospatial index.

Geospatial Index

With the increased usage of handheld devices, geospatial queries are becoming increasingly frequent for finding the nearest data points for a given location.

MongoDB provides geospatial indexes to support such queries. Suppose you want to find the nearest coffee shop from your current location. You need to create a special index to efficiently perform such queries because they need to search in two dimensions: longitude and latitude.

A geospatial index is created using the createIndex function. You pass "2d" or "2dsphere" as the value instead of 1 or -1. To query geospatial data efficiently, you first need to create a geospatial index.

db.collection.createIndex( { <location field> : "2dsphere" } )

In the index specification document for the db.collection.createIndex() method, as shown above, specify the location field as the index key and specify the string literal "2dsphere" as the value.

A compound index can include a 2dsphere index key in combination with non-geospatial index keys.

In the next section, we will discuss MongoDB’s Geospatial Query Operators.

MongoDB’s Geospatial Query Operators

The geospatial query operators in MongoDB let you perform the following queries.

Inclusion Queries

  • Return the locations included within a specified polygon.

  • Use the operator $geoWithin. The 2d and 2dsphere indexes support this query.

  • Although MongoDB does not require an index to perform an inclusion query, an index can enhance the query performance.

Intersection Queries

  • Return locations intersecting with a specified geometry.

  • Use the $geoIntersects operator, which works with data on a spherical surface and is supported only by 2dsphere indexes.

Proximity Queries

  • Return the points closest to a specified point.

  • Use the $near operator that requires a 2d or 2dsphere index.

In the next section, we will discuss the $geoWithin operator.

$geoWithin Operator

The $geoWithin (read as geo within) operator is used to query location data found within a GeoJSON (read as geo J-SON) polygon. To get a response, the location data needs to be stored in the GeoJSON format.

db.<collection>.find( { <location field> : { $geoWithin : { $geometry : { type : "Polygon" , coordinates : [ <coordinates> ] } } } } )

You can use the syntax given above to use the $geoWithin operator.

db.places.find( { loc : { $geoWithin : { $geometry : { type : "Polygon" , coordinates : [ [ [ 0 , 0 ] , [ 3 , 6 ] , [ 6 , 1 ] , [ 0 , 0 ] ] ] } } } } )

The example given above selects all points and shapes that exist entirely within a GeoJSON polygon.
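To give a feel for what an inclusion test does, the sketch below checks whether a point lies inside the polygon from the example by using the standard ray-casting technique on a flat plane. This is only an approximation for illustration: MongoDB evaluates GeoJSON shapes with spherical geometry, and points exactly on the boundary are not handled reliably here. The function name pointInRing is hypothetical.

```javascript
// Planar ray casting: count how many polygon edges a horizontal ray from the point crosses.
// `ring` is a closed list of [x, y] coordinates, as in a GeoJSON polygon ring.
function pointInRing(point, ring) {
  const [x, y] = point;
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    // The edge crosses the ray if its endpoints are on opposite sides of y
    // and the crossing lies to the right of the point.
    const crosses =
      (yi > y) !== (yj > y) && x < ((xj - xi) * (y - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

const ring = [[0, 0], [3, 6], [6, 1], [0, 0]];
console.log(pointInRing([2, 2], ring)); // a point inside the triangle
console.log(pointInRing([5, 5], ring)); // a point outside the triangle
```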

We will discuss proximity queries in MongoDB in the next section.

Proximity Queries in MongoDB

Proximity queries return the points closest to the specified point. These queries sort the results by their proximity to the specified point.

You need to create a 2dsphere index in order to perform a proximity query on the GeoJSON data points. To query the data, you can use either the $near operator or the geoNear (read as geo near) command.

db.<collection>.find( { <location field> : { $near : { $geometry : { type : "Point" , coordinates : [ <longitude> , <latitude> ] } , $maxDistance : <distance in meters> } } } )

The first syntax given above is an example of the $near operator.

db.runCommand( { geoNear : <collection> , near : { type : "Point" , coordinates : [ <longitude> , <latitude> ] } , spherical : true } )

The geoNear command uses the second syntax given above. This command offers additional options and returns more information than the $near operator.

In the next section, we will discuss aggregation.

Aggregation

Operations that process data sets and return calculated results are called aggregations. MongoDB provides data aggregations that examine data sets and perform calculations on them. Aggregation is run on the mongod instance to simplify application code and limit resource requirements.

Similar to queries, aggregation operations in MongoDB use collections of documents as an input and return results in the form of one or more documents. The aggregation framework in MongoDB is based on data processing pipelines.

Documents pass through multi-stage pipelines and get transformed into an aggregated result. The most basic pipeline stage in the aggregation framework provides filters that function like queries. It also provides document transformations that modify the output document. 

The pipeline operations group and sort documents by defined field or fields. In addition, they perform aggregation on arrays.

Pipeline stages can use operators to perform tasks such as calculating an average or concatenating strings. The pipeline uses native operations within MongoDB to allow efficient data aggregation and is the favored method for data aggregation.

In the next section, we will continue our discussion on data aggregation.

Aggregation (contd.)

With the help of the aggregate function, you can perform complex aggregation operations, such as finding out the total transaction amount for each customer.

For example, consider a collection named “orders” whose documents have three fields: cust_id, amount, and status. In the $match (read as match) stage, you select only those documents in which the status field value is “A”. In the $group stage, you sum the “amount” field for each cust_id.
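The two stages described above can be written as a pipeline array, which in the mongo shell would be passed to db.orders.aggregate(). To keep the sketch runnable without a server, it is evaluated here by a tiny in-memory function that understands only equality $match and $group with $sum. The output field name total, the sample documents, and the helper runPipeline are assumptions made for this illustration.

```javascript
// The pipeline for the orders example; in the mongo shell: db.orders.aggregate(pipeline)
const pipeline = [
  { $match: { status: "A" } },
  { $group: { _id: "$cust_id", total: { $sum: "$amount" } } },
];

// Minimal in-memory evaluator (illustration only, not MongoDB's engine).
function runPipeline(docs, stages) {
  return stages.reduce((input, stage) => {
    if (stage.$match) {
      // Keep documents whose fields equal every value in the $match document.
      return input.filter((doc) =>
        Object.entries(stage.$match).every(([field, value]) => doc[field] === value)
      );
    }
    if (stage.$group) {
      const { _id, ...accumulators } = stage.$group;
      const groups = new Map();
      for (const doc of input) {
        const key = doc[_id.slice(1)]; // "$cust_id" refers to doc.cust_id
        if (!groups.has(key)) groups.set(key, { _id: key });
        const out = groups.get(key);
        for (const [name, spec] of Object.entries(accumulators)) {
          out[name] = (out[name] || 0) + doc[spec.$sum.slice(1)];
        }
      }
      return [...groups.values()];
    }
    throw new Error("unsupported stage");
  }, docs);
}

// Hypothetical sample documents.
const orders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" },
];
console.log(runPipeline(orders, pipeline));
```

The order with status "D" is dropped by $match before grouping, so it does not contribute to the total for its customer.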

In the next section, we will discuss pipeline operators and indexes.

Pipeline Operators and Indexes

The aggregate command in MongoDB functions on a single collection and logically passes the collection through the aggregation pipeline. You can optimize the operation and avoid scanning the entire collection by using the $match, $limit, and $kip stages.

You may require only a subset of data from a collection to perform an aggregation operation. Therefore, use the $match, $limit, and $skip stages to filter the documents. When placed at the beginning of a pipeline, the $match operation scans and selects only the matching documents in a collection.  

Placing a $match stage followed by a $sort stage at the start of the pipeline is logically equivalent to a single query with a sort, and it can use an index. Therefore, it is recommended to place $match stages at the beginning of the pipeline whenever possible.
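As a rough illustration of this ordering, the sketch below builds a pipeline that starts with $match and $sort. The collection name, field names, and the compound index in the comment are assumptions chosen for the example. Only the stage order is the point.

```javascript
// Hypothetical index that could support the first two stages:
//   db.orders.createIndex({ status: 1, amount: -1 })

// $match and $sort come first so that they can use the index;
// $limit then keeps only the first five sorted documents.
const indexedPipeline = [
  { $match: { status: "A" } },
  { $sort: { amount: -1 } },
  { $limit: 5 }
];

// In the mongo shell, run it and inspect the plan with:
//   db.orders.aggregate(indexedPipeline)
//   db.orders.aggregate(indexedPipeline, { explain: true })

// List the stage names in order.
const stageOrder = indexedPipeline.map(stage => Object.keys(stage)[0]);
console.log(stageOrder.join(" -> "));
// $match -> $sort -> $limit
```

If $match came after a stage that reshapes the documents, such as $group or $project, it would filter computed documents and could no longer be answered from an index on the collection.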

In the next section, we will discuss aggregate pipeline stages.

Aggregate Pipeline Stages

Pipeline stages appear in an array. Documents pass through the stages in sequence, one after the other. Barring $out and $geoNear, all stages of the pipeline can appear multiple times.

The db.collection.aggregate() (read as DB dot collection dot aggregate) method provides access to the aggregation pipeline. It returns a cursor, so it can return result sets of any size.

The various pipeline stages are as follows.  

$project: This stage adds new fields or removes existing fields and thus restructures each document in the stream. This stage returns one output document for each input document provided.

$match: It filters the document stream and allows only matching documents to pass into the next stage without any modification. $match uses the standard MongoDB queries. For each input document, it returns either one output document if there is a match or zero documents when there is no match.  

$group: This stage groups documents based on the specified identifier expression and applies logic known as accumulator expression to compute the output document.  

$sort: This stage rearranges the order of the document stream using specified sort keys. The documents remain unaltered even though the order changes. This stage provides one output document for each input document.  

$skip: This stage skips the first n documents where n is the specified skip number. It passes the remaining documents without any modifications to the pipeline. For each input document, it returns either zero documents for the first n documents or one document. 

$limit: It passes the first n number of documents without any modifications to the pipeline. For each input document, this stage returns either one document for the first n documents or zero documents after the first n documents. 

$unwind: It deconstructs an array field in the input documents to return a document for each element. Each output document replaces the array with an element value. For each input document, it returns n documents where n is the number of array elements and can be zero for an empty array. 
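The document counts described for $unwind, $skip, and $limit can be checked with a small sketch. The stages array is what would be passed to db.inventory.aggregate(). The collection name, the sample documents, and the plain JavaScript stand-ins for each stage are assumptions made for illustration only.

```javascript
// Hypothetical sample documents; the second one has an empty array.
const inventory = [
  { _id: 1, item: "shirt", sizes: ["S", "M", "L"] },
  { _id: 2, item: "hat", sizes: [] }
];

// The stages as they would be passed to the shell:
//   db.inventory.aggregate(stages)
const stages = [{ $unwind: "$sizes" }, { $skip: 1 }, { $limit: 2 }];

// Stand-in for $unwind: one output document per array element,
// so the document with the empty array produces no output.
const unwound = inventory.flatMap(doc =>
  doc.sizes.map(size => ({ ...doc, sizes: size }))
);

// Stand-in for $skip: drop the first document.
const skipped = unwound.slice(1);

// Stand-in for $limit: pass at most the first two documents.
const limited = skipped.slice(0, 2);

console.log(unwound.length);             // 3
console.log(limited.map(d => d.sizes));  // [ 'M', 'L' ]
```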

In the next section, we will look at an aggregation example.

Aggregation Example

The aggregation operation given below returns all states with a total population of at least 10 million.

db.zipcodes.aggregate( [
   { $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
   { $match: { totalPop: { $gte: 10*1000*1000 } } }
] )

This example depicts that the aggregation pipeline contains the $group stage followed by the $match stage. 

In this operation, the $group stage does three things:

  1. Groups the documents of the zip code collection under the state field

  2. Calculates the totalPop (read as the total population) field for each state

  3. Returns an output document for each unique state.

The new per-state documents contain two fields: the _id field and the totalPop field. The $match stage then filters these grouped documents and passes on only those whose totalPop value is greater than or equal to 10 million.

db.users.aggregate( [
   { $project: { month_joined: { $month: "$joined" }, name: "$_id", _id: 0 } },
   { $sort: { month_joined: 1 } }
] )

The second aggregation operation, shown above, returns user names sorted by the month in which they joined. This kind of aggregation could help generate membership renewal notices.

In the next section, we will discuss MapReduce.

MapReduce

MapReduce is a data processing model used for aggregation. To perform MapReduce operations, MongoDB provides the mapReduce database command.

A MapReduce operation consists of two phases.

Map stage: Documents are processed and one or more objects are produced for each input document.

Reduce stage: Outputs of the map operation are combined. Optionally, there can be an additional stage to make final modifications to the result.  

Similar to other aggregation operations, MapReduce can define a query condition to select the input documents, and sort and limit the results.

We will continue our discussion on MapReduce in the next section.

MapReduce (contd.)

MapReduce functions in MongoDB are written in JavaScript.

MapReduce operations:

  • Accept the documents of a single collection as input, perform any sort and limit operations, and then start the map stage. At the end of a MapReduce operation, the result is generated as documents, which can be saved in a collection.

  • Associate values with a key by using custom JavaScript functions. If a key has more than one mapped value, the operation reduces them to a single object, such as an array.

The use of custom JavaScript functions makes the MapReduce operations flexible. For example, the map function can generate more than one key and value when processing documents.  

Additionally, MapReduce operations can use a custom JavaScript function to make final modifications to the results at the end of the map and reduce operations, such as performing further calculations.

We will continue with MapReduce in the next section.

MapReduce (contd.)

MapReduce also works on sharded collections, so you can use it to perform many complex aggregation operations on them.

The orders collection has three fields: cust_id, amount, and status. If you want to find the total amount for each customer, you can use the MapReduce framework. In the map stage, cust_id is emitted as the key and amount as the value.

The values are then processed by the reduce stage, in which cust_id and an array of amounts are passed as input to each reducer. The reducer then calculates the total of the amounts and generates cust_id as the key and order_totals as the value.
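The map and reduce steps described above can be sketched as two JavaScript functions. The functions are written the way the mapReduce method expects them: the map function reads the current document through this and calls emit(), and the reduce function receives a key and an array of values. The sample documents and the local emit() stand-in are assumptions that let the sketch run outside MongoDB.

```javascript
// Hypothetical sample documents for the "orders" collection.
const orders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" }
];

// Local stand-in for the emit() function that MongoDB supplies to map
// functions: it collects the emitted values per key.
const emitted = {};
function emit(key, value) {
  (emitted[key] = emitted[key] || []).push(value);
}

// Map function: "this" is the current document.
function mapFunction() {
  emit(this.cust_id, this.amount);
}

// Reduce function: receives one key and the array of values emitted for it.
function reduceFunction(key, values) {
  return values.reduce((sum, amount) => sum + amount, 0);
}

// In the mongo shell the same two functions would be passed as:
//   db.orders.mapReduce(mapFunction, reduceFunction, { out: { inline: 1 } })

// Local stand-in for the map and reduce phases.
orders.forEach(doc => mapFunction.call(doc));
const orderTotals = Object.keys(emitted).map(key => ({
  _id: key,
  value: reduceFunction(key, emitted[key])
}));

console.log(orderTotals);
// [ { _id: 'A123', value: 750 }, { _id: 'B212', value: 200 } ]
```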

In the next section, we will discuss aggregation operations.

Aggregation Operations

Aggregations are operations that manipulate data and return a computed result based on the input document and a specific procedure. MongoDB performs aggregation operations on data sets.

These single-purpose operations have a limited scope compared to the aggregation pipeline and MapReduce.

Aggregation operations provide the following semantics for common data processing options.

Count

The count operation returns the number of documents matching a query. The count command, along with the two methods db.collection.count() and cursor.count(), provides access to counts in the mongo shell. For example, the db.customer_info.count() command counts all documents in the customer_info collection.

Distinct

The distinct operation searches for documents matching a query and returns all unique values for a field in the matched document. The distinct command and db.collection.distinct() method execute this operation in the mongo shell.

db.customer_info.distinct( "customer_name" )

The syntax given above is an example of a distinct operation.
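To make the difference between the two operations concrete, the following sketch mirrors count and distinct on an in-memory array. The sample documents are hypothetical, and the array expressions are only stand-ins for what the shell commands in the comments would return.

```javascript
// Hypothetical sample documents for the customer_info collection.
const customerInfo = [
  { customer_name: "Asha", city: "Pune" },
  { customer_name: "Ben", city: "Austin" },
  { customer_name: "Asha", city: "Delhi" }
];

// Shell: db.customer_info.count()
const totalCount = customerInfo.length;

// Shell: db.customer_info.count({ city: "Pune" })
const puneCount = customerInfo.filter(doc => doc.city === "Pune").length;

// Shell: db.customer_info.distinct("customer_name")
const distinctNames = [...new Set(customerInfo.map(doc => doc.customer_name))];

console.log(totalCount);     // 3
console.log(puneCount);      // 1
console.log(distinctNames);  // [ 'Asha', 'Ben' ]
```

Count returns a number, whereas distinct returns an array of the unique values, so the repeated name appears only once.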

In the next section, we will continue our discussion on aggregation operations.

Aggregation Operations (contd.)

The group operation accepts as input a set of documents that match the given query, applies the operation, and then returns an array of documents with the computed results.

The group operation does not support sharded collections. In addition, the results of the group operation must not exceed 16 megabytes.

For example, a group operation can group documents by the field ‘a’, where ‘a’ is less than three, and sum the field count for each group.
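A group specification of this kind might look like the sketch below. The collection name records and the sample documents are assumptions. Note that db.collection.group() belongs to the older MongoDB releases this tutorial covers and was removed in version 4.2, where the $group pipeline stage takes its place. The loop at the end is a plain JavaScript stand-in that applies the same reduce function locally.

```javascript
// Hypothetical sample documents for a "records" collection.
const records = [
  { a: 1, count: 4 },
  { a: 1, count: 2 },
  { a: 2, count: 3 },
  { a: 4, count: 1 }
];

// The specification as it would be passed to the shell in older releases:
//   db.records.group(groupSpec)
const groupSpec = {
  key: { a: 1 },                 // group by field "a"
  cond: { a: { $lt: 3 } },       // only documents where a < 3
  reduce: function (cur, result) { result.count += cur.count; },
  initial: { count: 0 }          // starting value for each group
};

// Plain JavaScript stand-in for the group operation (not MongoDB code).
const groups = new Map();
for (const doc of records) {
  if (!(doc.a < 3)) continue;    // mirrors the cond filter
  if (!groups.has(doc.a)) {
    groups.set(doc.a, { a: doc.a, ...groupSpec.initial });
  }
  groupSpec.reduce(doc, groups.get(doc.a));
}
const grouped = [...groups.values()];

console.log(grouped);
// [ { a: 1, count: 6 }, { a: 2, count: 3 } ]
```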

In the next section, we will view a demo on how to use the group function in MongoDB.

Summary

Here is a quick recap of what was covered in this chapter:

  • Indexes are data structures that store a data set in an easily traversable form.

  • Indexes help execute queries efficiently without performing a collection scan.

  • MongoDB supports the following index types: single field, compound, multikey, geospatial, text, and hashed.

  • For fast query operation, the system RAM must be able to accommodate index sizes.

  • You can create, modify, rebuild, and drop indexes.

  • The geospatial indexes help query geographic location by specifying a specific point.

  • The aggregation operations manipulate data and return a computed result based on the input and a specific procedure.

  • Aggregation functions in MongoDB help with operations such as calculating the total amount spent by a customer on an online shopping site.

Conclusion

This concludes the lesson Indexing and Aggregation in MongoDB. In the next lesson, we will discuss Replication and Sharding in MongoDB.
