Skip to content

Pipeline Architecture

Pipeline Architecture

The Aggregation Framework is a powerful data processing engine that allows you to transform and combine data from multiple documents into a single result. It operates on the concept of a Pipeline, where documents pass through a series of stages.


🏗️ 1. The Pipeline Concept

Think of the aggregation pipeline as an assembly line.

  1. Input: Documents from a collection enter the pipeline.
  2. Stages: Each stage transforms the document stream (e.g., filtering, grouping, sorting).
  3. Output: The final result is returned as a stream of documents or saved to a new collection.

Mongo Shell Example

db.orders.aggregate([
  // Stage 1: Filter active orders
  { $match: { status: "A" } },
  
  // Stage 2: Group by customer and sum total amount
  { $group: { 
      _id: "$customer_id", 
      totalSpent: { $sum: "$amount" } 
  } },
  
  // Stage 3: Sort by totalSpent descending
  { $sort: { totalSpent: -1 } }
]);

PyMongo Example

pipeline = [
    {"$match": {"status": "A"}},
    {"$group": {"_id": "$customer_id", "totalSpent": {"$sum": "$amount"}}},
    {"$sort": {"totalSpent": -1}}
]

results = list(db.orders.aggregate(pipeline))

🚀 2. Essential Pipeline Stages

$match (Filtering)

Filters the documents to pass only the documents that match the specified condition(s) to the next pipeline stage.

$group (Aggregation)

Groups input documents by the specified _id expression and applies the accumulator expression(s).

  • $sum: Calculates the sum.
  • $avg: Calculates the average.
  • $push: Collects values into an array.

$project (Reshaping)

Passes along the documents with the requested fields to the next stage in the pipeline. You can add new fields or remove existing ones.

{ 
  $project: { 
    name: 1, 
    totalPrice: { $multiply: ["$price", "$quantity"] } 
  } 
}

$sort (Ordering)

Sorts all input documents and returns them to the pipeline in sorted order. (1 for ascending, -1 for descending).


⚡ 3. Performance & Optimization

  1. Memory Limits: Each stage is limited to 100MB of RAM. If you exceed this, MongoDB will throw an error unless you use { allowDiskUse: true }.
  2. Index Usage: Only the $match and $sort stages at the very beginning of the pipeline can take advantage of indexes.
  3. Pipeline Sequencing:
    • $match -> $sort -> $group is generally more efficient than sorting after grouping.

💡 Why Use Aggregation?

  • Efficiency: Processing happens on the database server, reducing the amount of data sent over the network.
  • Complexity: Handles transformations that would be extremely difficult or slow to do in application logic.
  • Real-time: Perfect for generating reports, dashboards, and live metrics.