Pipeline Architecture
Pipeline Architecture
The Aggregation Framework is a powerful data processing engine that allows you to transform and combine data from multiple documents into a single result. It operates on the concept of a Pipeline, where documents pass through a series of stages.
🏗️ 1. The Pipeline Concept
Think of the aggregation pipeline as an assembly line.
- Input: Documents from a collection enter the pipeline.
- Stages: Each stage transforms the document stream (e.g., filtering, grouping, sorting).
- Output: The final result is returned as a stream of documents or saved to a new collection.
Mongo Shell Example
db.orders.aggregate([
// Stage 1: Filter active orders
{ $match: { status: "A" } },
// Stage 2: Group by customer and sum total amount
{ $group: {
_id: "$customer_id",
totalSpent: { $sum: "$amount" }
} },
// Stage 3: Sort by totalSpent descending
{ $sort: { totalSpent: -1 } }
]);PyMongo Example
pipeline = [
{"$match": {"status": "A"}},
{"$group": {"_id": "$customer_id", "totalSpent": {"$sum": "$amount"}}},
{"$sort": {"totalSpent": -1}}
]
results = list(db.orders.aggregate(pipeline))🚀 2. Essential Pipeline Stages
$match (Filtering)
Filters the documents to pass only the documents that match the specified condition(s) to the next pipeline stage.
$group (Aggregation)
Groups input documents by the specified _id expression and applies the accumulator expression(s).
$sum: Calculates the sum.$avg: Calculates the average.$push: Collects values into an array.
$project (Reshaping)
Passes along the documents with the requested fields to the next stage in the pipeline. You can add new fields or remove existing ones.
{
$project: {
name: 1,
totalPrice: { $multiply: ["$price", "$quantity"] }
}
}$sort (Ordering)
Sorts all input documents and returns them to the pipeline in sorted order. (1 for ascending, -1 for descending).
⚡ 3. Performance & Optimization
- Memory Limits: Each stage is limited to 100MB of RAM. If you exceed this, MongoDB will throw an error unless you use
{ allowDiskUse: true }. - Index Usage: Only the
$matchand$sortstages at the very beginning of the pipeline can take advantage of indexes. - Pipeline Sequencing:
$match->$sort->$groupis generally more efficient than sorting after grouping.
💡 Why Use Aggregation?
- Efficiency: Processing happens on the database server, reducing the amount of data sent over the network.
- Complexity: Handles transformations that would be extremely difficult or slow to do in application logic.
- Real-time: Perfect for generating reports, dashboards, and live metrics.