Online vs. Batch Serving
The architecture for delivering a model depends on whether you need a result in milliseconds or in hours.
🟢 Level 1: Online Serving (Real-Time)
The model is exposed as a REST or gRPC API.
- Latency: < 100ms.
- Tools: FastAPI, BentoML, TorchServe.
- Workflow:
- Client sends JSON request.
- Server performs preprocessing.
- Server runs inference.
- Server returns JSON response.
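The request path above can be sketched as three plain functions. This is a minimal, framework-free illustration: `preprocess`, `run_inference`, and `handle_request` are hypothetical names, and the "model" is a trivial stand-in rather than real inference.

```python
import json

def preprocess(payload: dict) -> str:
    # Validate and normalize the incoming JSON body.
    text = payload.get("text", "")
    return text.strip().lower()

def run_inference(features: str) -> float:
    # Stand-in for model.predict(); a real service would call the loaded model here.
    return min(len(features) / 100.0, 1.0)

def handle_request(body: str) -> str:
    # The three server-side steps: preprocess, run inference, serialize the response.
    payload = json.loads(body)
    features = preprocess(payload)
    score = run_inference(features)
    return json.dumps({"score": score})

print(handle_request('{"text": "  Hello World  "}'))  # {"score": 0.11}
```

In FastAPI, `handle_request` would become a path operation function and the JSON parsing and serialization would be handled by Pydantic models, but the preprocess-infer-respond shape stays the same.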
High-Speed Preprocessing
Standard Python can be a bottleneck for preprocessing (e.g., text tokenization). Under high traffic, consider:
- Rust/Go Sidecar: Handle data cleaning in a fast language.
- Triton Inference Server: Optimized C++ engine for model execution.
🟡 Level 2: Batch Serving (Asynchronous)
The model processes a large dataset all at once.
- Latency: Not a concern (minutes to hours).
- Throughput: Massive (millions of rows).
- Tools: Spark, Airflow, Dask.
- Workflow:
- Scheduler triggers job at 2 AM.
- Load 10M rows from Snowflake/Parquet.
- Distribute inference across 100 worker nodes.
- Write results back to the database.
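The batch workflow above can be sketched as a single-process chunked loop. Everything here is a stand-in: `read_rows` replaces the Snowflake/Parquet load, `predict_batch` replaces real model inference, and in Spark or Dask each chunk would be dispatched to a worker instead of processed serially.

```python
from typing import Iterator

CHUNK_SIZE = 4  # real jobs use far larger chunks (e.g., 100k rows)

def read_rows() -> Iterator[dict]:
    # Stand-in for loading rows from the warehouse.
    for i in range(10):
        yield {"id": i, "value": i * 1.5}

def chunked(rows: Iterator[dict], size: int) -> Iterator[list]:
    # Group the row stream into fixed-size batches.
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def predict_batch(batch: list) -> list:
    # Stand-in for vectorized model inference over one chunk.
    return [{"id": r["id"], "prediction": r["value"] * 2} for r in batch]

def run_job() -> list:
    results = []
    for batch in chunked(read_rows(), CHUNK_SIZE):
        # In a distributed engine, each batch would go to a separate worker node.
        results.extend(predict_batch(batch))
    return results  # a real job would write these back to the database

print(len(run_job()))  # 10
```

A scheduler such as Airflow would invoke `run_job` on a cron schedule; the function itself stays oblivious to when or how often it runs.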
🔴 Level 3: The Hybrid (Request-Response Batch)
Used when you have many requests but don't need instant results.
Streaming Inference
- Tools: Kafka, Flink.
- Workflow:
- Request is pushed to a Kafka topic.
- Model service consumes the topic and performs inference.
- Result is pushed to an "Output" topic.
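The consume-infer-publish loop can be sketched with in-memory queues standing in for Kafka topics. The queues, the `predict` stand-in, and `consume_and_infer` are all illustrative names; a real service would use a Kafka client library with the same consume/produce shape, running the loop continuously.

```python
import queue

# queue.Queue stands in for a Kafka topic in this single-process sketch.
requests_topic = queue.Queue()
output_topic = queue.Queue()

def predict(payload: dict) -> float:
    # Stand-in for model inference.
    return len(payload.get("text", "")) / 10.0

def consume_and_infer():
    # The model service's loop: consume a message, infer, publish the result.
    while not requests_topic.empty():
        msg = requests_topic.get()
        output_topic.put({"request_id": msg["request_id"], "score": predict(msg)})

# Clients push requests instead of blocking on an HTTP response.
requests_topic.put({"request_id": "a1", "text": "hello"})
requests_topic.put({"request_id": "a2", "text": "streaming"})
consume_and_infer()

results = []
while not output_topic.empty():
    results.append(output_topic.get())
print(results[0])  # {'request_id': 'a1', 'score': 0.5}
```

Because results carry a `request_id`, a downstream consumer can correlate each output message with the original request, which is what makes this pattern request-response rather than fire-and-forget.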