Online vs. Batch Serving
The architecture for delivering a model depends on whether you need a result in milliseconds or in hours.
🟢 Level 1: Online Serving (Real-Time)
The model is exposed as a REST or gRPC API.
- Latency: < 100ms.
- Tools: FastAPI, BentoML, TorchServe.
- Workflow:
- Client sends JSON request.
- Server performs preprocessing.
- Server runs inference.
- Server returns JSON response.
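The request path above can be sketched as three plain functions. This is a minimal, framework-free illustration: `preprocess`, `run_inference`, and `handle_request` are hypothetical names, and the "model" is a trivial stand-in rather than real inference.

```python
import json

def preprocess(payload: dict) -> str:
    # Validate and normalize the incoming JSON body.
    text = payload.get("text", "")
    return text.strip().lower()

def run_inference(features: str) -> float:
    # Stand-in for model.predict(); a real service would call the loaded model here.
    return min(len(features) / 100.0, 1.0)

def handle_request(body: str) -> str:
    # The three server-side steps: preprocess, run inference, serialize the response.
    payload = json.loads(body)
    features = preprocess(payload)
    score = run_inference(features)
    return json.dumps({"score": score})

print(handle_request('{"text": "  Hello World  "}'))  # {"score": 0.11}
```

In FastAPI, `handle_request` would become a path operation function and the JSON parsing and serialization would be handled by Pydantic models, but the preprocess-infer-respond shape stays the same.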
High-Speed Preprocessing
Standard Python can be a bottleneck for preprocessing (e.g., text tokenization). Under high traffic, consider:
- Rust/Go Sidecar: Handle data cleaning in a fast language.
- Triton Inference Server: Optimized C++ engine for model execution.
🟡 Level 2: Batch Serving (Asynchronous)
The model processes a large dataset all at once.
- Latency: Not a concern (minutes to hours).
- Throughput: Massive (millions of rows).
- Tools: Spark, Airflow, Dask.
- Workflow:
- Scheduler triggers job at 2 AM.
- Load 10M rows from Snowflake/Parquet.
- Distribute inference across 100 worker nodes.
- Write results back to the database.
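The batch workflow above can be sketched as a single-process chunked loop. Everything here is a stand-in: `read_rows` replaces the Snowflake/Parquet load, `predict_batch` replaces real model inference, and in Spark or Dask each chunk would be dispatched to a worker instead of processed serially.

```python
from typing import Iterator

CHUNK_SIZE = 4  # real jobs use far larger chunks (e.g., 100k rows)

def read_rows() -> Iterator[dict]:
    # Stand-in for loading rows from the warehouse.
    for i in range(10):
        yield {"id": i, "value": i * 1.5}

def chunked(rows: Iterator[dict], size: int) -> Iterator[list]:
    # Group the row stream into fixed-size batches.
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def predict_batch(batch: list) -> list:
    # Stand-in for vectorized model inference over one chunk.
    return [{"id": r["id"], "prediction": r["value"] * 2} for r in batch]

def run_job() -> list:
    results = []
    for batch in chunked(read_rows(), CHUNK_SIZE):
        # In a distributed engine, each batch would go to a separate worker node.
        results.extend(predict_batch(batch))
    return results  # a real job would write these back to the database

print(len(run_job()))  # 10
```

A scheduler such as Airflow would invoke `run_job` on a cron schedule; the function itself stays oblivious to when or how often it runs.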
🔴 Level 3: The Hybrid (Request-Response Batch)
Used when you have many requests but don't need instant results.
Streaming Inference
- Tools: Kafka, Flink.
- Workflow:
- Request is pushed to a Kafka topic.
- Model service consumes the topic and performs inference.
- Result is pushed to an "Output" topic.
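The consume-infer-publish loop can be sketched with in-memory queues standing in for Kafka topics. The queues, the `predict` stand-in, and `consume_and_infer` are all illustrative names; a real service would use a Kafka client library with the same consume/produce shape, running the loop continuously.

```python
import queue

# queue.Queue stands in for a Kafka topic in this single-process sketch.
requests_topic = queue.Queue()
output_topic = queue.Queue()

def predict(payload: dict) -> float:
    # Stand-in for model inference.
    return len(payload.get("text", "")) / 10.0

def consume_and_infer():
    # The model service's loop: consume a message, infer, publish the result.
    while not requests_topic.empty():
        msg = requests_topic.get()
        output_topic.put({"request_id": msg["request_id"], "score": predict(msg)})

# Clients push requests instead of blocking on an HTTP response.
requests_topic.put({"request_id": "a1", "text": "hello"})
requests_topic.put({"request_id": "a2", "text": "streaming"})
consume_and_infer()

results = []
while not output_topic.empty():
    results.append(output_topic.get())
print(results[0])  # {'request_id': 'a1', 'score': 0.5}
```

Because results carry a `request_id`, a downstream consumer can correlate each output message with the original request, which is what makes this pattern request-response rather than fire-and-forget.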