Batch Inference Service
Role: The Consumer (Compute Bound / GPU)
Source: services/batch-inference
Responsibilities
This service is the "Brain" of the operation. It runs the YOLO neural network.
1. Dynamic Batching
It doesn't just process what it gets; it waits (up to BATCH_WAIT_MS) to accumulate enough frames to fill a GPU batch.
- Inputs: List of JPEGs from different camera sources.
- Output: Single tensor of shape [N, 3, 640, 640].
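The accumulate-then-flush loop can be sketched as follows. This is a minimal illustration, not the service's actual code: the queue, the `collect_batch` helper, and the `MAX_BATCH` value are all assumptions (only `BATCH_WAIT_MS` comes from the description above).

```python
import queue
import time

BATCH_WAIT_MS = 10   # illustrative default; configurable in the real service
MAX_BATCH = 16       # assumed GPU batch capacity

def collect_batch(frames: queue.Queue, max_batch=MAX_BATCH, wait_ms=BATCH_WAIT_MS):
    """Block for the first frame, then wait up to wait_ms to fill the batch."""
    batch = [frames.get()]  # always have at least one frame
    deadline = time.monotonic() + wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough; run a partial batch
        try:
            batch.append(frames.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The key trade-off is latency vs. throughput: a longer `wait_ms` fills larger batches (better GPU utilization) at the cost of added per-frame latency.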
2. TensorRT Optimization
We do not use raw PyTorch in production. We optimize models to TensorRT Engines (.engine).
- FP16 Precision: halves memory usage (16-bit vs. 32-bit floats) with negligible accuracy loss.
- Layer Fusion: Combines multiple network layers into single kernel operations.
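The 50% memory figure follows directly from element width (2 bytes per FP16 value vs. 4 per FP32). A quick sanity check with NumPy, independent of TensorRT, using the input shape above with an assumed batch size of 8:

```python
import numpy as np

# One batch in the [N, 3, 640, 640] input layout; N=8 is illustrative.
batch_fp32 = np.zeros((8, 3, 640, 640), dtype=np.float32)
batch_fp16 = batch_fp32.astype(np.float16)

# The FP16 buffer is exactly half the size of the FP32 one.
print(batch_fp32.nbytes, batch_fp16.nbytes)
```

TensorRT applies the same halving to the engine's weights and activations; the accuracy impact is model-dependent and should be validated per deployment.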
3. Result Dispatch
After inference, the service splits the batch results back into individual responses and tags each with its original camera_id before pushing it to the results queue.
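The split-and-tag step can be sketched as below. The `dispatch_results` helper and the response dict layout are assumptions for illustration; only the camera_id tagging and the results queue come from the description above.

```python
import queue

def dispatch_results(camera_ids, batch_results, results_queue):
    """Split a batch of per-frame detections back into individual
    responses, each tagged with the camera that produced the frame.
    camera_ids[i] must correspond to batch_results[i] (batch order
    is preserved through inference)."""
    if len(camera_ids) != len(batch_results):
        raise ValueError("batch size mismatch")
    for camera_id, detections in zip(camera_ids, batch_results):
        results_queue.put({"camera_id": camera_id, "detections": detections})
```

Keeping the camera_id list alongside the batch (rather than inside the tensor) is what makes the fan-out trivial: the i-th row of the output tensor belongs to the i-th source.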