Batch Inference Service

Role: The Consumer (Compute Bound / GPU) Source: services/batch-inference

Responsibilities

This service is the "Brain" of the operation. It runs the YOLO neural network.

1. Dynamic Batching

It does not process frames as soon as they arrive; instead, it waits (up to BATCH_WAIT_MS) to accumulate enough frames to fill a GPU batch.

  • Input: a list of JPEG frames from different camera sources.
  • Output: a single tensor of shape [N, 3, 640, 640].
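The batching loop can be sketched as follows. This is a minimal illustration, not the service's actual code; the names `collect_batch`, `BATCH_SIZE`, and the queue-based interface are assumptions (only `BATCH_WAIT_MS` appears in the source).

```python
import queue
import time

BATCH_SIZE = 8      # target frames per GPU batch (assumed value)
BATCH_WAIT_MS = 50  # max wait for a full batch (assumed value)

def collect_batch(frames: "queue.Queue") -> list:
    """Block until BATCH_SIZE frames arrive or BATCH_WAIT_MS elapses."""
    deadline = time.monotonic() + BATCH_WAIT_MS / 1000.0
    batch = []
    while len(batch) < BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline hit: run a partial batch rather than stall
        try:
            batch.append(frames.get(timeout=remaining))
        except queue.Empty:
            break  # queue drained past the deadline
    return batch
```

The key trade-off is latency versus throughput: a partial batch is dispatched once the deadline expires, so a lone frame is never delayed by more than BATCH_WAIT_MS.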

2. TensorRT Optimization

We do not use raw PyTorch in production. We optimize models to TensorRT Engines (.engine).

  • FP16 Quantization: Reduces memory usage by 50% with negligible accuracy loss.
  • Layer Fusion: Combines multiple network layers into single kernel operations.
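The 50% figure follows directly from the storage format: FP16 uses 2 bytes per weight versus 4 for FP32. A quick sanity check of that arithmetic (using a NumPy array as a stand-in for model weights; the real conversion is performed by the TensorRT builder, not NumPy):

```python
import numpy as np

# Dummy weight tensor standing in for model parameters (illustrative only).
weights_fp32 = np.random.rand(1000, 1000).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

# FP16 halves storage: 2 bytes per element instead of 4.
print(weights_fp32.nbytes, weights_fp16.nbytes)  # 4000000 2000000
```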

3. Result Dispatch

After inference, the service splits the batch results back into individual responses and tags each with its original camera_id before pushing it to the results queue.
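The dispatch step can be sketched like this. It is a hedged illustration: the function name, the message shape, and the per-frame metadata format are assumptions, not the service's actual schema; only `camera_id` and the results queue come from the source.

```python
import queue

def dispatch_results(batch_results: list, metadata: list, results_queue: "queue.Queue") -> None:
    """Split batched inference output into per-camera messages (sketch).

    batch_results[i] holds the detections for the i-th frame in the batch;
    metadata[i] carries that frame's original camera_id.
    """
    for detections, meta in zip(batch_results, metadata):
        results_queue.put({
            "camera_id": meta["camera_id"],  # restore the frame's origin
            "detections": detections,
        })
```

Keeping the metadata list in batch order is what makes the split trivial: the i-th result always corresponds to the i-th accumulated frame.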