Measure end-to-end response time vs inference time

On the model metrics tab, you can now use the dropdown menu to toggle between two views of model latency:

  • End-to-end response time includes time for cold starts, queuing, and inference (but not client-side latency). This most closely mirrors the performance of your model as experienced by your users.

  • Inference time covers only the time spent running the model, including pre- and post-processing. This is useful for optimizing the performance of your model code at the single-replica level (a rough client-side timing sketch follows below).
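
If you want to sanity-check the dashboard numbers from the outside, a simple stopwatch around the request gives a rough client-side measurement. The sketch below is a minimal, hedged example: the endpoint URL and payload are placeholders, and because it is timed from the client it also includes network latency, which the end-to-end response time metric excludes, so expect it to read slightly higher than the metrics view.

```python
import time
import requests  # any HTTP client works; requests is used here for brevity

# Placeholder endpoint and payload -- substitute your model's actual URL and input schema.
MODEL_URL = "https://example.com/v1/models/my-model/predict"


def time_request(payload: dict) -> float:
    """Return the client-observed response time for one request, in seconds."""
    start = time.perf_counter()
    response = requests.post(MODEL_URL, json=payload, timeout=300)
    response.raise_for_status()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = time_request({"inputs": "hello"})
    # Includes cold start, queuing, inference, and client-side network latency.
    print(f"Client-observed response time: {elapsed:.3f}s")
```

Comparing this number against the two dashboard views can help attribute latency: the gap between the client-observed time and end-to-end response time is roughly network overhead, while the gap between end-to-end response time and inference time is roughly cold starts plus queuing.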