
Speeding up inference with parallel model runs

Rafael Iriya
Published in Analytics Vidhya · 4 min read · Dec 16, 2020


Photo by Nathan Dumlao on Unsplash

When deploying a real-world application, accuracy is not everything for a deep learning model. Many edge applications require processing video frames as fast as possible, without losing any frames. There is an inherent trade-off between the accuracy of state-of-the-art models and their inference time, as bigger models are typically more accurate but slower. In this context, how can we make use of the latest models despite their high inference times?

Latency and Throughput

First of all, we need to distinguish between latency and throughput. Latency is the amount of time needed to turn an input into an output, for example, converting an input image into detected bounding boxes; it is therefore measured in units of time. Throughput, on the other hand, represents how often a new frame can be processed, and is measured in frames per second (fps); if the throughput is lower than the rate of incoming frames, some frames will be missed. Depending on the application, it may be important to prioritize one over the other.

Making Latency and Throughput independent

In a sequential pipeline, throughput is simply the inverse of latency (1 / latency), and this is the usual first choice when designing a system. However, it is possible to make them independent by using a multiprocess or multithreaded design. One solution would be to queue all incoming frames and process them one by one in a separate process, as sketched below. However, if the latency is longer than the interval between incoming frames, the end-to-end delay will grow over time, since each new frame adds to the backlog. This may be acceptable when the number of frames is limited and not losing frames matters more than latency, but it is prohibitive in many scenarios with a constant stream of incoming frames.
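As a rough illustration of that queue-plus-single-worker setup, here is a minimal sketch using Python's multiprocessing module; the infer stub and the frame source are placeholders, and the 30 ms / 50 ms numbers are the ones used later in this post:

```python
import time
from multiprocessing import Process, Queue

FRAME_INTERVAL = 0.030  # a new frame arrives every 30 ms
MODEL_LATENCY = 0.050   # ~50 ms to process one frame

def infer(frame):
    # Stand-in for the real model call (e.g. a detector producing boxes).
    time.sleep(MODEL_LATENCY)
    return f"boxes for {frame}"

def worker(queue):
    # Single consumer: pulls frames one by one and runs inference.
    while True:
        frame = queue.get()
        if frame is None:   # sentinel: no more frames
            break
        infer(frame)

if __name__ == "__main__":
    queue = Queue()
    p = Process(target=worker, args=(queue,))
    p.start()

    # Frames arrive every 30 ms but take 50 ms each to process,
    # so the queue (and the end-to-end delay) keeps growing.
    for i in range(100):
        queue.put(f"frame-{i}")
        time.sleep(FRAME_INTERVAL)

    queue.put(None)
    p.join()
```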

Multiprocessing the frame queue

A solution to this problem is to have more than one process consuming the incoming frame queue. If a new frame arrives every 30 ms and the latency of processing a single frame is 50 ms, two processes working on the queue in parallel are enough to keep up without accumulating delay: each process receives a new frame only every 60 ms, which gives it 60 ms per frame and makes the 50 ms latency adequate (assuming 50 ms of latency is acceptable for the system in the first place). Here is a block diagram of the scheme:
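In code, the scheme boils down to spawning enough worker processes to keep up with the frame rate. A minimal sketch follows, where infer is again a stand-in for the real model and the numbers match the example above:

```python
import math
import time
from multiprocessing import Process, Queue

FRAME_INTERVAL = 0.030  # new frame every 30 ms
MODEL_LATENCY = 0.050   # ~50 ms per inference

# Enough workers so each one gets a new frame no faster than it can
# finish the previous one: ceil(50 / 30) = 2 processes.
NUM_WORKERS = math.ceil(MODEL_LATENCY / FRAME_INTERVAL)

def infer(frame):
    time.sleep(MODEL_LATENCY)  # stand-in for the real model call
    return f"boxes for {frame}"

def worker(queue):
    while True:
        frame = queue.get()
        if frame is None:       # sentinel: stop this worker
            break
        infer(frame)

if __name__ == "__main__":
    queue = Queue()
    workers = [Process(target=worker, args=(queue,)) for _ in range(NUM_WORKERS)]
    for p in workers:
        p.start()

    for i in range(100):
        queue.put(f"frame-{i}")
        time.sleep(FRAME_INTERVAL)

    for _ in workers:
        queue.put(None)
    for p in workers:
        p.join()
```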

The problem:

Here are some results using this solution for an object detection network running on a Tesla V100 GPU with ~50 ms latency:

Using 1 process:

Using 2 processes:

Using 3 processes:

Here, start and end are the starting and ending times of each inference (in seconds), while diff is the latency. As the results show, this is not really working: the single-process latency roughly doubles with 2 processes and triples with 3 processes. Why does this happen?
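(For reference, per-inference timings like these can be collected with a thin wrapper around the model call; a minimal sketch, where model and frame are placeholders rather than the actual benchmark code:)

```python
import time

def timed_inference(model, frame):
    # Record wall-clock start and end around one inference call;
    # diff is the latency reported in the results above.
    start = time.perf_counter()
    output = model(frame)  # placeholder for the real detector call
    end = time.perf_counter()
    print(f"start={start:.3f} end={end:.3f} diff={end - start:.3f}")
    return output
```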

After some research, I found out that this is because of the way GPUs are designed. A GPU is divided into processing cores called Streaming Multiprocessors (SMs), which are designed to be fully occupied whenever possible. This means that a single layer of a network can occupy all the SMs at once, blocking any other computation until that layer finishes. Thus, two models running in parallel on the same GPU compete for the same resources and are very likely to start and finish together, effectively taking turns using them.

Solution

With all of this said, the solution is to add more devices: if each process has its own GPU, they will not compete for resources. Here are the results for 2 processes running on separate GPUs:
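A sketch of how this can look, using PyTorch purely as an illustrative framework (the post does not say which one it uses) and giving each worker process its own device index:

```python
import torch
import torch.nn as nn
from multiprocessing import Process, Queue, set_start_method

def load_model():
    # Stand-in for loading the real object detector.
    return nn.Conv2d(3, 16, kernel_size=3, padding=1)

def worker(queue, gpu_id):
    # Each process keeps its own copy of the model on its own GPU,
    # so the processes never compete for the same SMs.
    device = torch.device(f"cuda:{gpu_id}")
    model = load_model().to(device).eval()
    with torch.no_grad():
        while True:
            frame = queue.get()
            if frame is None:   # sentinel: stop this worker
                break
            model(frame.to(device))

if __name__ == "__main__":
    set_start_method("spawn")   # safer start method when CUDA is involved
    queue = Queue()
    workers = [Process(target=worker, args=(queue, gpu_id)) for gpu_id in (0, 1)]
    for p in workers:
        p.start()

    for _ in range(100):
        queue.put(torch.rand(1, 3, 480, 640))  # dummy CPU frames
    for _ in workers:
        queue.put(None)
    for p in workers:
        p.join()
```

The same effect can be achieved by launching each process with a different CUDA_VISIBLE_DEVICES value, which works regardless of framework.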

This clearly shows that the processes are no longer competing for resources, and each process recovers the original latency.

Although this may be a financially expensive solution, it does not compromise the accuracy of the system, unlike switching to a smaller model or applying model compression such as quantization or pruning. Also, if throughput matters more than latency, several less powerful GPUs may be preferable to a single fast one.
