Building a 500+ FPS Accelerated Background Removal Pipeline in Python with Savant and OpenCV CUDA MOG2

Ivan Kud
Published in Inside In-Sight
Apr 10, 2023

UPD 2023.05.25: The article is updated to use Savant 0.2.2.

Background removal is a frequent operation in computer vision and video analytics, used in various scenarios, such as a cut-off stage that optimizes inference performance. Being an auxiliary function, it must be cheap and very fast.

In this article, we will explore a background removal demo built with the CUDA-accelerated MOG2 background segmentation algorithm and the Savant Video Analytics Framework. The demo is a high-performance pipeline delivering single-stream processing at 570 FPS on an NVIDIA Quadro RTX4000 GPU and 75 FPS on a Jetson NX, working with HD-quality video. The result is a 2560x720 mosaic video displaying the original video and the outcome side by side, as may be seen in the following video:

For the sake of simplicity and demonstrability, we don't scale down the initial video frames before processing, although doing so may significantly improve both the performance and the quality of the pipeline.

The source code of the demo can be found on Savant’s GitHub.

About The Tools Used

Savant is a new high-level Python-based video analytics framework built on top of Nvidia DeepStream. It focuses on building production-ready pipelines for Nvidia edge and data-center hardware. Savant wraps the complex GStreamer/DeepStream internals, providing the developer with a convenient YAML-based pipeline configuration in which the processing pipeline is constructed from ready-to-use and custom Python blocks.

In addition, Savant delivers all the machinery needed to communicate with the external world: extendable source/sink adapters, dockerized deployment, and an out-of-the-box scalability model; read more about Savant on the website. To get acquainted with Savant, investigate the getting started tutorial.

OpenCV CUDA is a hardware-accelerated extension to the OpenCV library that runs CUDA-accelerated algorithms on images kept in GPU memory, without downloading them to CPU memory. It fits Savant's processing model well because the DeepStream SDK, the foundation of Savant, also processes frames within the GPU. As a result, it helps dramatically increase image processing speed. OpenCV CUDA supports a relatively limited set of OpenCV algorithms and operations, but those supported are highly efficient. Learn more about OpenCV CUDA on the website.
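As a quick illustration of that model, here is a minimal standalone sketch (assuming an OpenCV build with CUDA support and a local input.jpg): the image is uploaded to the GPU once, processed entirely there, and downloaded once at the end.

import cv2

cpu_img = cv2.imread('input.jpg')              # 8-bit BGR image in CPU memory
gpu_img = cv2.cuda_GpuMat()
gpu_img.upload(cpu_img)                        # one host-to-device copy

# all processing below happens in GPU memory
gpu_gray = cv2.cuda.cvtColor(gpu_img, cv2.COLOR_BGR2GRAY)
gaussian = cv2.cuda.createGaussianFilter(cv2.CV_8UC1, cv2.CV_8UC1, (9, 9), 2)
gpu_blurred = gaussian.apply(gpu_gray)

result = gpu_blurred.download()                # one device-to-host copy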

The Background Subtraction Pipeline

In this article, we will develop a very simple background subtraction pipeline that looks as follows:

Savant is extensible with pyfunc blocks: we will implement a short Python class of about 50 lines that accesses the original image on the left side of the frame, blurs the background to reduce flicker, runs the CUDA-accelerated MOG2 background segmentation algorithm, and draws the resulting image on the right side of the frame.

The pipeline is declared with a typical Savant YAML manifest. Let us begin with that manifest to understand what the pipeline delivers:

# module name, required
name: ${oc.env:MODULE_NAME, 'demo'}

# base module parameters
parameters:
  # pipeline processing frame parameters
  frame:
    width: 1280
    height: 720
    # Add paddings to the frame before processing
    padding:
      # Paddings are kept on the output frame
      keep: true
      left: 0
      right: 1280
      top: 0
      bottom: 0
  output_frame:
    # Frame is output without any encoding;
    # this is to circumvent the 3 hardware encoding processes limit on NVIDIA consumer hardware
    codec: raw-rgba
  batch_size: 1
  # PyFunc for drawing on frames
  draw_func: {}


# pipeline definition
pipeline:
  # source definition is skipped, zeromq source is used by default to connect with source adapters

  # define pipeline's main elements
  elements:
    - element: pyfunc
      # specify the pyfunc's python module
      module: samples.opencv_cuda_bg_remover_mog2.bgremover
      # specify the pyfunc's python class from the module
      class_name: BgRemover
  # sink definition is skipped, zeromq sink is used by default to connect with sink adapters

Let us clarify the interesting parameters. The pipeline works at a 1280x720 resolution, which means every input video stream will be scaled to that resolution. We also want an unused canvas of the same size on the right, so we configure padding:

    padding:
      # Paddings are kept on the output frame
      keep: true
      left: 0
      right: 1280
      top: 0
      bottom: 0

This configuration adds a black area of the same size on the right, extending the video frame from 1280x720 to 2560x720. We also specify the output frame format and the draw function:

  output_frame:
    # Frame is output without any encoding;
    # this is to circumvent the 3 hardware encoding processes limit on NVIDIA consumer hardware
    codec: raw-rgba
  batch_size: 1
  # PyFunc for drawing on frames
  draw_func: {}

We have specified the codec as raw-rgba, but in production you would use h264 or h265. However, GeForce-series GPUs limit the number of simultaneously encoded streams to 3, so we avoid encoding in our examples when possible to be safe.
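For reference, a hypothetical production variant of this section could enable hardware encoding instead; the codec value below is an assumption following the same naming convention as raw-rgba in this manifest:

  output_frame:
    # encode the output stream in production; mind the NVENC session limit
    # on consumer GPUs ('h264' is an assumed value, not from the demo)
    codec: h264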

We also set batch_size to 1. The parameter is beneficial when processing multiple video streams with DNN-based inference pipelines, but it does not influence performance in our example.

As we don't draw on the frame with Savant's artist, the element is disabled.

The pipeline section is also trivial: we implement only a single pyfunc, which handles the frames with the background-removal code:

# pipeline definition
pipeline:
  # source definition is skipped, zeromq source is used by default to connect with source adapters

  # define pipeline's main elements
  elements:
    - element: pyfunc
      # specify the pyfunc's python module
      module: samples.opencv_cuda_bg_remover_mog2.bgremover
      # specify the pyfunc's python class from the module
      class_name: BgRemover
  # sink definition is skipped, zeromq sink is used by default to connect with sink adapters

You may see that the pyfunc is named BgRemover and can be found in the bgremover module.

Pyfuncs have access to the processed frames and the meta-information accumulated by the pipeline up to the point of the call. In our case, the BgRemover pyfunc removes the background. Let us investigate its code:

"""Background remover module."""
from savant.gstreamer import Gst
from savant.deepstream.meta.frame import NvDsFrameMeta
from savant.deepstream.pyfunc import NvDsPyFuncPlugin
from savant.utils.artist import Artist
from savant.deepstream.opencv_utils import (
nvds_to_gpu_mat,
)
import cv2

class BgRemover(NvDsPyFuncPlugin):
"""Background remover pyfunc.
The class is designed to process video frame metadata and remove the background from the frame.
MOG2 method from openCV is used to remove background.
"""

def __init__(self, **kwargs):
super().__init__(**kwargs)
self.stream = cv2.cuda.Stream_Null()
self.back_subtractors = {}

self.gaussian_filter = cv2.cuda.createGaussianFilter(
cv2.CV_8UC4, cv2.CV_8UC4, (9, 9), 2
)


def on_source_eos(self, source_id: str):
"""On source EOS event callback."""
if source_id is self.back_subtractors:
self.back_subtractors.pop(source_id)

def process_frame(self, buffer: Gst.Buffer, frame_meta: NvDsFrameMeta):
"""Process frame metadata.
:param buffer: Gstreamer buffer with this frame's data.
:param frame_meta: This frame's metadata.
"""
with nvds_to_gpu_mat(buffer, frame_meta.frame_meta) as frame_mat:
with Artist(frame_mat) as artist:
if frame_meta.source_id in self.back_subtractors:
back_sub = self.back_subtractors[frame_meta.source_id]
else:
back_sub = cv2.cuda.createBackgroundSubtractorMOG2()
self.back_subtractors[frame_meta.source_id] = back_sub
ref_frame = cv2.cuda_GpuMat(frame_mat, (0, 0, int(frame_meta.roi.width), int(frame_meta.roi.height)))
cropped = ref_frame.clone()
self.gaussian_filter.apply(cropped, cropped, stream=self.stream)
cu_mat_fg = back_sub.apply(cropped, -1, self.stream)
res_image = ref_frame.copyTo(cu_mat_fg, self.stream)
artist.add_graphic(res_image, (int(frame_meta.roi.width), 0))

The pyfunc instance is created upon the pipeline launch and stays allocated for as long as the pipeline is alive.

The Constructor

The constructor defines the dictionary of subtractors and the Gaussian filter. Savant is designed to process multiple streams simultaneously, so to make the code universally applicable, we must maintain a separate subtractor for every stream; that is why we use a dictionary. The constructor also obtains the default CUDA stream on which the GPU operations are queued.
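The demo creates each subtractor with default MOG2 settings; if needed, the standard OpenCV CUDA constructor parameters (history length, variance threshold, shadow detection) can be tuned. A minimal sketch, with illustrative values rather than the demo's settings:

import cv2

# hypothetical tuned subtractor; the demo itself relies on OpenCV defaults
back_sub = cv2.cuda.createBackgroundSubtractorMOG2(
    history=500,         # number of frames used to model the background
    varThreshold=16,     # squared Mahalanobis distance threshold
    detectShadows=False  # skip shadow detection to save GPU time
)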

The EOS Handler

Sometimes streams terminate, sending the EOS signal. We use this event to remove the subtractor of the terminated stream. If the stream returns, a subtractor will be created for it again on demand (see below).

The Frame Processor Method

The method is invoked for every frame of every processed stream. Within the method, we get access to the DeepStream frame as an OpenCV GpuMat with the utility function:

with nvds_to_gpu_mat(buffer, frame_meta.frame_meta) as frame_mat:

Finally, we run the Gaussian filter and the background segmentation, and apply the result to the right area of the frame with the artist object provided by Savant:

ref_frame = cv2.cuda_GpuMat(
    frame_mat, (0, 0, int(frame_meta.roi.width), int(frame_meta.roi.height))
)
cropped = ref_frame.clone()
self.gaussian_filter.apply(cropped, cropped, stream=self.stream)
cu_mat_fg = back_sub.apply(cropped, -1, self.stream)
res_image = ref_frame.copyTo(cu_mat_fg, self.stream)
artist.add_graphic(res_image, (int(frame_meta.roi.width), 0))
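To see what back_sub.apply and the masked copyTo produce outside of Savant, here is a minimal standalone sketch on synthetic RGBA frames (assuming an OpenCV build with CUDA; the moving white square stands in for real foreground motion):

import cv2
import numpy as np

back_sub = cv2.cuda.createBackgroundSubtractorMOG2()
stream = cv2.cuda.Stream_Null()

for i in range(30):
    # static black background with a moving white square as "foreground"
    frame = np.zeros((720, 1280, 4), dtype=np.uint8)
    frame[100:200, 10 * i:10 * i + 100] = 255

    gpu_frame = cv2.cuda_GpuMat()
    gpu_frame.upload(frame)
    # 8-bit mask: non-zero where MOG2 classifies pixels as foreground
    fg_mask = back_sub.apply(gpu_frame, -1, stream)
    # copy only the masked (foreground) pixels, as in the pyfunc above
    result = gpu_frame.copyTo(fg_mask, stream)

print(result.download().shape)  # (720, 1280, 4)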

Running The Code

The demo can be run on a Jetson-based edge device or on a PC-based platform with a discrete GPU. Make sure your runtime is configured properly: we have created a small guide that shows how to configure an Ubuntu 22.04 runtime.

Requirements for the x86-based environment: Nvidia dGPU (Volta, Turing, Ampere, Ada Lovelace), Linux OS with driver 525+, Docker with Compose plugin installed and configured with Nvidia Container Runtime.

Requirements for the Jetson-based environment: Nvidia Jetson (NX/AGX, Orin NX/Nano/AGX) with JetPack 5.1+ (the framework does not support first-generation Jetson Nano), Docker with Compose plugin installed and configured with Nvidia Container Runtime.

git clone --depth 1 --branch v0.2.2 https://github.com/insight-platform/Savant.git
cd Savant/samples/opencv_cuda_bg_remover_mog2
git lfs pull

# if you want to share with us where you are from,
# run the following command; it is completely optional
curl --silent -O -- https://hello.savant.video/opencv_cuda_bg_remover_mog2.html

# if x86
../../utils/check-environment-compatible && docker compose -f docker-compose.x86.yml up

# if Jetson
../../utils/check-environment-compatible && docker compose -f docker-compose.l4t.yml up

# open 'rtsp://127.0.0.1:554/stream' in your player
# or visit 'http://127.0.0.1:888/stream/' (LL-HLS)

# Ctrl+C to stop running the compose bundle

# to get back to project root
cd ../..

The pipeline uses the Video Loop Source adapter and the Always-On RTSP Sink adapter. Consider reading the article to get acquainted with the adapters and how Savant communicates with the external world.

Performance Measurement

To estimate the peak pipeline performance, download the test video file to your local computer: we will run Savant in a mode that feeds the pipeline as fast as possible.

# download
# you are expected to be in Savant/ directory

mkdir -p data && curl -o data/road_traffic.mp4 \
https://eu-central-1.linodeobjects.com/savant-data/demo/road_traffic.mp4

Now you may run the pipeline with the performance-optimized configuration file and get the FPS specific to your computing environment:

PC:

# you are expected to be in Savant/ directory

docker run --rm -it --gpus=all \
-v $(pwd)/samples:/opt/savant/samples \
-v $(pwd)/data:/data:ro \
ghcr.io/insight-platform/savant-deepstream:0.2.2-6.2 \
samples/opencv_cuda_bg_remover_mog2/demo_performance.yml

Jetson:

docker run --rm -it --gpus=all \
-v $(pwd)/samples:/opt/savant/samples \
-v $(pwd)/data:/data:ro \
ghcr.io/insight-platform/savant-deepstream-l4t:0.2.2-6.2 \
samples/opencv_cuda_bg_remover_mog2/demo_performance.yml

After completing the processing, the module will print the number of processed frames and FPS to the log.

On a workstation with a Core i5-6400 CPU and an Nvidia Quadro RTX4000 GPU, the result is as follows:

2023-04-07 12:24:57,212 [savant.demo] [INFO] Processed 9184 frames, 572.34 FPS.

On an AWS Tesla T4 instance with a 4-core Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz, the pipeline yields:

2023-04-08 14:41:17,924 [savant.demo] [INFO] Processed 9184 frames, 464.28 FPS.

It is not a CPU-intensive pipeline, so GPU performance matters more than CPU performance.

Thus, in real-time stream processing mode, a single Quadro RTX4000 card can process up to 22 cameras at 25 FPS (572 / 25 ≈ 22).

On Jetson NX, you may expect 75 FPS in single-stream mode and up to 130 FPS when launching two pipeline instances.

Conclusion

We have demonstrated how to run the MOG2 background removal algorithm on the GPU with OpenCV CUDA and the Savant framework. In the demo, we used the original resolution of the frame to run MOG2.

In practice, the video is usually downscaled before running the filter, which improves performance without losing significant capability, and may even improve the results by suppressing the influence of small flickering details in the frame.

To implement that, you can use the right side of the frame as follows (see the sketch after this list):

  • Step 1: scale the frame down two-fold (640x360) or four-fold (320x180) with OpenCV CUDA and draw it on the right side of the frame;
  • Step 2: run the background segmentation on the resulting right-side image obtained in Step 1.
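A minimal sketch of the two-fold variant, assuming the same objects as in the pyfunc above (ref_frame, back_sub, stream, and artist); it is an illustration of the idea, not the demo's code:

# Step 1: downscale the original (left) image two-fold on the GPU
small = cv2.cuda.resize(ref_frame, (640, 360), stream=stream)
# Step 2: run segmentation on the downscaled image
fg_small = back_sub.apply(small, -1, stream)
res_small = small.copyTo(fg_small, stream)
# draw the small result on the right side with the artist, as before
artist.add_graphic(res_small, (1280, 0))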

We appreciate your interest in Savant technology and would happily answer your questions. Join us on GitHub Discussions and Discord.
