Presenting NVVL (NVIDIA Video Loader): an open source library that uses hardware acceleration to load and augment video frames for training

April 3, 2018, 9:57 a.m. By: Kirti Bakshi

NVVL (short for NVIDIA Video Loader) is a library that loads random sequences of video frames from compressed video files to facilitate machine learning training. It uses FFmpeg's libraries to parse and read the compressed packets from video files, and the video decoding hardware available on NVIDIA GPUs to off-load and accelerate the decoding of those packets, providing a ready-for-training tensor in GPU device memory.

In addition, NVVL can perform data augmentation while loading the frames. Frames can be scaled, cropped, and flipped horizontally using the GPU's dedicated texture mapping units. Output can be in RGB or YCbCr colour space, normalized to [0, 1] or [0, 255], and in float, half, or uint8 tensors.
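To make the meaning of these options concrete, here is a minimal CPU-side sketch of crop, horizontal flip, and normalization on a single frame. It is only an illustration of the semantics, not NVVL's code: NVVL performs these steps on the GPU, scaling is left out here, and the function name `augment_frame` and its parameters are hypothetical.

```python
def augment_frame(frame, crop_x, crop_y, crop_w, crop_h,
                  flip_horizontal=False, normalized=True):
    """Crop, optionally flip, and optionally normalize one frame.

    `frame` is a list of rows; each row is a list of (r, g, b) tuples
    holding uint8 values in 0..255. Hypothetical helper for illustration.
    """
    # Crop: keep crop_h rows starting at crop_y, crop_w pixels starting at crop_x.
    rows = frame[crop_y:crop_y + crop_h]
    cropped = [row[crop_x:crop_x + crop_w] for row in rows]

    # Horizontal flip: reverse the pixel order inside every row.
    if flip_horizontal:
        cropped = [row[::-1] for row in cropped]

    # Normalization: map 0..255 to 0..1, or keep the 0..255 range as floats.
    divisor = 255.0 if normalized else 1.0
    return [[tuple(c / divisor for c in px) for px in row] for row in cropped]


frame = [
    [(0, 0, 0), (255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255), (10, 20, 30)],
]
print(augment_frame(frame, 1, 0, 2, 1, flip_horizontal=True))
```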

Using compressed video files instead of individual frame image files significantly reduces the demands on the storage and I/O systems during training. Storing video datasets as video files consumes an order of magnitude less disk space, allowing larger datasets to fit both in system RAM and on local SSDs for fast access. Fewer bytes must also be read from disk during loading. Fitting on smaller, faster storage and reading fewer bytes at load time alleviates the bottleneck of retrieving data from disk, which will only get worse as GPUs get faster.

Using the hardware decoder on NVIDIA GPUs to decode images also significantly reduces the demands on the host CPU, so fewer CPU cores need to be dedicated to data loading during training. This is especially important in servers with a large number of GPUs per CPU, such as the NVIDIA DGX-2 server, but it also provides benefits on other platforms.

Most users will want to use the provided deep learning framework wrappers rather than using the library directly. Currently, a wrapper for PyTorch is provided, and PRs for other frameworks are welcome.

In short, NVVL uses the GPU hardware for both video decoding and augmentation (scaling, cropping, colour space conversion, etc.). The entire batch of sequences comes out packaged in a tensor that is already sitting in device memory and ready to go, instead of having to be bundled up after decoding and shipped as uncompressed frames across PCIe to the device. The idea is that this reduces the load on the CPUs and uses less PCIe bandwidth than decoding on the CPU, which allows a system to have many more GPUs per CPU.
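A rough back-of-the-envelope calculation shows why shipping compressed packets rather than decoded frames saves PCIe bandwidth. The resolution, frame rate, and the 8 Mbit/s bitrate below are assumptions picked for illustration; they are not measurements from NVVL.

```python
def uncompressed_bytes_per_second(width, height, fps, channels=3, bytes_per_value=1):
    """Bytes per second needed to move decoded frames (e.g. RGB uint8)."""
    return width * height * channels * bytes_per_value * fps


def compressed_bytes_per_second(bitrate_mbit_per_s):
    """Bytes per second needed to move the compressed stream."""
    return bitrate_mbit_per_s * 1_000_000 / 8


# Assumed example: one 1080p stream at 30 frames per second.
raw = uncompressed_bytes_per_second(1920, 1080, 30)
# Assumed example bitrate for the compressed video: 8 Mbit/s.
packed = compressed_bytes_per_second(8)

print(f"decoded frames:     {raw / 1e6:.1f} MB/s")
print(f"compressed packets: {packed / 1e6:.1f} MB/s")
print(f"ratio:              {raw / packed:.0f}x")
```

With these assumed numbers, decoding on the CPU would push roughly 187 MB/s per stream across PCIe, against about 1 MB/s when only the compressed packets are sent and the GPU does the decoding. The real ratio depends entirely on the content and encoder settings.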

Building and Installing:

NVVL depends on the following:

  • CUDA Toolkit. Versions 8.0 and later have been tested, but earlier versions may work. NVVL will perform better with CUDA 9.0 or later.

  • FFmpeg's libavformat, libavcodec, libavfilter, and libavutil.

Additionally, building from source requires CMake version 3.8 or above, which provides the built-in CUDA language support used by NVVL's build system. Since CMake 3.8 is relatively new and not yet in widely used Linux distributions, it may be necessary to install a newer version of CMake.
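Before building, it can help to check whether the installed CMake meets the 3.8 requirement. The sketch below parses the first line that `cmake --version` prints (of the form `cmake version 3.10.2`); the helper names are hypothetical and not part of NVVL.

```python
import re

# Minimum CMake version stated above.
MIN_CMAKE = (3, 8)


def parse_cmake_version(version_output):
    """Extract (major, minor) from the text printed by `cmake --version`."""
    match = re.search(r"cmake version (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find a CMake version in the given text")
    return int(match.group(1)), int(match.group(2))


def cmake_is_new_enough(version_output, minimum=MIN_CMAKE):
    """Compare as integer tuples so that 3.10 correctly counts as newer than 3.8."""
    return parse_cmake_version(version_output) >= minimum


print(cmake_is_new_enough("cmake version 3.10.2"))
print(cmake_is_new_enough("cmake version 3.5.1"))
```

To check a real installation, the text can be obtained with `subprocess.run(["cmake", "--version"], capture_output=True, text=True).stdout` and passed to `cmake_is_new_enough`.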

Preparing Data:

NVVL supports the H.264 and HEVC (H.265) video codecs in any container format that FFmpeg is able to parse. Video codecs store only certain frames, called keyframes or intra-frames, as complete images in the data stream. All other frames require data from other frames, either before or after them in time, to be decoded.
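One way to check whether a file uses one of these codecs is to ask FFmpeg's `ffprobe` tool for the codec name of the first video stream. The sketch below assumes `ffprobe` is on the PATH and that it reports H.264 as `h264` and HEVC as `hevc`; the function names and the file name in the usage note are hypothetical.

```python
import subprocess

# Codec names as reported by ffprobe for H.264 and HEVC (H.265).
SUPPORTED_CODECS = {"h264", "hevc"}


def is_supported_codec(codec_name):
    """Return True if the ffprobe codec name is one of the two listed above."""
    return codec_name.strip().lower() in SUPPORTED_CODECS


def probe_video_codec(path):
    """Ask ffprobe for the codec name of the first video stream in `path`."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "stream=codec_name",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

For example, `is_supported_codec(probe_video_codec("clip.mp4"))` would tell you whether a file named `clip.mp4` needs to be re-encoded first.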

In order to decode a sequence of frames, it is necessary to start decoding at the keyframe before the sequence and continue past the sequence to the next keyframe after it. This isn't a problem when you stream through a video sequentially; however, when decoding small sequences of frames at random positions throughout the video, a large gap between keyframes results in reading and decoding a large number of frames that are never used.
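The cost of sparse keyframes can be estimated with a simplified model of the rule described above: decoding covers everything from the keyframe at or before the first wanted frame up to the next keyframe at or after the end of the sequence. The model assumes keyframes fall at exact multiples of a fixed interval, which real encoders do not guarantee, and the function name is hypothetical.

```python
def frames_decoded(start, seq_len, keyframe_interval):
    """Frames touched when reading `seq_len` frames beginning at frame `start`.

    Simplified model: keyframes sit at frames 0, k, 2k, ... where k is
    `keyframe_interval`.
    """
    end = start + seq_len  # index one past the last wanted frame
    # Keyframe at or before the first wanted frame.
    first_key = (start // keyframe_interval) * keyframe_interval
    # First keyframe at or after the end of the sequence (ceiling division).
    next_key = -(-end // keyframe_interval) * keyframe_interval
    return next_key - first_key


# A 16-frame sequence starting at frame 100, for three keyframe intervals.
for interval in (250, 32, 16):
    decoded = frames_decoded(100, 16, interval)
    print(f"interval {interval:3d}: {decoded} frames decoded for 16 wanted")
```

Under this model, a keyframe every 250 frames means touching 250 frames to obtain 16, while a keyframe every 16 frames brings that down to 32 (or exactly 16 when the sequence happens to start on a keyframe).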

Thus, to get good performance when randomly reading short sequences from a video file, it is necessary to encode the file with frequent keyframes. Setting the keyframe interval to the length of the sequences being read has been found to provide a good compromise between file size and loading performance.
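As a sketch of how such a file might be prepared, the helper below builds an `ffmpeg` command line that re-encodes the first video stream to H.264 with a keyframe interval (`-g`) equal to the sequence length. The choice of the libx264 encoder, the quality setting (`-crf 18`), the pixel format, and the file names are assumptions for illustration; adjust them to your own data.

```python
def keyframe_encode_command(src, dst, keyframe_interval, crf=18):
    """Build an ffmpeg argument list that re-encodes `src` with frequent keyframes."""
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be at least 1")
    return [
        "ffmpeg",
        "-i", src,                      # input file
        "-map", "v:0",                  # keep only the first video stream
        "-c:v", "libx264",              # encode as H.264 (assumed encoder choice)
        "-crf", str(crf),               # quality setting (assumed value)
        "-pix_fmt", "yuv420p",          # 4:2:0 chroma subsampling
        "-g", str(keyframe_interval),   # at most this many frames between keyframes
        dst,                            # output file
    ]


# Example: prepare a file for reading 16-frame sequences.
print(" ".join(keyframe_encode_command("original.mp4", "prepared.mp4", 16)))
```

The returned list can be passed to `subprocess.run(..., check=True)` to perform the encode. Shorter keyframe intervals produce larger files, which is the trade-off mentioned above.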

For more information, see the link below:

Source and more information: GitHub