Alpha Blending using OpenCV for videos


Problem Description


I want to blend a video on top of another one using an alpha video. This is my code. It works perfectly, but the problem is that this code isn't efficient at all, and that's because of the /255 parts. It is slow and has a lagging problem.

Is there a standard and efficient way of doing this? I want the results to be real-time. Thanks.

import cv2
import numpy as np

def main():
    foreground = cv2.VideoCapture('circle.mp4')
    background = cv2.VideoCapture('video.MP4')
    alpha = cv2.VideoCapture('circle_alpha.mp4')

    while foreground.isOpened():
        fr_foreground = foreground.read()[1]/255
        fr_background = background.read()[1]/255     
        fr_alpha = alpha.read()[1]/255

        cv2.imshow('My Image',cmb(fr_foreground,fr_background,fr_alpha))

        if cv2.waitKey(1) == ord('q'): break

    cv2.destroyAllWindows

def cmb(fg,bg,a):
    return fg * a + bg * (1-a)

if __name__ == '__main__':
    main()

Solution

Let's get a few obvious problems out of the way first - foreground.isOpened() will return true even after you have reached the end of the video, so your program will end up crashing at that point. The solution is twofold. First of all, test all 3 VideoCapture instances right after you create them, using something like:

if not foreground.isOpened() or not background.isOpened() or not alpha.isOpened():
    print "Unable to open input videos."
    return

That will make sure that all of them opened properly. The next part is to correctly handle reaching the end of the video. That means either checking the first of the two return values of read(), which is a boolean flag representing success, or testing whether the frame is None.

while True:
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video

Additionally, it seems you don't actually call cv2.destroyAllWindows() -- the () is missing. Not that this really matters.


To help investigate and optimize this, I've added some detailed timing, using the timeit module and a couple of convenience functions:

from timeit import default_timer as timer

def update_times(times, total_times):
    for i in range(len(times) - 1):
        total_times[i] += (times[i+1]-times[i]) * 1000

def print_times(total_times, n):
    print "Iterations: %d" % n
    for i in range(len(total_times)):
        print "Step %d: %0.4f ms" % (i, total_times[i] / n)
    print "Total: %0.4f ms" % (np.sum(total_times) / n)

and modified the main() function to measure the time taken by each logical step -- read, scale, blend, show, waitKey. To do this I split the division into separate statements. I also made a slight modification that makes this work in Python 2.x as well (/255 is interpreted as integer division and yields wrong results).

times = [0.0] * 6
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
    times[1] = timer()
    fr_foreground = fr_foreground / 255.0
    fr_background = fr_background / 255.0
    fr_alpha = fr_alpha / 255.0
    times[2] = timer()
    result = cmb(fr_foreground,fr_background,fr_alpha)
    times[3] = timer()
    cv2.imshow('My Image', result)
    times[4] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[5] = timer()
    update_times(times, total_times)
    n += 1

print_times(total_times, n)

When I run this with 1280x800 mp4 videos as input, I notice that it really is rather sluggish, and that it only uses 15% CPU on my 6 core machine. The timing of the sections is as follows:

Iterations: 1190
Step 0: 11.4385 ms
Step 1: 37.1320 ms
Step 2: 39.4083 ms
Step 3: 2.5488 ms
Step 4: 10.7083 ms
Total: 101.2358 ms

This suggests that the biggest bottlenecks are both the scaling step and the blending step. The low CPU usage is also suboptimal, but let's focus on the low-hanging fruit first.


Let's look at the data types of the numpy arrays we use. read() gives us arrays with a dtype of np.uint8 -- 8-bit unsigned integers. However, the floating point division (as written) will yield an array with a dtype of np.float64 -- 64-bit floating point values. We don't really need this level of precision for our algorithm, so we'd be better off using only 32-bit floats -- it would mean that if any of the operations are vectorized, we can potentially do twice as many calculations in the same amount of time.
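As a quick sanity check (my own addition, not part of the original answer), you can print the dtypes involved, assuming fr_foreground holds a frame read as above:

print(fr_foreground.dtype)                        # uint8
print((fr_foreground / 255.0).dtype)              # float64 -- promoted by the Python float
print((fr_foreground / np.float32(255.0)).dtype)  # float32 -- stays at single precision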

There are two options here. We could simply cast the divisor to np.float32, which will cause numpy to give us the result with the same dtype:

fr_foreground = fr_foreground / np.float32(255.0)
fr_background = fr_background / np.float32(255.0)
fr_alpha = fr_alpha / np.float32(255.0)

Which gives us the following timings:

Iterations: 1786
Step 0: 9.2550 ms
Step 1: 19.0144 ms
Step 2: 21.2120 ms
Step 3: 1.4662 ms
Step 4: 10.8889 ms
Total: 61.8365 ms

Or we could cast the array to np.float32 first, and then do the scaling in-place.

fr_foreground = np.float32(fr_foreground)
fr_background = np.float32(fr_background)
fr_alpha = np.float32(fr_alpha)

fr_foreground /= 255.0
fr_background /= 255.0
fr_alpha /= 255.0

Which gives the following timings (splitting step 1 into conversion (1) and scaling (2) -- rest shifts by 1):

Iterations: 1786
Step 0: 9.0589 ms
Step 1: 13.9614 ms
Step 2: 4.5960 ms
Step 3: 20.9279 ms
Step 4: 1.4631 ms
Step 5: 10.4396 ms
Total: 60.4469 ms

Both are roughly equivalent, running at ~60% of the original time. I'll stick with the second option, since it will become useful in the later steps. Let's see what else we can improve.


From the previous timings, we can see that the scaling is no longer the bottleneck, but an idea still comes to mind -- division is generally slower than multiplication, so what if we multiplied by a reciprocal?

fr_foreground *= 1/255.0
fr_background *= 1/255.0
fr_alpha *= 1/255.0

Indeed this does gain us a millisecond -- nothing spectacular, but it was easy, so might as well go with it:

Iterations: 1786
Step 0: 9.1843 ms
Step 1: 14.2349 ms
Step 2: 3.5752 ms
Step 3: 21.0545 ms
Step 4: 1.4692 ms
Step 5: 10.6917 ms
Total: 60.2097 ms


Now the blending function is the biggest bottleneck, followed by the typecast of all 3 arrays. If we look at what the blending operation does:

foreground * alpha + background * (1.0 - alpha)

we can observe that for the math to work, the only value that needs to be in the range [0.0, 1.0] is alpha.
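For example, with uint8 values fg = 200 and bg = 100 and a float alpha of a = 0.25, the blend 200 * 0.25 + 100 * 0.75 = 125.0 already lies within [0, 255] -- the formula is a convex combination, so fg and bg never need to be scaled down to [0, 1] themselves.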

What if we scaled only the alpha image? Also, since multiplication by a floating point value promotes the result to floating point, what if we also skipped the type conversion? That would mean cmb() would have to return an np.uint8 array:

def cmb(fg,bg,a):
    return np.uint8(fg * a + bg * (1-a))

and we would have

    #fr_foreground = np.float32(fr_foreground)
    #fr_background = np.float32(fr_background)
    fr_alpha = np.float32(fr_alpha)

    #fr_foreground *= 1/255.0
    #fr_background *= 1/255.0
    fr_alpha *= 1/255.0

The times for this are

Step 0: 7.7023 ms
Step 1: 4.6758 ms
Step 2: 1.1061 ms
Step 3: 27.3188 ms
Step 4: 0.4783 ms
Step 5: 9.0027 ms
Total: 50.2840 ms

Obviously, steps 1 and 2 are much faster, since we only do 1/3 of the work. imshow also speeds up, since it doesn't have to convert from floating point. Inexplicably, the reads also got faster (I guess we're avoiding some under-the-hood reallocations, since fr_foreground and fr_background always contain the pristine frame). We do pay the price of an additional cast in cmb(), but overall this seems like a win -- we're at 50% of the original time.


To continue, let's get rid of the cmb() function, move its functionality to main() and split it up to measure the cost of each of the operations. Let's also try to reuse the result of alpha.read() (since we recently saw that improvement in read() performance):

times = [0.0] * 11
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha_raw = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video

    times[1] = timer()
    fr_alpha = np.float32(fr_alpha_raw)
    times[2] = timer()
    fr_alpha *= 1/255.0
    times[3] = timer()
    fr_alpha_inv = 1.0 - fr_alpha
    times[4] = timer()
    fr_fg_weighed = fr_foreground * fr_alpha
    times[5] = timer()
    fr_bg_weighed = fr_background * fr_alpha_inv
    times[6] = timer()
    sum = fr_fg_weighed + fr_bg_weighed
    times[7] = timer()
    result = np.uint8(sum)
    times[8] = timer()
    cv2.imshow('My Image', result)
    times[9] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[10] = timer()
    update_times(times, total_times)
    n += 1

New timings:

Iterations: 1786
Step 0: 6.8733 ms
Step 1: 5.2742 ms
Step 2: 1.1430 ms
Step 3: 4.5800 ms
Step 4: 7.0372 ms
Step 5: 7.0675 ms
Step 6: 5.3082 ms
Step 7: 2.6912 ms
Step 8: 0.4658 ms
Step 9: 9.6966 ms
Total: 50.1372 ms

We didn't really gain anything, but the reads got noticeably faster.


This leads to another idea -- what if we tried to minimize allocations and reuse the arrays in subsequent iterations?

We can pre-allocate the necessary arrays in the first iteration (using numpy.zeros_like), after we read the first set of frames:

if n == 0: # Pre-allocate
    fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
    fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
    fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
    fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
    sum = np.zeros_like(fr_alpha_raw, np.float32)
    result = np.zeros_like(fr_alpha_raw, np.uint8)

Now, we can use the output argument of the numpy functions (the third parameter of np.multiply, np.subtract and np.add) to write the results directly into these pre-allocated arrays instead of creating new ones on every iteration.

We can also merge steps 1 and 2 together, using a single numpy.multiply.

times = [0.0] * 10
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha_raw = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video

    if n == 0: # Pre-allocate
        fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
        fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
        fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        sum = np.zeros_like(fr_alpha_raw, np.float32)
        result = np.zeros_like(fr_alpha_raw, np.uint8)

    times[1] = timer()
    np.multiply(fr_alpha_raw, np.float32(1/255.0), fr_alpha)
    times[2] = timer()
    np.subtract(1.0, fr_alpha, fr_alpha_inv)
    times[3] = timer()
    np.multiply(fr_foreground, fr_alpha, fr_fg_weighed)
    times[4] = timer()
    np.multiply(fr_background, fr_alpha_inv, fr_bg_weighed)
    times[5] = timer()
    np.add(fr_fg_weighed, fr_bg_weighed, sum)
    times[6] = timer()
    np.copyto(result, sum, 'unsafe')
    times[7] = timer()
    cv2.imshow('My Image', result)
    times[8] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[9] = timer()
    update_times(times, total_times)
    n += 1

This gives us the following timings:

Iterations: 1786
Step 0: 7.0515 ms
Step 1: 3.8839 ms
Step 2: 1.9080 ms
Step 3: 4.5198 ms
Step 4: 4.3871 ms
Step 5: 2.7576 ms
Step 6: 1.9273 ms
Step 7: 0.4382 ms
Step 8: 7.2340 ms
Total: 34.1074 ms

Significant improvement in all the steps we modified. We're down to ~35% of the time needed by the original implementation.


Minor update:

Based on Silencer's answer I measured cv2.convertScaleAbs as well. It actually runs a bit faster:

Step 6: 1.2318 ms
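The snippet for that measurement isn't shown in the original; presumably it replaces the np.copyto() cast of step 6 with something along these lines (my reconstruction):

cv2.convertScaleAbs(sum, result)  # saturating conversion of the float32 sum into the uint8 result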

That gave me another idea -- we could take advantage of cv2.add, which lets us specify the destination data type and does a saturating cast as well. This would allow us to combine steps 5 and 6 together.

cv2.add(fr_fg_weighed, fr_bg_weighed, result, dtype=cv2.CV_8UC3)

which comes out at

Step 5: 3.3621 ms

Again a little win (previously we were around 3.9ms).

Following on from this, cv2.subtract and cv2.multiply are further candidates. We need to use a 4-element tuple to define a scalar (an intricacy of the Python bindings), and we need to explicitly define the output data type for multiplication.

    cv2.subtract((1.0, 1.0, 1.0, 0.0), fr_alpha, fr_alpha_inv)
    cv2.multiply(fr_foreground, fr_alpha, fr_fg_weighed, dtype=cv2.CV_32FC3)
    cv2.multiply(fr_background, fr_alpha_inv, fr_bg_weighed, dtype=cv2.CV_32FC3)

Timings:

Step 2: 2.1897 ms
Step 3: 2.8981 ms
Step 4: 2.9066 ms


This seems to be about as far as we can get without some parallelization. We're already taking advantage of whatever OpenCV may provide in terms of individual operations, so we should focus on pipelining our implementation.

To help me figure out how to partition the code between the different pipeline stages (threads), I made a chart that shows all the operations, our best times for them, as well as the inter-dependencies between the calculations:

WIP -- see comments for additional info while I write this up.
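The original write-up stops there. Purely as a hypothetical illustration of the pipelining idea (not something measured in the answer), a minimal two-stage version could look roughly like this, with a reader thread decoding the three streams and feeding a bounded queue, so that decoding overlaps with the blending work:

import threading
try:
    import queue              # Python 3
except ImportError:
    import Queue as queue     # Python 2

def read_frames(fg_cap, bg_cap, a_cap, out_queue):
    # Stage 1: decode the three streams and hand complete frame triples
    # to the blending stage through a bounded queue.
    while True:
        r_fg, fr_fg = fg_cap.read()
        r_bg, fr_bg = bg_cap.read()
        r_a, fr_a = a_cap.read()
        if not r_fg or not r_bg or not r_a:
            out_queue.put(None)   # signal end of video
            return
        out_queue.put((fr_fg, fr_bg, fr_a))

frames = queue.Queue(maxsize=4)   # bounded, so decoding can't run far ahead
reader = threading.Thread(target=read_frames,
                          args=(foreground, background, alpha, frames))
reader.start()

while True:
    item = frames.get()
    if item is None:
        break                     # end of video
    fr_foreground, fr_background, fr_alpha_raw = item
    # Stage 2: the scaling, blending and display steps from above go here...
    if cv2.waitKey(1) == ord('q'):
        break

reader.join()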
