Performing Intra-frame Prediction in Matlab

Problem Description

I am trying to implement a hybrid video coding framework as used in the H.264/MPEG-4 video standard, for which I need to perform 'Intra-frame Prediction' and 'Inter Prediction' (in other words, motion estimation) on a set of 30 frames for video processing in Matlab. I am working with the Mother-Daughter frames.

Please note that this post is very similar to my previously asked question but this one is solely based on Matlab computation.

Edit: I am trying to implement the framework shown below:

My question is: how do I perform the horizontal coding method, which is one of the nine modes of the Intra Coding framework? How are the pixels sampled?

What I find confusing is that Intra Prediction needs two inputs which are the 8x8 blocks of input frame and the 8x8 blocks of reconstructed frame. But what happens when coding the very first block of the input frame since there will be no reconstructed pixels to perform horizontal coding?

In the image above, the whole system is a closed loop; where do you start?

End of edit.

Question 1: Is the intra-predicted image only for the first image (I-frame) of the sequence, or does it need to be computed for all 30 frames?

I know that there are five intra coding modes, which are horizontal, vertical, DC, left-up to right-down, and right-up to left-down.

Question 2: How do I actually go about comparing the reconstructed frame and the anchor frame (the original current frame)?

Question 3: Why do I need a search area? Can the individual 8x8 blocks be used as a search area, processed one pixel at a time?

I know that pixels from the reconstructed block are used for the comparison, but is it done one pixel at a time within the search area? If so, wouldn't that be too time-consuming if 30 frames are to be processed?

Recommended Answer

Continuing on from our previous post, let's answer one question at a time.

Usually, you use one I-frame and denote this as the reference frame. Once you use this, for each 8 x 8 block that's in your reference frame, you take a look at the next frame and figure out where this 8 x 8 block best moved in this next frame. You describe this displacement as a motion vector and you construct a P-frame that consists of this information. This tells you where the 8 x 8 block from the reference frame best moved in this frame.
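As a concrete (if simplified) sketch of this search, here is a plain-Python exhaustive block-matching routine. The SAD (sum of absolute differences) cost and the full search over the window are my own illustrative choices; real encoders use fancier costs and much faster search strategies.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally-sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]

def motion_vector(ref_frame, cur_frame, top, left, size, search):
    """Exhaustively scan a +/- `search` pixel window in cur_frame for the
    displacement (dy, dx) of the reference block with the smallest SAD."""
    ref_block = get_block(ref_frame, top, left, size)
    height, width = len(cur_frame), len(cur_frame[0])
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            if 0 <= t <= height - size and 0 <= l <= width - size:
                cost = sad(ref_block, get_block(cur_frame, t, l, size))
                if best_cost is None or cost < best_cost:
                    best_cost, best_mv = cost, (dy, dx)
    return best_mv
```

The `search` parameter here is exactly the "search area" asked about in Question 3: it bounds how far from the block's original position the encoder is willing to look.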

Now, the next question you may be asking is how many frames is it going to take before we decide to use another reference frame? This is entirely up to you, and you set this up in your decoder settings. For digital broadcast and DVD storage, it is recommended that you generate an I-frame every 0.5 seconds or so. Assuming 24 frames per second, this means that you would need to generate an I-frame every 12 frames. This Wikipedia article was where I got this reference.
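The arithmetic above is simply frames per second times the I-frame period; as a trivial sketch (the 0.5-second default is the recommendation quoted above):

```python
def iframe_interval(fps, seconds_between_iframes=0.5):
    """Number of frames between consecutive I-frames (the GOP length)."""
    return round(fps * seconds_between_iframes)
```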

As for the intra-coding modes, these tell the encoder in what direction you should look when trying to find the best matching block. Actually, take a look at this paper that talks about the different prediction modes. Take a look at Figure 1; it provides a very nice summary of the various prediction modes. In fact, there are nine altogether. Also take a look at this Wikipedia article to get better pictorial representations of the different prediction mechanisms. In order to get the best accuracy, they also do subpixel estimation at 1/4-pixel accuracy by performing bilinear interpolation between the pixels.
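To make two of these modes concrete, here is a minimal Python sketch of horizontal and DC prediction for one block. The real H.264 neighbor-availability and rounding rules are more involved, but the `128` fallback in DC mode is what the standard uses for 8-bit video when no neighbors have been reconstructed yet, which also answers the question of what happens at the very first block of a frame.

```python
def predict_horizontal(left_neighbors, size=8):
    """Horizontal mode: every row of the predicted block copies the
    already-reconstructed pixel immediately to its left."""
    return [[left_neighbors[r]] * size for r in range(size)]

def predict_dc(left_neighbors, top_neighbors, size=8):
    """DC mode: every pixel is the rounded mean of the available
    reconstructed neighbors; with no neighbors at all (the very first
    block), fall back to mid-gray (128 for 8-bit samples)."""
    neighbors = (left_neighbors or []) + (top_neighbors or [])
    if not neighbors:
        return [[128] * size for _ in range(size)]
    dc = round(sum(neighbors) / len(neighbors))
    return [[dc] * size for _ in range(size)]
```

So horizontal mode cannot even be used for the top-left block: with no reconstructed column to its left, the encoder falls back to DC prediction with the fixed mid-gray value.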

I'm not sure whether you need to implement just motion compensation with P-frames, or whether you need B-frames as well. I'm going to assume you'll be needing both. As such, take a look at this diagram I pulled off of Wikipedia:

Source: Wikipedia

This is a very common sequence for encoding frames in your video. It follows the format of:

IBBPBBPBBI...

There is a time axis at the bottom that tells you the sequence of frames that get sent to the decoder once you encode the frames. I-frames need to be encoded first, followed by P-frames, and then B-frames. A typical sequence of frames that are encoded in between the I-frames follow this format that you see in the figure. The chunk of frames in between I-frames is what is known as a Group of Pictures (GOP). If you remember from our previous post, B-frames use information from ahead and from behind its current position. As such, to summarize the timeline, this is what is usually done on the encoder side:

  • The I-frame is encoded first and is then used to predict the first P-frame
  • The first and second B-frames in between are then predicted using the first I-frame and the first P-frame
  • The second P-frame is predicted using the first P-frame, and the third and fourth B-frames are created using the information between the first and second P-frames
  • Finally, the last frame in the GOP is an I-frame. This is encoded, and then the fifth and sixth B-frames are generated using the information between the second P-frame and this final I-frame
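To summarize the reordering just described, here is a small Python sketch that turns a GOP given in display order into coding/transmission order. It is a simplification that assumes each B-frame depends only on the nearest reference frame on each side.

```python
def coding_order(display_gop):
    """Map a GOP in display order (e.g. 'IBBPBBPBBI') to coding order:
    each reference frame (I or P) is coded before the B-frames that
    sit between it and the previous reference."""
    order, pending_b = [], []
    for i, frame_type in enumerate(display_gop):
        if frame_type == 'B':
            pending_b.append(i)          # must wait for the next reference
        else:                            # I or P: code it now...
            order.append(i)
            order.extend(pending_b)      # ...then the B-frames it enables
            pending_b = []
    return order
```

For `'IBBPBBPBBI'` this yields the index order `[0, 3, 1, 2, 6, 4, 5, 9, 7, 8]`, i.e. the frame types are sent as I P B B P B B I B B.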

Therefore, what needs to happen is that you send I-frames first, then the P-frames, and then the B-frames after. The decoder has to wait for the P-frames before the B-frames can be reconstructed. However, this method of decoding is more robust because:

  • It minimizes the problem of possible uncovered areas.
  • P-frames and B-frames need less data than I-frames, so less data is transmitted.

However, B-frames will require more motion vectors, and so there will be some higher bit rates here.

Honestly, what I have seen people do is a simple Sum of Squared Differences (SSD) between one frame and another to compare similarity. You take the colour components (whether RGB, YUV, etc.) of each pixel from one frame at one position, subtract from these the colour components at the same spatial location in the other frame, square each difference, and add them all together. You accumulate all of these differences for every location in the frame. The higher the value, the more dissimilar one frame is from the next.
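That SSD computation fits in a few lines of plain Python. Frames here are lists of rows of colour-component tuples; this layout is just for illustration.

```python
def ssd(frame_a, frame_b):
    """Sum of squared differences: accumulate the squared difference of
    every colour component at every pixel. Higher = more dissimilar."""
    total = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for px_a, px_b in zip(row_a, row_b):
            # each pixel is a tuple of components, e.g. (R, G, B) or (Y, U, V)
            total += sum((a - b) ** 2 for a, b in zip(px_a, px_b))
    return total
```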

Another measure that is well known is called Structural Similarity where some statistical measures such as mean and variance are used to assess how similar two frames are.
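The real SSIM metric (Wang et al.) is computed over local sliding windows and then averaged; the following whole-signal simplification is only meant to illustrate how the means, variances, and covariance enter the formula.

```python
def global_ssim(x, y, data_range=255):
    """Simplified, whole-signal SSIM between two equal-length pixel lists.
    1.0 means identical; lower (possibly negative) means less similar."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2   # stabilizing constants from the SSIM paper
    c2 = (0.03 * data_range) ** 2
    return (((2 * mu_x * mu_y + c1) * (2 * cov + c2))
            / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))
```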

There are a whole bunch of other video quality metrics that are used, and there are advantages and disadvantages when using any of them. Rather than telling you which one to use, I defer you to this Wikipedia article so you can decide which one to use for yourself depending on your application. This Wikipedia article describes a whole bunch of similarity and video quality metrics, and the buck doesn't stop there. There is still on-going research on what numerical measures best capture the similarity and quality between two frames.

When searching for where an 8 x 8 block from the I-frame has moved in a P-frame, you need to restrict the search to a finite-sized window around the location of this I-frame block, because you don't want the encoder to search all of the locations in the frame. That would simply be too computationally intensive and would make your encoding slow. I actually mentioned this in our previous post.

Using one pixel to search for another pixel in the next frame is a very bad idea because of the minuscule amount of information that this single pixel contains. The reason why you compare blocks at a time when doing motion estimation is because usually, blocks of pixels have a lot of variation inside the blocks which are unique to the block itself. If we can find this same variation in another area in your next frame, then this is a very good candidate that this group of pixels moved together to this new block. Remember, we're assuming that the frame rate for video is adequately high enough so that most of the pixels in your frame either don't move at all, or move very slowly. Using blocks allows the matching to be somewhat more accurate.

Blocks are compared at a time, and the way blocks are compared is using one of those video similarity measures that I talked about in the Wikipedia article I referenced. You are certainly correct in that doing this for 30 frames would indeed be slow, but there are implementations that exist that are highly optimized to do the encoding very fast. One good example is FFMPEG. In fact, I use FFMPEG at work all the time. FFMPEG is highly customizable, and you can create an encoder / decoder that takes advantage of the architecture of your system. I have it set up so that encoding / decoding uses all of the cores on my machine (8 in total).

This doesn't really answer the actual block comparison itself. Actually, the H.264 standard has a bunch of prediction mechanisms in place so that you're not looking at all of the blocks in an I-frame to predict the next P-frame (or one P-frame to the next P-frame, etc.). This alludes to the different prediction modes in the Wikipedia article and in the paper that I referred you to. The encoder is intelligent enough to detect a pattern, and then generalize an area of your image where it believes that this will exhibit the same amount of motion. It skips this area and moves onto the next.

This assignment (in my opinion) is way too broad. There are so many intricacies in doing motion prediction / compensation that there is a reason why most video engineers use the readily available tools to do the work for us. Why reinvent the wheel when it has already been perfected, right?

I hope this has adequately answered your questions. I believe that I have given you more questions than answers really, but I hope that this is enough for you to delve into this topic further to achieve your overall goal.

Good luck!
