memcpy from graphic buffer is slow in Android


Problem Description


I want to capture every frame of a video and make some modifications before it is rendered on an Android device, such as a Nexus 10. As far as I know, Android uses hardware to decode and render the frames on such devices, so I should get the frame data from the GraphicBuffer, and before rendering the data will be in YUV format.


So I wrote a static method in AwesomePlayer.cpp that captures the frame data, modifies the frame, and writes it back into the GraphicBuffer for rendering.

Here is my demo code:

static void handleFrame(MediaBuffer *buffer) {

    sp<GraphicBuffer> buf = buffer->graphicBuffer();

    size_t width = buf->getWidth();
    size_t height = buf->getHeight();
    size_t ySize = buffer->range_length();
    size_t uvSize = width * height / 2;

    uint8_t *yBuffer = (uint8_t *)malloc(ySize + 1);
    uint8_t *uvBuffer = (uint8_t *)malloc(uvSize + 1);
    memset(yBuffer, 0, ySize + 1);
    memset(uvBuffer, 0, uvSize + 1);

    // Vendor-specific assumption: the fd of the UV plane is the second int
    // in the gralloc native handle.
    int const *private_handle = buf->handle->data;

    void *yAddr = NULL;
    void *uvAddr = NULL;

    // Map the Y plane through GraphicBuffer::lock and the UV plane via mmap.
    buf->lock(GRALLOC_USAGE_SW_READ_OFTEN | GRALLOC_USAGE_SW_WRITE_OFTEN, &yAddr);
    uvAddr = mmap(0, uvSize, PROT_READ | PROT_WRITE, MAP_SHARED, *(private_handle + 1), 0);

    if (yAddr != NULL && uvAddr != MAP_FAILED) {

      // memcpy data out of the graphic buffer
      memcpy(yBuffer, yAddr, ySize);
      memcpy(uvBuffer, uvAddr, uvSize);

      // modify the YUV data

      // memcpy data back into the graphic buffer
      memcpy(yAddr, yBuffer, ySize);
      memcpy(uvAddr, uvBuffer, uvSize);
    }

    if (uvAddr != MAP_FAILED) {
        munmap(uvAddr, uvSize);
    }
    buf->unlock();

    free(yBuffer);
    free(uvBuffer);

}


I printed timestamps around the memcpy calls and realized that memcpy from the GraphicBuffer takes much more time than memcpy into the GraphicBuffer. Take a video with 1920x1080 resolution, for example: memcpy from the GraphicBuffer takes about 30 ms, which is unacceptable for normal video playback.
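For reference, here is a minimal sketch of how such a measurement could be taken, assuming the planes have already been mapped as in the code above; the timeCopies and elapsedMs helpers are illustrative names, not part of the original player code.

#include <time.h>
#include <string.h>
#include <stdio.h>

// Illustrative helper: milliseconds elapsed on the monotonic clock.
static double elapsedMs(const struct timespec &start, const struct timespec &end) {
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1000000.0;
}

// Time one copy out of the mapped graphic buffer and one copy back.
static void timeCopies(void *mapped, uint8_t *shadow, size_t size) {
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(shadow, mapped, size);   // read path: graphics memory -> CPU memory
    clock_gettime(CLOCK_MONOTONIC, &t1);
    memcpy(mapped, shadow, size);   // write path: CPU memory -> graphics memory
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("copy out: %.2f ms, copy back: %.2f ms\n",
           elapsedMs(t0, t1), elapsedMs(t1, t2));
}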


I have no idea why it takes so much time. Maybe it is copying data from a GPU buffer, but copying data into the GraphicBuffer looks normal.


Could anyone who is familiar with hardware decoding in Android take a look at this issue? Thanks very much.


Update: I found that I don't have to use the GraphicBuffer to get the YUV data. I just use the hardware decoder on the video source and store the YUV data in ordinary memory, so I can read the YUV data from memory directly, which is very fast. You can find similar solutions in the AOSP source code or in open-source video player apps. I simply allocate memory buffers rather than graphic buffers and then use the hardware decoder. Sample code in AOSP: frameworks/av/cmds/stagefright/SimplePlayer.cpp

Link: https://github.com/xdtianyu/android-4.2_r1/tree/master/frameworks/av/cmds/stagefright
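As a rough illustration of that approach, the sketch below decodes a video track into ordinary CPU-accessible output buffers by configuring the decoder without a Surface. It uses the NDK AMediaCodec/AMediaExtractor API, which only became available in later Android releases; SimplePlayer.cpp itself uses the platform C++ MediaCodec API, so this shows the same idea rather than the exact AOSP code, and decodeToMemory and the file path handling are illustrative.

#include <media/NdkMediaCodec.h>
#include <media/NdkMediaExtractor.h>
#include <string.h>

// Decode a video track into CPU-accessible output buffers (no Surface),
// so the YUV frames can be read and modified directly in memory.
static void decodeToMemory(const char *path /* hypothetical input file */) {
    AMediaExtractor *ex = AMediaExtractor_new();
    if (AMediaExtractor_setDataSource(ex, path) != AMEDIA_OK) return;

    AMediaCodec *codec = NULL;
    for (size_t i = 0; i < AMediaExtractor_getTrackCount(ex); i++) {
        AMediaFormat *fmt = AMediaExtractor_getTrackFormat(ex, i);
        const char *mime = NULL;
        if (AMediaFormat_getString(fmt, AMEDIAFORMAT_KEY_MIME, &mime) &&
            !strncmp(mime, "video/", 6)) {
            AMediaExtractor_selectTrack(ex, i);
            codec = AMediaCodec_createDecoderByType(mime);
            // No Surface: output buffers stay in normal memory as YUV.
            AMediaCodec_configure(codec, fmt, NULL /* surface */, NULL, 0);
            AMediaCodec_start(codec);
            AMediaFormat_delete(fmt);
            break;
        }
        AMediaFormat_delete(fmt);
    }
    if (!codec) { AMediaExtractor_delete(ex); return; }

    bool inputDone = false, outputDone = false;
    while (!outputDone) {
        if (!inputDone) {
            ssize_t inIdx = AMediaCodec_dequeueInputBuffer(codec, 10000);
            if (inIdx >= 0) {
                size_t cap = 0;
                uint8_t *in = AMediaCodec_getInputBuffer(codec, inIdx, &cap);
                ssize_t n = AMediaExtractor_readSampleData(ex, in, cap);
                if (n < 0) {
                    AMediaCodec_queueInputBuffer(codec, inIdx, 0, 0, 0,
                        AMEDIACODEC_BUFFER_FLAG_END_OF_STREAM);
                    inputDone = true;
                } else {
                    AMediaCodec_queueInputBuffer(codec, inIdx, 0, n,
                        AMediaExtractor_getSampleTime(ex), 0);
                    AMediaExtractor_advance(ex);
                }
            }
        }

        AMediaCodecBufferInfo info;
        ssize_t outIdx = AMediaCodec_dequeueOutputBuffer(codec, &info, 10000);
        if (outIdx >= 0) {
            size_t size = 0;
            uint8_t *yuv = AMediaCodec_getOutputBuffer(codec, outIdx, &size);
            // yuv points at plain memory: read or modify the frame here
            // without any copy out of a GraphicBuffer.
            (void)yuv;
            AMediaCodec_releaseOutputBuffer(codec, outIdx, false /* don't render */);
            if (info.flags & AMEDIACODEC_BUFFER_FLAG_END_OF_STREAM) outputDone = true;
        }
    }

    AMediaCodec_stop(codec);
    AMediaCodec_delete(codec);
    AMediaExtractor_delete(ex);
}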

Recommended Answer


Most likely, the data path (a.k.a. the data bus) from your CPU to graphics memory is optimized, while the path from graphics memory back to the CPU may not be. Optimizations can include internal data buses of different speeds, level 1 or level 2 caches, and wait states.


The electronics (hardware) set the maximum speed for transferring data from graphics memory to your CPU. The CPU's memory is probably slower than your graphics memory, so there may be wait states so that the graphics memory can match the slower speed of the CPU memory.


Another issue is all the devices sharing the data bus. Imagine a shared highway between cities: to optimize traffic, traffic is only allowed in one direction at a time, and traffic signals or people monitor it. To go from City A to City C, one has to wait until the traffic signals or the director clear the remaining traffic and give the City A to City C route priority. In hardware terms, this is called bus arbitration.


On most platforms, the CPU transfers data between registers and CPU memory; this is how the variables in your program are read and written. The slow route for transferring data is for the CPU to read memory into a register and then write it to graphics memory. A more efficient method is to transfer the data without using the CPU at all. There may be a device, a DMA (Direct Memory Access) controller, that can do this: you tell it the source and target memory locations and start it, and it transfers the data without involving the CPU.


Unfortunately, the DMA must share the data bus with the CPU. This means that your data transfer will be slowed by any requests for the data bus by the CPU. It will still be faster than using the CPU to transfer the data as the DMA can be transferring the data while the CPU is executing instructions that don't require the data bus.


Summary
Your memory transfers may be slow if you don't have a DMA device. With or without DMA, the data bus is shared by multiple devices and traffic is arbitrated, which sets the maximum overall speed for transferring data. The data transfer speed of the memory chips also contributes to the transfer rate. Hardware-wise, there is a speed limit.


Optimizations
1. Use DMA, if possible.
2. If only the CPU is available, have the CPU transfer the largest chunks possible. This means using instructions specifically intended for copying memory.
3. If your CPU doesn't have specialized copy instructions, transfer using the word size of the processor: if the processor has 32-bit words, transfer 4 bytes at a time as one word rather than as four 8-bit copies (see the sketch after this list).
4. Reduce CPU demands and interruptions during the transfer. Pause any applications; disable interrupts if possible.
5. Divide the effort: have one core transfer the data while another core executes your program.
6. Threading on a single core may actually slow the transfer, as the OS gets involved because of scheduling. Thread switching takes time, which adds to the transfer time.
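A minimal sketch of item 3, copying word by word instead of byte by byte. The wordCopy helper is an illustrative addition that assumes 4-byte-aligned buffers; in practice, the platform's memcpy or a NEON-optimized routine usually does this (and more) already.

#include <stddef.h>
#include <stdint.h>

// Copy 'size' bytes using 32-bit words where possible, falling back to
// bytes for the tail. Assumes src and dst are 4-byte aligned.
static void wordCopy(void *dst, const void *src, size_t size) {
    uint32_t *d32 = (uint32_t *)dst;
    const uint32_t *s32 = (const uint32_t *)src;
    size_t words = size / sizeof(uint32_t);

    for (size_t i = 0; i < words; i++) {
        d32[i] = s32[i];            // one 32-bit transfer instead of four 8-bit ones
    }

    uint8_t *d8 = (uint8_t *)(d32 + words);
    const uint8_t *s8 = (const uint8_t *)(s32 + words);
    for (size_t i = 0; i < size % sizeof(uint32_t); i++) {
        d8[i] = s8[i];              // copy the remaining tail bytes
    }
}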
