memcpy from graphic buffer is slow in Android


Question

I want to capture every frame of a video and make some modifications before it is rendered on an Android device, such as the Nexus 10. As far as I know, Android uses hardware to decode and render the frame on such devices, so I should get the frame data from the GraphicBuffer; before rendering, the data is in YUV format.

I also wrote a static method in AwesomePlayer.cpp that captures the frame data, modifies the frame, and writes it back into the GraphicBuffer for rendering.

Here is my demo code:

static void handleFrame(MediaBuffer *buffer) {

    sp<GraphicBuffer> buf = buffer->graphicBuffer();

    size_t width = buf->getWidth();
    size_t height = buf->getHeight();
    size_t ySize = buffer->range_length();
    size_t uvSize = width * height / 2;

    uint8_t *yBuffer = (uint8_t *)malloc(ySize + 1);
    uint8_t *uvBuffer = (uint8_t *)malloc(uvSize + 1);
    memset(yBuffer, 0, ySize + 1);
    memset(uvBuffer, 0, uvSize + 1);

    // On this device the second int of the private handle is the fd
    // backing the UV plane
    int const *private_handle = buf->handle->data;

    void *yAddr = NULL;
    void *uvAddr = NULL;

    // Map the Y plane through gralloc and the UV plane directly via mmap
    // (note mmap takes a final offset argument, and fails with MAP_FAILED,
    // not NULL)
    buf->lock(GRALLOC_USAGE_SW_READ_OFTEN | GRALLOC_USAGE_SW_WRITE_OFTEN, &yAddr);
    uvAddr = mmap(0, uvSize, PROT_READ | PROT_WRITE, MAP_SHARED, *(private_handle + 1), 0);

    if(yAddr != NULL && uvAddr != MAP_FAILED) {

      //memcpy data from graphic buffer -- this is the slow direction
      memcpy(yBuffer, yAddr, ySize);
      memcpy(uvBuffer, uvAddr, uvSize);

      //modify the YUV data

      //memcpy data back into graphic buffer
      memcpy(yAddr, yBuffer, ySize);
      memcpy(uvAddr, uvBuffer, uvSize);
    }

    if(uvAddr != MAP_FAILED)
        munmap(uvAddr, uvSize);
    buf->unlock();

    free(yBuffer);
    free(uvBuffer);

}

I printed timestamps around the memcpy calls and realized that memcpy from the GraphicBuffer takes much more time than memcpy into the GraphicBuffer. Taking a video with 1920x1080 resolution as an example, memcpy from the GraphicBuffer takes about 30 ms, which is unacceptable for normal video playback.

I have no idea why it takes so much time; maybe it copies data from a GPU buffer. Copying data into the GraphicBuffer, however, looks normal.

Could anyone who is familiar with hardware decoding on Android take a look at this issue? Thanks very much.

Update: I found that I didn't have to use a GraphicBuffer to get the YUV data. I just used the hardware decoder on the video source and stored the YUV data in ordinary memory, so I could read the YUV data from memory directly, which is very fast. You can find similar solutions in the AOSP source code and in open-source video player apps. I simply allocate memory buffers rather than graphic buffers and then use the hardware decoder. Sample code in AOSP: frameworks/av/cmds/stagefright/SimplePlayer.cpp

Link: https://github.com/xdtianyu/android-4.2_r1/tree/master/frameworks/av/cmds/stagefright

Answer

Most likely, the data path (a.k.a. the data bus) from your CPU to graphics memory is optimized, while the path from graphics memory back to the CPU may not be. Optimizations may include internal data buses of different speeds, level 1 or level 2 caches, and wait states.

The electronics (hardware) set the maximum speed for transferring data from graphics memory to your CPU. The CPU's memory is probably slower than your graphics memory, so wait states may be inserted so the graphics memory can match the slower speed of the CPU memory.

Another issue is all the devices sharing the data bus. Imagine a shared highway between cities: to optimize traffic, travel is only allowed in one direction at a time, and traffic signals, or a director, control the flow. To go from City A to City C, one has to wait until the signals clear the remaining traffic and give the A-to-C route priority. In hardware terms, this is called bus arbitration.

On most platforms, the CPU transfers data between its registers and CPU memory; this is how your program's variables are read and written. The slow route is for the CPU to read memory into a register and then write it to graphics memory. A more efficient method is to transfer the data without the CPU at all. There may be a DMA (Direct Memory Access) device that can do this: you tell it the source and target memory locations and start it, and it transfers the data without using the CPU.

Unfortunately, the DMA device must share the data bus with the CPU, which means your transfer will be slowed by any bus requests the CPU makes. It is still faster than using the CPU to copy, because the DMA can move data while the CPU executes instructions that don't require the data bus.

Summary
Your memory transfers may be slow if you don't have a DMA device. With or without DMA, the data bus is shared by multiple devices and its traffic is arbitrated; this sets the maximum overall speed for transferring data. The transfer speeds of the memory chips themselves also contribute to the rate. Hardware-wise, there is a speed limit.

Optimizations
1. Use DMA, if possible.
2. If only using the CPU, have it transfer the largest chunks possible. This means using instructions specialized for copying memory.
3. If your CPU doesn't have specialized copy instructions, transfer using the word size of the processor. If the processor has 32-bit words, transfer 4 bytes at a time as one word rather than as four 8-bit copies.
4. Reduce CPU demands and interruptions during the transfer: pause other applications, and disable interrupts if possible.
5. Divide the effort: have one core transfer the data while another core executes your program.
6. Threading on a single core may actually slow the transfer, since the OS gets involved for scheduling; thread switching takes time, which adds to the transfer time.

