'memcpy'-like function that supports offsets by individual bits?

Problem description

I was thinking about solving this, but it's looking to be quite a task. If I take this one by myself, I'll likely write it several different ways and pick the best, so I thought I'd ask this question to see if there's a good library that solves this already or if anyone has thoughts/advice.

void OffsetMemCpy(u8* pDest, const u8* pSrc, u8 srcBitOffset, size_t size)
{
    // Or something along these lines. srcBitOffset is 0-7, so the pSrc buffer
    // needs to be up to one byte longer than it would need to be in memcpy.
    // Maybe explicitly providing the end of the buffer is best.
    // Also note that pSrc has NO alignment assumptions at all.
}

My application is time critical so I want to nail this with minimal overhead. This is the source of the difficulty/complexity. In my case, the blocks are likely to be quite small, perhaps 4-12 bytes, so big-scale memcpy stuff (e.g. prefetch) isn't that important. The best result would be the one that benches fastest for constant 'size' input, between 4 and 12, for randomly unaligned src buffers.

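  • Memory should be moved in word-sized blocks where possible.
  • The alignment of these word-sized blocks matters. pSrc is unaligned, so we may need to read a few bytes off the front until it is aligned.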

Anyone have, or know of, a similar implemented thing? Or does anyone want to take a stab at writing this, getting it to be as clean and efficient as possible?

It seems people are voting to close this as "too broad". A few narrowing details: AMD64 is the preferred architecture, so let's assume that. This means little endian etc. The implementation would hopefully fit well within the size of an answer, so I don't think this is too broad. I'm asking for answers that are a single implementation at a time, even though there are a few approaches.

Recommended answer

I would start with a simple implementation such as this:

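#include <limits.h>  // CHAR_BIT
#include <stddef.h>  // size_t
#include <stdint.h>  // uint8_t

// Copies `size` bytes to pDest, starting srcBitOffset (0-7) bits into pSrc.
// Bits are counted from the most significant end of each byte.
// When srcBitOffset != 0, pSrc must have size + 1 readable bytes.
static inline void OffsetMemCpy(uint8_t* pDest, const uint8_t* pSrc, const uint8_t srcBitOffset, const size_t size)
{
    if (srcBitOffset == 0)
    {
        // Byte-aligned case: plain copy, no extra source byte is read.
        for (size_t i = 0; i < size; ++i)
        {
            pDest[i] = pSrc[i];
        }
    }
    else if (size > 0)
    {
        uint8_t v0 = pSrc[0];
        for (size_t i = 0; i < size; ++i)
        {
            // Each output byte is the low bits of v0 followed by the high bits of v1.
            uint8_t v1 = pSrc[i + 1];
            pDest[i] = (uint8_t)((v0 << srcBitOffset) | (v1 >> (CHAR_BIT - srcBitOffset)));
            v0 = v1;
        }
    }
}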

(warning: untested code!).

Once this is working then profile it in your application - you may find it's plenty fast enough for your needs and thereby avoid the pitfalls of premature optimisation. If not then you have a useful baseline reference implementation for further optimisation work.
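Since the code above is flagged as untested, one way to gain confidence in it before profiling is to compare it with a deliberately slow bit-at-a-time version. The following is only a sketch under stated assumptions: `GetBit`, `OffsetMemCpyBitwise` and `OffsetMemCpyMatchesReference` are hypothetical names that do not appear in the question or the answer, and bits are assumed to be numbered from the most significant end of each byte, which is what the shifts in the byte loop imply.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The simple byte loop from the answer, repeated so this block stands alone. */
static void OffsetMemCpy(uint8_t *pDest, const uint8_t *pSrc,
                         uint8_t srcBitOffset, size_t size)
{
    if (srcBitOffset == 0) {
        for (size_t i = 0; i < size; ++i)
            pDest[i] = pSrc[i];
    } else if (size > 0) {
        uint8_t v0 = pSrc[0];
        for (size_t i = 0; i < size; ++i) {
            uint8_t v1 = pSrc[i + 1];
            pDest[i] = (uint8_t)((v0 << srcBitOffset) |
                                 (v1 >> (CHAR_BIT - srcBitOffset)));
            v0 = v1;
        }
    }
}

/* Hypothetical helper: bit number `bit` of a buffer, counted MSB-first. */
static unsigned GetBit(const uint8_t *p, size_t bit)
{
    return (p[bit / CHAR_BIT] >> (CHAR_BIT - 1 - bit % CHAR_BIT)) & 1u;
}

/* Hypothetical slow reference: builds every output byte one bit at a time. */
static void OffsetMemCpyBitwise(uint8_t *pDest, const uint8_t *pSrc,
                                uint8_t srcBitOffset, size_t size)
{
    assert(srcBitOffset < CHAR_BIT);
    for (size_t i = 0; i < size; ++i) {
        unsigned byte = 0;
        for (unsigned b = 0; b < CHAR_BIT; ++b)
            byte = (byte << 1) | GetBit(pSrc, i * CHAR_BIT + b + srcBitOffset);
        pDest[i] = (uint8_t)byte;
    }
}

/* Returns 1 if both versions agree for every bit offset 0-7 and size 0-12. */
int OffsetMemCpyMatchesReference(void)
{
    uint8_t src[13], fast[12], slow[12];
    for (size_t i = 0; i < sizeof src; ++i)
        src[i] = (uint8_t)(i * 73u + 41u);   /* arbitrary test pattern */

    for (unsigned off = 0; off < CHAR_BIT; ++off) {
        for (size_t size = 0; size <= 12; ++size) {
            OffsetMemCpy(fast, src, (uint8_t)off, size);
            OffsetMemCpyBitwise(slow, src, (uint8_t)off, size);
            if (memcmp(fast, slow, size) != 0)
                return 0;
        }
    }
    return 1;
}
```

The same check can later be pointed at any optimised variant, so the byte loop keeps serving as the reference the answer describes.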

Be aware that for small copies the overhead of testing for alignment and word-sized copies etc may well outweigh any benefits, so a simple byte by byte loop such as the above may well be close to optimal.
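For comparison, a word-sized variant might look like the sketch below. It is not part of the original answer: `OffsetMemCpyWords`, `LoadBE64` and `StoreBE64` are hypothetical names, and it assumes the same MSB-first bit numbering as the byte loop. Because that numbering is big-endian, each 64-bit word is assembled in big-endian order; compilers often reduce such helpers to a single load or store plus a byte swap on AMD64, but that is an assumption worth confirming in the generated code. Whether this beats the byte loop for 4-12 byte copies is exactly what profiling would have to show.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 8 bytes as a big-endian 64-bit value. */
static uint64_t LoadBE64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v = (v << 8) | p[i];
    return v;
}

/* Hypothetical helper: write a 64-bit value as 8 big-endian bytes. */
static void StoreBE64(uint8_t *p, uint64_t v)
{
    for (int i = 7; i >= 0; --i) {
        p[i] = (uint8_t)v;
        v >>= 8;
    }
}

/* Same contract as the byte loop: when srcBitOffset != 0, pSrc must have
   size + 1 readable bytes. No alignment is assumed for either pointer. */
void OffsetMemCpyWords(uint8_t *pDest, const uint8_t *pSrc,
                       uint8_t srcBitOffset, size_t size)
{
    assert(srcBitOffset < CHAR_BIT);
    if (srcBitOffset == 0) {
        memcpy(pDest, pSrc, size);
        return;
    }

    const unsigned up = srcBitOffset;
    const unsigned down = CHAR_BIT - srcBitOffset;
    size_t i = 0;

    /* 8 output bytes per step: one 64-bit word shifted left, with the
       vacated low bits filled from the top of the ninth source byte. */
    for (; i + 8 <= size; i += 8) {
        const uint64_t w = LoadBE64(pSrc + i);
        const uint64_t out = (w << up) | (uint64_t)(pSrc[i + 8] >> down);
        StoreBE64(pDest + i, out);
    }

    /* Remaining 0-7 bytes: fall back to the byte-by-byte form. */
    for (; i < size; ++i)
        pDest[i] = (uint8_t)((pSrc[i] << up) | (pSrc[i + 1] >> down));
}
```

Each 8-byte step reads nine source bytes, which matches the "one byte longer" requirement stated in the question, so the sketch never reads past `pSrc[size]`.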

Note also that optimisations may well be architecture-dependent - micro-optimisations which give a benefit on one CPU may well be counter-productive on another.
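Because the outcome depends on the CPU, timing candidate implementations on the target machine is the safest guide. The harness below is a rough sketch, not a rigorous benchmark: `TimeOffsetMemCpy` is a hypothetical name, `clock()` has coarse resolution, and the loop overhead is included in the result, so it is only meaningful for comparing variants under identical conditions.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* The simple byte loop from the answer, repeated so this block stands alone. */
static void OffsetMemCpy(uint8_t *pDest, const uint8_t *pSrc,
                         uint8_t srcBitOffset, size_t size)
{
    if (srcBitOffset == 0) {
        for (size_t i = 0; i < size; ++i)
            pDest[i] = pSrc[i];
    } else if (size > 0) {
        uint8_t v0 = pSrc[0];
        for (size_t i = 0; i < size; ++i) {
            uint8_t v1 = pSrc[i + 1];
            pDest[i] = (uint8_t)((v0 << srcBitOffset) |
                                 (v1 >> (CHAR_BIT - srcBitOffset)));
            v0 = v1;
        }
    }
}

/* Hypothetical harness: average CPU seconds per call for a fixed size
   (1-12), cycling through bit offsets and source byte alignments.
   Returns a negative value for arguments outside that range. */
double TimeOffsetMemCpy(size_t size, unsigned long iterations)
{
    uint8_t src[32];
    uint8_t dst[16];
    volatile uint8_t sink = 0;  /* keeps the copies from being optimised away */

    if (size == 0 || size > 12 || iterations == 0)
        return -1.0;

    for (size_t i = 0; i < sizeof src; ++i)
        src[i] = (uint8_t)(i * 73u + 41u);   /* arbitrary test pattern */

    const clock_t start = clock();
    for (unsigned long n = 0; n < iterations; ++n) {
        /* Start 0-15 bytes into src; at most byte 15 + 12 = 27 is read. */
        const uint8_t *p = src + ((n >> 3) & 15u);
        OffsetMemCpy(dst, p, (uint8_t)(n % CHAR_BIT), size);
        sink = (uint8_t)(sink ^ dst[0]);
    }
    const clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC / (double)iterations;
}
```

Calling it once per size from 4 to 12, with a large iteration count, gives a number per size that can be compared across implementations and machines.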
