C ++中的短(ASCII,每个字符7位)字符串存储和比较优化 [英] Short (ASCII, 7-bit per character) string storage and comparison optimization in C++

查看:156
本文介绍了C ++中的短(ASCII,每个字符7位)字符串存储和比较优化的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在我的项目中,我正在使用大量的ASCII 7位短字符串集,并且必须以最佳性能处理(存储,比较,搜索等)这些字符串. 基本上,我建立了一些uint64_t类型的Index数组,每个元素存储一个单词的9个字符,并将该索引用作任何字符串比较操作的Numeric元素. 当前的实现工作速度很快,但是如果您愿意的话,可能可以对其进行一些改进.

In my project I'm using huge set of short strings in ASCII 7-bit and have to process (store, compare, search etc) these strings with maximum performance. Basically, I build some Index array of uint64_t type and each element stores 9 characters of a word and use that index as Numeric element for any string comparison operation. Current implementation works fast, but may be it's possible to improve it a bit if you will..

此函数最多将9个初始字符转换为uint64_t值-该数字的任何比较均等同于标准"strcmp"函数.

This function converts up to 9 initial characters to uint64_t value - any comparison of that number is equivalent of standard "strcmp" function.

#include <cstdint>
#include <iostream>

uint64_t cnv(const char* str, size_t len)
{
    uint64_t res = 0;

    switch (len)
    {
    default:
    case 9: res = str[8];
    case 8: res |= uint64_t(str[7]) << 7;
    case 7: res |= uint64_t(str[6]) << 14;
    case 6: res |= uint64_t(str[5]) << 21;
    case 5: res |= uint64_t(str[4]) << 28;
    case 4: res |= uint64_t(str[3]) << 35;
    case 3: res |= uint64_t(str[2]) << 42;
    case 2: res |= uint64_t(str[1]) << 49;
    case 1: res |= uint64_t(str[0]) << 56;
    case 0: break;
    }

    return res;
}

int main()
{
    uint64_t v0 = cnv("000", 3);
    uint64_t v1 = cnv("0000000", 7);

    std::cout << (v1 < v0);
}

推荐答案

您可以一次加载8个字节的原始字符串,而不是将它们压缩到结果整数中(如果您的计算机具有小端数字表示,则可以将它们反转)

You may load 8 bytes of an original string at once than condense them inside a resulting integer (and reverse them if your machine has a little-endian number representation).

#include <iostream>

uint64_t ascii2ulong (const char  *s, int len)
{
    uint64_t i = (*(uint64_t*)s);
    if (len < 8) i &= ((1UL << (len<<3))-1);
#ifndef BIG_ENDIAN
    i = (i&0x007f007f007f007fUL) | ((i & 0x7f007f007f007f00) >> 1);
    i = (i&0x00003fff00003fffUL) | ((i & 0x3fff00003fff0000) >> 2);
    i = ((i&0x000000000fffffffUL) << 7) | ((i & 0x0fffffff00000000) << (7-4));
    // Note: Previous line: an additional left shift of 7 is applied
    // to make room for s[8] character
#else
    i = ((i&0x007f007f007f007fUL) << 7)  | ((i & 0x7f007f007f007f00) >> 8);
    i = ((i&0x00003fff00003fffUL) << 14) | ((i & 0x3fff00003fff0000) >> 16);
    i = ((i&0x000000000fffffffUL) << (28+7)) | ((i & 0x0fffffff00000000) >> (32-7));
#endif

    if (len > 8) i |= ((uint64_t)s[8]);
    return i;
}


//Test
std::string ulong2str(uint64_t compressed) {
    std::string s;
    for (int i = 56; i >= 0; i-=7) 
        if (char nxt=(compressed>>i)&0x7f) s+= nxt;
    return s;
}
int main() {
    std::cout << ulong2str(ascii2ulong("ABCDEFGHI", 9))<<std::endl;
    std::cout << ulong2str(ascii2ulong("ABCDE", 5))<<std::endl;
    std::cout << (ascii2ulong("AB", 2) < ascii2ulong("B", 1))<<std::endl;
    std::cout << (ascii2ulong("AB", 2) < ascii2ulong("A", 1))<<std::endl;
    return 0;
}

但是请注意:这样做会正式违反分配的地址范围(如果原始字符串分配的< 8个字节).如果运行带有内存完整性检查的程序,则可能会产生运行时错误.为避免这种情况,您当然可以使用memcpy复制所需的字节数来代替uint64_t i = (*(uint64_t*)s);:

But note: doing in such a way you formally violate allocated address ranges (if your original string has < 8 bytes allocated). If you run a program with memory sanity checking, it may produce a runtime error. To avoid this you may of course use memcpy to copy as many bytes as you need in place of uint64_t i = (*(uint64_t*)s);:

uint64_t i;
memcpy(&i,s,std::min(len,8));

如果在您的计算机上为memcpy使用了某种硬件加速(可能),则在效率方面可能还不错.

If some hardware acceleration is used for memcpy at you machine (which is likely) it may be not bad in terms of efficiency.

这篇关于C ++中的短(ASCII,每个字符7位)字符串存储和比较优化的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆