Seeking Unicode-savvy function for searching text in binary data

Question

I need to find Unicode text inside binary data (files).

I'm seeking any C or C++ code or library that I can use on macOS. Since I guess this would be useful on other platforms as well, I'd rather not make this question specific to macOS.

On macOS, the NSString functions would meet my need for Unicode savviness, but they can't be used here because they do not work on binary data.

As an alternative, I've tried the POSIX-compliant regex functions provided on macOS, but they have some limitations:

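  • They are not normalization-savvy: if I search for a precomposed (NFC) character, it is not found when the target data contains that character in decomposed (NFD) form.
  • Case-insensitive search does not work for Latin NFC characters (searching for Ü does not find ü).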

Example code showing these results is below.

What code or library is out there that fulfills these needs?

I do not need regex capabilities, but if there's a regex lib that can handle these requirements, I'm fine with that, too.

Basically, I need a Unicode text search with the following options:

  • case-insensitive
  • normalization-insensitive
  • diacritics-insensitive
  • works on arbitrary binary data, finding matching UTF-8 text fragments

Here's the test code showing the results from using the TRE regex implementation on macOS:

#include <stdio.h>
#include <regex.h>

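// Searches for `what` in `data` as a case-insensitive literal and prints the result.
// REG_LITERAL is not part of POSIX; it is an extension of the macOS regex implementation.
void findIn (const char *what, const char *data, int whatPre, int dataPre) {
    regex_t re;
    if (regcomp (&re, what, REG_ICASE | REG_LITERAL) != 0) {
        printf ("regcomp failed\n");
        return;
    }
    int found = regexec(&re, data, 0, NULL, 0) == 0;
    regfree (&re);  // release the compiled pattern
    printf ("Found %s (%s) in %s (%s): %s\n", what, whatPre?"pre":"dec", data, dataPre?"pre":"dec", found?"yes":"no");
}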

void findInBoth (const char *what, int whatPre) {
    char dataPre[] = { '<', 0xC3, 0xA4, '>', 0};        // precomposed
    char dataDec[] = { '<', 0x61, 0xCC, 0x88, '>', 0};  // decomposed
    findIn (what, dataPre, whatPre, 1);
    findIn (what, dataDec, whatPre, 0);
}

int main(int argc, const char * argv[]) {
    char a_pre[] = { 0xC3, 0xA4, 0};        // precomposed ä
    char a_dec[] = { 0x61, 0xCC, 0x88, 0};  // decomposed ä
    char A_pre[] = { 0xC3, 0x84, 0};        // precomposed Ä
    char A_dec[] = { 0x41, 0xCC, 0x88, 0};  // decomposed Ä

    findInBoth (a_pre, 1);
    findInBoth (a_dec, 0);
    findInBoth (A_pre, 1);
    findInBoth (A_dec, 0);

    return 0;
}

The output is:

Found ä (pre) in <ä> (pre): yes
Found ä (pre) in <ä> (dec): no
Found ä (dec) in <ä> (pre): no
Found ä (dec) in <ä> (dec): yes
Found Ä (pre) in <ä> (pre): no
Found Ä (pre) in <ä> (dec): no
Found Ä (dec) in <ä> (pre): no
Found Ä (dec) in <ä> (dec): yes

Desired output: All cases should give "yes"

Recommended answer

I've solved the issue by writing my own preprocessor, which generates a regular expression that combines all the alternatives (case and normalization, but not diacritics), and passing that to the regex function.

The complete solution is documented here.
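
The preprocessor idea can be sketched roughly as follows. This is not the answer author's actual code, only a minimal illustration under stated assumptions: the equivalence table `kClasses` is hand-written and hypothetical (a real implementation would derive case and normalization variants from Unicode data), and the helper names `find_class`, `build_pattern` and `contains` are made up for this example. Each character of the search text is expanded into an alternation of its case and normalization variants, and the resulting pattern is handed to the POSIX extended regex functions without `REG_ICASE`.

```c
#include <ctype.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical hand-written table: each row lists UTF-8 byte sequences
   that should be treated as the same character (NFC/NFD, lower/upper). */
#define FORMS 4
static const char *const kClasses[][FORMS] = {
    { "\xC3\xA4", "a\xCC\x88", "\xC3\x84", "A\xCC\x88" },  /* a-umlaut */
    { "\xC3\xBC", "u\xCC\x88", "\xC3\x9C", "U\xCC\x88" },  /* u-umlaut */
};

/* If `s` starts with any form of a known class, return that class and
   store the length of the matched form in `*len`; otherwise NULL. */
static const char *const *find_class(const char *s, size_t *len) {
    for (size_t c = 0; c < sizeof kClasses / sizeof kClasses[0]; c++) {
        for (size_t f = 0; f < FORMS; f++) {
            size_t n = strlen(kClasses[c][f]);
            if (strncmp(s, kClasses[c][f], n) == 0) {
                *len = n;
                return kClasses[c];
            }
        }
    }
    return NULL;
}

/* Expand `needle` into an extended regex that lists all variants.
   Returns 0 on success, -1 if `out` is too small. */
static int build_pattern(const char *needle, char *out, size_t cap) {
    size_t used = 0;
    if (cap == 0) return -1;
    out[0] = '\0';
    while (*needle) {
        char piece[64];
        size_t step = 1;
        const char *const *cls = find_class(needle, &step);
        unsigned char ch = (unsigned char)*needle;
        if (cls) {                              /* known multi-form character */
            snprintf(piece, sizeof piece, "(%s|%s|%s|%s)",
                     cls[0], cls[1], cls[2], cls[3]);
        } else if (isalpha(ch)) {               /* plain ASCII letter */
            snprintf(piece, sizeof piece, "[%c%c]", tolower(ch), toupper(ch));
        } else if (strchr(".[]()*+?{}|^$\\", *needle)) {
            snprintf(piece, sizeof piece, "\\%c", *needle);  /* escape */
        } else {                                /* any other byte, as is */
            snprintf(piece, sizeof piece, "%c", *needle);
        }
        size_t len = strlen(piece);
        if (used + len + 1 > cap) return -1;
        memcpy(out + used, piece, len + 1);
        used += len;
        needle += step;
    }
    return 0;
}

/* Returns 1 if any variant of `needle` occurs in `data`, 0 if not,
   -1 on error. `data` must be NUL-terminated in this sketch. */
static int contains(const char *needle, const char *data) {
    char pattern[256];
    regex_t re;
    if (build_pattern(needle, pattern, sizeof pattern) != 0) return -1;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return -1;
    int found = regexec(&re, data, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

With this table, `contains("\xC3\x84", "<a\xCC\x88>")` (precomposed Ä searched in data holding decomposed ä) returns 1, which is one of the cases the plain `REG_ICASE` search above misses. Note that `regexec` stops at the first NUL byte, so truly binary data would still need to be searched segment by segment, or with the non-POSIX `REG_STARTEND` flag where the platform provides it. Diacritics-insensitive matching is not covered, in line with the answer.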
