How can I case fold a string in Rust?

Question

I'm writing a simple full text search library, and need case folding to check if two words are equal. For this use case, the existing .to_lowercase() and .to_uppercase() methods are not enough.
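
As a rough illustration of why lowercasing alone falls short, here is a minimal sketch using only the standard library. The helper name lowercase_eq and the German example words are assumptions chosen for illustration; they are not part of the original question.

```rust
/// Hypothetical helper: compares two strings after lowercasing both.
/// This is the naive approach, not real Unicode case folding.
fn lowercase_eq(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    // Plain ASCII words compare fine after lowercasing.
    assert!(lowercase_eq("Hello", "HELLO"));

    // 'ß' has no single-character uppercase form, so "STRASSE" is the
    // usual uppercase spelling of "Straße". Lowercasing keeps 'ß' as-is
    // on one side and yields "ss" on the other, so the comparison fails.
    assert_eq!("Straße".to_lowercase(), "straße");
    assert_eq!("STRASSE".to_lowercase(), "strasse");
    assert!(!lowercase_eq("Straße", "STRASSE"));
}
```

Unicode case folding maps 'ß' to "ss", so a fold-based comparison would treat the two spellings as equal.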

From a quick search of crates.io, I can find libraries for normalization and word splitting but not case folding. regex-syntax does have case folding code, but it's not exposed in its API.

Recommended answer

For my use case, I've found the caseless crate to be most useful.

As far as I know, this is the only library which supports normalization. This is important when you want e.g. "㎒" (U+3392 SQUARE MHZ) and "mhz" to match. See Chapter 3 - Default Caseless Matching in the Unicode Standard for details on how this works.
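
To see why normalization matters for this example, here is a small standard-library sketch. The helper naive_match is a hypothetical name used only for illustration, and it deliberately skips normalization.

```rust
/// Hypothetical helper: lowercases both sides, with no Unicode normalization.
fn naive_match(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    let square = "㎒";

    // U+3392 is a single code point, not the three letters "MHz".
    assert_eq!(square.chars().count(), 1);
    assert_eq!(square.chars().next(), Some('\u{3392}'));

    // The spelled-out form matches once case is ignored...
    assert!(naive_match("MHz", "mhz"));

    // ...but the square symbol does not, because nothing has
    // decomposed it into its compatibility equivalent "MHz".
    assert!(!naive_match(square, "mhz"));
}
```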

Here's some example code that matches a string case-insensitively:

extern crate caseless;

fn main() {
    let a = "100 ㎒";
    let b = "100 mhz";

    // These strings don't match with just case folding,
    // but do match after compatibility (NFKD) normalization
    assert!(!caseless::default_caseless_match_str(a, b));
    assert!(caseless::compatibility_caseless_match_str(a, b));
}

To get the case folded string directly, you can use the default_case_fold_str function:

let s = "Twilight Sparkle ちゃん";
assert_eq!(caseless::default_case_fold_str(s), "twilight sparkle ちゃん");

The caseless crate doesn't expose a corresponding function that also applies compatibility normalization, but you can write one using the unicode-normalization crate:

extern crate caseless;
extern crate unicode_normalization;
use caseless::Caseless;
use unicode_normalization::UnicodeNormalization;

// Case folds a string with compatibility (NFKD) normalization applied.
fn compatibility_case_fold(s: &str) -> String {
    s.nfd().default_case_fold().nfkd().default_case_fold().nfkd().collect()
}

fn main() {
    let a = "100 ㎒";
    assert_eq!(compatibility_case_fold(a), "100 mhz");
}

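Note that multiple rounds of normalization and case folding are needed for a correct result. The chain in compatibility_case_fold follows the Unicode definition of compatibility caseless matching, NFKD(toCasefold(NFKD(toCasefold(NFD(X))))).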

(Thanks to BurntSushi5 for pointing me to this library.)
