lower() vs. casefold() in string matching and converting to lowercase


Question

How do I do a case-insensitive string comparison?

From what I understood from Google and the link above, both lower() and casefold() convert a string to lowercase, but casefold() goes further and also converts letters such as the German ß, which is already lowercase, to ss.

All of that is about special letters such as ß, but my question is more general:

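  • Are there any other differences?
  • Which one is better for converting to lowercase?
  • Which one is better for checking whether strings match?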

Part 2:

firstString = "der Fluß"
secondString = "der Fluss"

# ß is equivalent to ss
if firstString.casefold() == secondString.casefold():
    print('The strings are equal.')
else:
    print('The strings are not equal.')

In the example above, should I use:

lower()  # the strings compare as not equal, which makes sense to me

Or:

casefold()  # ß becomes ss, so the strings compare as equal
            # (as a beginner, this still does not make sense to me;
            # I see different strings)

Recommended answer

TL;DR

  • Pure ASCII text -> lower()
  • Unicode text / user input -> casefold()

Casefolding is a more aggressive version of lower(), designed to make many of the less common Unicode characters comparable. It is a form of text normalization: strings that initially look quite different can fold to the same value, because it takes the characters of many different languages into account.
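To make "more aggressive" concrete, here is a small sketch that puts the two methods side by side. The helper name compare_foldings and the sample strings are only illustrative choices, not something from the question.

```python
def compare_foldings(text):
    """Return (text.lower(), text.casefold()) so both can be inspected together."""
    return text.lower(), text.casefold()


# Illustrative samples: German sharp s, plain ASCII,
# Greek final sigma (U+03C2) and the "fi" ligature (U+FB01).
samples = ["Fluß", "HELLO", "\u03c2", "\ufb01"]

for sample in samples:
    lowered, folded = compare_foldings(sample)
    marker = "same" if lowered == folded else "DIFFERENT"
    print(f"{sample!r}: lower()={lowered!r} casefold()={folded!r} [{marker}]")
```

For the ASCII sample both methods agree. For the other three, lower() leaves the special character alone, while casefold() rewrites it to a form meant for comparison.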

I suggest you take a closer look at what case folding actually is; a good place to start is the W3 Case Folding Wiki.

To answer your other two questions: if your text is strictly ASCII (plain English using only the basic 26-letter alphabet), lower() and casefold() yield exactly the same results. However, if you are normalizing text from languages that use more than those 26 letters, I would use casefold() to compare your strings, as it yields more consistent results.
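As a rough sketch of that advice, a comparison helper could look like the following. The name equals_ignore_case is made up for illustration. The last lines check that the two methods agree on every ASCII character.

```python
def equals_ignore_case(a, b):
    """Compare two strings caselessly by casefolding both sides first."""
    return a.casefold() == b.casefold()


first_string = "der Fluß"
second_string = "der Fluss"

# lower() keeps the ß, so the strings stay different.
print(first_string.lower() == second_string.lower())
# casefold() maps ß to ss, so the strings match.
print(equals_ignore_case(first_string, second_string))

# For every ASCII character, lower() and casefold() give the same result.
ascii_agrees = all(chr(c).lower() == chr(c).casefold() for c in range(128))
print(ascii_agrees)
```

Whether "der Fluß" and "der Fluss" should count as equal depends on the application. casefold() is for matching, not for producing text to display.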

Another source: Elastic.co Case Folding

I also recently found another very good related answer to a slightly different question here on SO (doing a case-insensitive string comparison).

@Voo's comments have been bouncing around in the back of my mind for a few months, so here are some further thoughts:

As Voo mentioned, there aren't any languages that never use text outside the standard ASCII range. That's pretty much why Unicode exists. With that in mind, it makes more sense to me to use casefold() on anything user-entered that can contain non-ASCII values. That may leave out some text, such as data from a database that deals strictly with ASCII, but in general most user input should be handled with casefold(), because it has the logic to fold the case of all characters properly.

On the other hand, values that are known to be generated within the ASCII character space, such as hex UUIDs, can be normalized with lower() because it is the simpler transformation. On ASCII-only input the two methods give the same result, and only the 26 letters A-Z ever change, so lower() is sufficient and may be marginally cheaper. Similarly, if you know that your data comes from a source restricted to ASCII, you can just use lower(). Be careful with that assumption, though: a CHAR or VARCHAR column in SQL Server is not limited to ASCII and can hold non-ASCII characters from its code page, so check the actual data rather than relying on the column type alone.
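A minimal sketch of the ASCII-only case, assuming Python 3.7+ for str.isascii(). The function name normalize_hex_id and the sample UUID are hypothetical. The guard makes the "known to be ASCII" assumption explicit instead of silently trusting it.

```python
def normalize_hex_id(value):
    """Lowercase an identifier that is expected to be pure ASCII (e.g. a hex UUID)."""
    if not value.isascii():
        # The ASCII assumption does not hold; fail loudly rather than guess.
        raise ValueError("expected ASCII-only input")
    return value.lower()


raw_id = "550E8400-E29B-41D4-A716-446655440000"  # sample value for illustration
print(normalize_hex_id(raw_id))

# On ASCII input, lower() and casefold() are interchangeable.
print(raw_id.lower() == raw_id.casefold())
```

If the guard ever fires, the data is not what you assumed, and casefold() is probably the better choice for that source.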

So really, this question comes down to knowing the source of your data. When in doubt about user-entered information, just casefold().
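As one possible illustration of the "when in doubt, casefold()" rule, here is a sketch of a case-insensitive lookup over user-entered names. build_index and find are hypothetical helpers invented for this example. The keys are casefolded for matching, while the original spelling is kept as the value.

```python
def build_index(names):
    """Map each casefolded name to its original spelling."""
    return {name.casefold(): name for name in names}


def find(index, query):
    """Look up a user-entered query caselessly; return None if there is no match."""
    return index.get(query.casefold())


index = build_index(["Straße", "Alice"])

print(find(index, "STRASSE"))
print(find(index, "alice"))
print(find(index, "bob"))
```

Storing the folded form only as the key keeps the user's original text intact for display. Note that two names that fold to the same key would overwrite each other in this simple sketch.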
