How to Remove Diacritic Marks (such as Accents) using Unicode Normalization in Standard SQL?


Question

How can we remove diacritic marks from strings in BigQuery using the new NORMALIZE function? For example, given the input:

café

The expected result is:

cafe

Recommended Answer

The Short Answer

It's actually quite simple once you understand what NORMALIZE is doing:

WITH data AS(
  SELECT 'Ãâíüçãõ' AS text
)

SELECT
  REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '') nfd_result,
  REGEXP_REPLACE(NORMALIZE(text, NFKD), r'\pM', '') nfkd_result
FROM data

Result:

Row   nfd_result    nfkd_result  
1     Aaiucao       Aaiucao  

You can use either NFD or NFKD and, for the most part, both will work. You should still understand the difference between the two options to handle your data properly.

What NORMALIZE basically does is convert every character in a string to its canonical equivalent (or compatibility equivalent) form, so that strings can be compared against a common reference. Understanding this requires a few concepts.

The point is that Unicode not only establishes the mapping between numbers (code points, written with the U+ prefix) and their glyphs, but also defines rules for how these code points interact with one another.

For instance, take the glyph á.

There is more than one way to encode this character. We can represent it either as the single code point U+00E1 or as the sequence U+0061 U+0301, which are the code points for a and the combining acute accent ´.
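To check the two representations outside BigQuery, here is a small sketch using Python's standard `unicodedata` module. Python is only an assumption for illustration; the answer itself uses BigQuery SQL. The sketch looks up the names of the code points involved and the canonical decomposition that Unicode records for U+00E1.

```python
import unicodedata

precomposed = "\u00e1"   # á as a single code point
combining_acute = "\u0301"  # the accent on its own

# Official Unicode names of the code points involved.
precomposed_name = unicodedata.name(precomposed)
mark_name = unicodedata.name(combining_acute)

# Canonical decomposition mapping stored for U+00E1 (hex code points).
mapping = unicodedata.decomposition(precomposed)

print(precomposed_name)  # LATIN SMALL LETTER A WITH ACUTE
print(mark_name)         # COMBINING ACUTE ACCENT
print(mapping)           # 0061 0301
```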

Yes! Unicode is defined so that you can combine base characters with diacritics and represent the result simply by placing one code point after the other.

In fact, you can experiment with this in an online Unicode converter.

Unicode calls single code points such as U+00E1, which are equivalent to a base character followed by diacritics, precomposed characters. Combination is governed by a clever and simple idea: every code point has a canonical combining class. Base characters (including precomposed ones) have combining class 0 (zero); code points that combine with a base receive a positive combining class (for instance, the combining acute accent U+0301 has class 230), which determines how marks are ordered and positioned in the final glyph.
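The combining classes mentioned above can be inspected directly. The following is a minimal sketch in Python (an assumption for illustration, not part of the BigQuery solution) using `unicodedata.combining`, which returns the canonical combining class of a character. The combining cedilla U+0327 is added here only as a second example of a mark.

```python
import unicodedata

# a, precomposed á, combining acute accent, combining cedilla
samples = ("a", "\u00e1", "\u0301", "\u0327")

# Map each code point (formatted as U+XXXX) to its canonical combining class.
classes = {f"U+{ord(c):04X}": unicodedata.combining(c) for c in samples}

print(classes)
# {'U+0061': 0, 'U+00E1': 0, 'U+0301': 230, 'U+0327': 202}
```

Base characters report class 0, while the two combining marks report positive classes.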

This is quite cool, but it creates the problem that explains the NORMALIZE function we've been discussing since the beginning: if we read two strings, one containing U+0061 U+0301 and the other containing U+00E1 (both á), they should be considered equivalent. It is the same glyph represented in two different ways.

This is precisely what NORMALIZE does. Unicode defines a canonical form for each character so that, after normalization, two strings that use different code points for the same glyph compare as equal.
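The comparison problem and its fix can be sketched as follows. This is a Python illustration under the assumption that `unicodedata.normalize` behaves like BigQuery's NORMALIZE for these inputs; both implement the Unicode normalization forms.

```python
import unicodedata

precomposed = "\u00e1"   # á as one code point
decomposed = "a\u0301"   # a + combining acute accent

# A plain comparison sees two different code point sequences.
raw_equal = precomposed == decomposed

# Normalizing both sides to the same form makes them comparable.
normalized_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(len(precomposed), len(decomposed))  # 1 2
print(raw_equal)                          # False
print(normalized_equal)                   # True
```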

There are basically two ways to normalize code points: composing several code points into one (in our example, transforming U+0061 U+0301 into U+00E1) or decomposing (the other way around, transforming U+00E1 into U+0061 U+0301).

These two directions correspond to the normalization forms NFC and NFD. NF stands for normalization form. NFC returns the canonical composed form (code points united where possible); NFD does the opposite and decomposes each character.

You can use this information to play around in BigQuery:

WITH data AS(
  SELECT 'Amélie' AS text
)

SELECT
  text,
  TO_CODE_POINTS(NORMALIZE(text, NFC)) nfc_result,
  TO_CODE_POINTS(NORMALIZE(text, NFD)) nfd_result
FROM data

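Which results in:

Row   text     nfc_result                      nfd_result
1     Amélie   [65,109,233,108,105,101]        [65,109,101,769,108,105,101]

Notice that the nfd_result column has one more code point. By now you already know what it is: the combining acute accent (769, that is, U+0301) separated from the e (101).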
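The same code points can be reproduced outside BigQuery. Here is a minimal Python sketch (an illustrative assumption, using `ord` in place of TO_CODE_POINTS) for the string Amélie:

```python
import unicodedata

text = "Am\u00e9lie"  # Amélie, with é written as U+00E9

# Decimal code points after each normalization form.
nfc_points = [ord(c) for c in unicodedata.normalize("NFC", text)]
nfd_points = [ord(c) for c in unicodedata.normalize("NFD", text)]

print(nfc_points)  # [65, 109, 233, 108, 105, 101]
print(nfd_points)  # [65, 109, 101, 769, 108, 105, 101]
```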

If you read BigQuery's documentation for NORMALIZE, you'll see it also supports the forms NFKC and NFKD. These forms (with the letter K) do not normalize by canonical equivalence but by "compatibility"; that is, they also break some characters into their constituent letters, not only into base letters and diacritics.

For example, the character ﬃ (U+FB03, a single code point, which is not the same as the three letters ffi; this kind of character is known as a ligature) is also decomposed into the letters that constitute it. Strict equivalence is lost in the process, since ﬃ may not mean the same as ffi for some applications, hence the name compatibility form.
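The difference between canonical and compatibility decomposition can be seen on the ligature itself. This is a short Python sketch (an assumption for illustration; the form names match those accepted by BigQuery's NORMALIZE):

```python
import unicodedata

ligature = "\ufb03"  # ﬃ, LATIN SMALL LIGATURE FFI

# Canonical decomposition leaves the ligature untouched.
nfd = unicodedata.normalize("NFD", ligature)

# Compatibility decomposition splits it into three plain letters.
nfkd = unicodedata.normalize("NFKD", ligature)

print(nfd == ligature)  # True
print(nfkd, len(nfkd))  # ffi 3
```

So NFD and NFKD give the same result for accented letters, but only NFKD rewrites ligatures and similar compatibility characters.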

Now that we know how to decompose characters into the base glyph followed by its diacritics, we can use a regular expression to match only the diacritics and remove them from the string. This is accomplished by the expression \pM, which matches code points in the Unicode Mark category (combining marks):

WITH data AS(
  SELECT 'café' AS text
)

SELECT
  REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '') nfd_result
FROM data
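For comparison, the same decompose-then-strip approach can be sketched in Python. The function name `remove_diacritics` is hypothetical, and because Python's built-in `re` module does not support `\pM`, this sketch filters on `unicodedata.category` instead, dropping every code point whose general category starts with M (Mark).

```python
import unicodedata


def remove_diacritics(text: str, form: str = "NFD") -> str:
    """Decompose text, then drop all code points in the Mark category."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )


print(remove_diacritics("caf\u00e9"))  # cafe
print(remove_diacritics("Ãâíüçãõ"))    # Aaiucao
```

Note that this only removes marks that normalization can separate. A letter such as ø (U+00F8) has no decomposition and is returned unchanged.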

And that's pretty much all there is (hopefully) to the NORMALIZE function and how it's used to remove diacritics. I found all this information thanks to user sigpwned and his answer to this question. When I tried it and it didn't quite work, I decided to study some of the theory behind the methods and wanted to write it down :). Hopefully it'll be useful for more people, as it definitely was for me.
