BigQuery REGEXP_MATCH and accents: boundary wildcard fails?

Problem description

In Google Apps Script (GAS) I can correctly match accented characters with a regular expression that uses boundary characters, such as \bà\b. The character à is matched only when it is a separate word. This works in GAS:

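function test_regExp() {
  var str = "la séance est à Paris";
  // Named `pattern` so it does not shadow the built-in RegExp constructor.
  var pattern = "\\bà\\b";
  var patReg = new RegExp(pattern);
  var found = patReg.exec(str);
  if (found) {
    // Log the text before the match, the match itself, and the text after it.
    Logger.log([str.substring(0, found.index), found[0], str.substring(found[0].length + found.index)]);
  } else {
    Logger.log("oops! Did not match");
  }
}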

In BigQuery, if boundary characters are next to accents the patterns do not match. \bséance\b matches séance:

SELECT [row],etext,ftext FROM [hcd.hdctextx] WHERE (REGEXP_MATCH(ftext,"\\bséance\\b") ) LIMIT 100;

\bà\b does not match à as a word:

SELECT [row],etext,ftext FROM [hcd.hdctextx] WHERE (REGEXP_MATCH(ftext,"\\bà\\b") ) LIMIT 100;

I'm assuming that BigQuery, unlike GAS, is including accents in the boundary character set. So \bséance\b works because é can function properly as a boundary in that configuration. \bà\b or \bétranger\b or \bmarché\b do not work because accent + \b is interpreted as \b\b, which never matches anything. (Ok, I'm grasping at straws here, because I can't find a better explanation....besides a bug.)

I don't think it is a Unicode problem, because it only crops up at boundary positions.

So for the moment, there is no way to use \b in those particular configurations of accents.

Is there a way to set the Locale in BigQuery or other fix?

Workaround: substitute (?:[^a-zA-Zéàïëâê]) and so on for \b.

Thanks!

Solution

BigQuery's behavior is correct with respect to the RE2 syntax documentation. (No surprise, because BigQuery uses RE2 to implement regexps.)

The relevant RE2 definitions are:

\b = at word boundary (\w on one side and \W, \A, or \z on the other)
\w = word characters (≡ [0-9A-Za-z_])
\W = not word characters (≡ [^0-9A-Za-z_])
\A = beginning of text
\z = end of text
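
As a rough illustration of these definitions, the sketch below re-implements the ASCII-only boundary test in plain JavaScript. It does not call RE2 or BigQuery; the helper names (isAsciiWordChar, isAsciiBoundary, matchesWithAsciiBoundaries) are hypothetical and exist only to show why \bséance\b can match while \bà\b, \bétranger\b and \bmarché\b cannot.

```javascript
// Hypothetical helpers (not part of BigQuery or RE2): they re-implement
// the ASCII-only definitions quoted above, purely for illustration.

// \w per the definition above: [0-9A-Za-z_]
function isAsciiWordChar(ch) {
  return ch !== undefined && /^[0-9A-Za-z_]$/.test(ch);
}

// \b at position pos: a \w character on exactly one side.
// The start and end of the text count as "not a word character".
function isAsciiBoundary(text, pos) {
  var left = pos > 0 ? text[pos - 1] : undefined;
  var right = pos < text.length ? text[pos] : undefined;
  return isAsciiWordChar(left) !== isAsciiWordChar(right);
}

// True if `word` occurs in `text` with an ASCII word boundary on both
// sides, which is what \bword\b requires under these definitions.
function matchesWithAsciiBoundaries(text, word) {
  if (word.length === 0) return false;
  var from = 0;
  while (true) {
    var start = text.indexOf(word, from);
    if (start === -1) return false;
    if (isAsciiBoundary(text, start) &&
        isAsciiBoundary(text, start + word.length)) {
      return true;
    }
    from = start + 1;
  }
}
```

With "la séance est à Paris", the word séance starts with s and ends with e, both ASCII word characters next to spaces, so both boundaries exist. The à sits between two spaces, and since à is not in [0-9A-Za-z_], there is no \w on either side and no boundary.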

In other words, \b only sees a boundary next to an ASCII word character, so it cannot delimit a word that starts or ends with an accented letter. RE2 has plenty of support for Unicode characters, though, so you can most likely craft an alternative regexp using something like \pL.
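
One way the \pL idea might look is to replace each \b with "start of text or a non-letter" on the left and "a non-letter or end of text" on the right. The sketch below shows this in JavaScript, assuming an engine that supports the `u` flag and `\p{L}` (for example the V8 runtime in Apps Script). The helper name unicodeWordPattern is hypothetical. In RE2 syntax the same shape would presumably be written (?:^|[^\pL])à(?:[^\pL]|$), which I have not run against BigQuery.

```javascript
// Hypothetical helper: wraps a literal word in letter-aware "boundaries"
// instead of \b. Assumes the engine supports the `u` flag and \p{L}.
function unicodeWordPattern(word) {
  // Escape regex metacharacters so the word is matched literally.
  var escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Left side: start of text or any non-letter.
  // Right side: any non-letter or end of text.
  return new RegExp(
    "(?:^|[^\\p{L}])" + escaped + "(?:[^\\p{L}]|$)",
    "u"
  );
}

// Example usage: true when à appears as a separate word.
var foundStandalone = unicodeWordPattern("à").test("la séance est à Paris");
```

Unlike \b, these groups consume the neighbouring character, so the match text includes it, and two occurrences separated by a single space cannot both be found in one global scan. Note also that \pL covers letters only. If digits and underscores should count as word characters too, the class would need to be widened accordingly.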

I'm not sure why Google Apps Script doesn't follow the RE2 spec here, but I'll follow up with that team to figure out what's going on.
