How can I query for text containing Asian-language characters in MySQL?


Question

I have a MySQL table using the UTF-8 character set with a single column called WORDS of type longtext. Values in this column are typed in by users and are a few thousand characters long.

There are two types of rows in this table:

  1. In some rows, the WORDS values are written by English speakers and contain only characters used in ordinary English writing. (Not necessarily all ASCII; for example, the euro symbol could appear in some cases.)

  2. Other rows have WORDS values written by speakers of Asian languages (Korean, Chinese, Japanese, and possibly others), which include a mix of English words and words in the Asian languages using their native logographic characters (and not, for example, Japanese romaji).

How can I write a query that will return all the rows of type 2 and no rows of type 1? Alternatively, if that's hard, is there a way to query most such rows (here it's OK if I miss a few rows of type 2, or include a few false positives of type 1)?

Update: Comments below suggest I might do better to avoid the MySQL query engine altogether, as its regex support for unicode doesn't sound too good. If that's true, I could extract the data into a file (using mysql -B -e "some SQL here" > extract.txt) and then use perl or similar on the file. An answer using this method would be OK (but not as good as a native MySQL one!)
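If the data is pulled out with the mysql one-liner above, the filtering itself is simple in any scripting language. A minimal Python sketch (the function names and sample data are illustrative), testing the U+3000-U+9FFF range the answer below settles on:

```python
# A sketch of the file-based fallback: keep only lines containing a
# character in the CJK-ish range U+3000-U+9FFF. "extract.txt" is the
# illustrative file name from the mysql command above.

def has_cjk(text):
    """True if any character falls in U+3000-U+9FFF."""
    return any('\u3000' <= ch <= '\u9fff' for ch in text)

def filter_cjk_lines(lines):
    """Keep only the lines that contain at least one such character."""
    return [line for line in lines if has_cjk(line)]

sample = ["hello world", "日本語のテキスト", "mixed English and 中文"]
print(filter_cjk_lines(sample))  # ['日本語のテキスト', 'mixed English and 中文']
```

Running this over the real extract would just mean passing `open('extract.txt', encoding='utf-8')` in place of the sample list.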

Answer

In theory you could do this:

  1. Find the unicode range you want to test for.
  2. Manually encode the start and end into UTF-8.
  3. Use the first byte of each of the encoded start and end as the range for a REGEXP.
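The three steps above can be sketched in Python (the endpoints here anticipate the \u3000-\u9fff range chosen below; `.encode` stands in for the manual encoding):

```python
# Sketch of the three steps: choose a unicode range, UTF-8 encode its
# endpoints, and take the first byte of each encoding as the bounds of
# a REGEXP character class.

start, end = '\u3000', '\u9fff'      # step 1: the range to test for

start_bytes = start.encode('utf-8')  # step 2: b'\xe3\x80\x80'
end_bytes = end.encode('utf-8')      #         b'\xe9\xbf\xbf'

# step 3: the first byte of each encoding bounds the byte range
first, last = start_bytes[0], end_bytes[0]
print(hex(first), hex(last))  # 0xe3 0xe9
```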

I believe that the CJK range is far enough removed from things like the euro symbol that the false positives and false negatives would be few or none.

Edit: We've now put theory into practice!

Step 1: Choose the character range. I suggest \u3000-\u9fff; easy to test for, and should give us near-perfect results.

Step 2: Encode as bytes. (Wikipedia utf-8 page)

For our chosen range, utf-8 encoded values will always be 3 bytes, the first of which is 1110xxxx, where xxxx is the most significant four bits of the unicode value.

Thus, we want to match bytes in the range 11100011 to 11101001, or 0xe3 to 0xe9.

Step 3: Make our regexp using the very handy (and just now discovered by me) UNHEX function.

SELECT * FROM `mydata`
WHERE `words` REGEXP CONCAT('[',UNHEX('e3'),'-',UNHEX('e9'),']')
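The same byte-range match can be reproduced outside MySQL; a hedged Python sketch (function name illustrative) that searches the raw UTF-8 bytes the way the REGEXP above effectively does:

```python
import re

# The query above matches any single byte in 0xe3-0xe9 somewhere in the
# UTF-8 encoded column; the equivalent check on raw bytes in Python:
pattern = re.compile(b'[\xe3-\xe9]')

def matches_cjk_bytes(text):
    """True if the UTF-8 encoding contains a byte in 0xe3-0xe9."""
    return pattern.search(text.encode('utf-8')) is not None

print(matches_cjk_bytes('中文と日本語'))   # True
print(matches_cjk_bytes('plain English'))  # False
print(matches_cjk_bytes('price: €10'))     # False (euro leads with 0xe2)
```

Only the lead bytes of three-byte sequences can fall in 0xe3-0xe9 (continuation bytes are 0x80-0xbf, and two-byte lead bytes stop at 0xdf), which is why the false-positive rate of this trick stays low.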

Just tried it out. Works like a charm. :)

