How can I check if a binary string is UTF-8 in MySQL?
Question
I've found a Perl regexp that can check whether a string is valid UTF-8 (the regexp is from the W3C site).
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
But I'm not sure how to port it to MySQL, as it seems that MySQL doesn't support hex representation of characters in regular expressions (see this question).
Any thoughts on how to port the regexp to MySQL? Or maybe you know another way to check whether a string is valid UTF-8?
UPDATE: I need this check to work inside MySQL, because I have to run it on the server to correct broken tables. I can't pass the data through an external script, as the database is around 1TB.
Recommended answer
I managed to repair my database using a test that works only if your data can be represented in a one-byte encoding; in my case that was latin1.
I relied on the fact that MySQL changes bytes that aren't valid UTF-8 to '?' when converting to latin1, so a value that does not survive the round trip from UTF-8 to latin1 and back is flagged as invalid.
Here is what the check looks like:
SELECT (
    CONVERT(
        CONVERT(
            potentially_broken_column
        USING latin1)
    USING utf8)
    !=
    potentially_broken_column) AS INVALID ....
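If the data does not fit a one-byte encoding, one possible way around the missing hex escapes is to rewrite the W3C pattern over pairs of hex digits instead of raw bytes, since hex digits are plain ASCII. The sketch below builds such a pattern and checks it in Python; the names `HEX_UTF8_PATTERN` and `is_valid_utf8` are hypothetical and not part of the original answer. In MySQL the same pattern string could presumably be applied as `HEX(potentially_broken_column) REGEXP '...'`, but that is an assumption and has not been tested against a real server or at the 1TB scale mentioned in the question.

```python
import re

# One continuation byte (0x80-0xBF) written as two uppercase hex digits.
CONT = "[89AB][0-9A-F]"

# Each entry mirrors one alternative of the W3C byte-level pattern.
# Repetition is spelled out (CONT * 2) instead of using {2}, so the
# pattern only needs character classes, grouping, alternation and *.
ALTERNATIVES = [
    "0[9AD]", "[2-6][0-9A-F]", "7[0-9A-E]",  # ASCII: 09, 0A, 0D, 20-7E
    "(C[2-9A-F]|D[0-9A-F])" + CONT,          # non-overlong 2-byte
    "E0[AB][0-9A-F]" + CONT,                 # excluding overlongs
    "E[1-9A-CEF]" + CONT * 2,                # straight 3-byte
    "ED[89][0-9A-F]" + CONT,                 # excluding surrogates
    "F0[9AB][0-9A-F]" + CONT * 2,            # planes 1-3
    "F[1-3]" + CONT * 3,                     # planes 4-15
    "F48[0-9A-F]" + CONT * 2,                # plane 16
]

# Hypothetical name: the full anchored pattern over the hex string.
HEX_UTF8_PATTERN = "^(" + "|".join(ALTERNATIVES) + ")*$"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the hex form of `data` matches the ported pattern."""
    # Every alternative consumes whole hex pairs, so matching from the
    # start of the string keeps the pattern aligned on byte boundaries.
    return re.match(HEX_UTF8_PATTERN, data.hex().upper()) is not None


if __name__ == "__main__":
    print(is_valid_utf8("héllo".encode("utf-8")))    # well-formed UTF-8
    print(is_valid_utf8("héllo".encode("latin-1")))  # lone 0xE9 byte
```

Like the original W3C pattern, this rejects ASCII control characters other than tab, line feed and carriage return, so a value containing a NUL byte would be reported as invalid even though it is technically well-formed UTF-8.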