Replace non UTF-8 from String

Question

I have a table whose strings contain non UTF-8 characters, like �. I need to change them so that they get back all their accents and other Latin characters, for example cap� to capó. The field is a VARCHAR.

So far, I have tried: SELECT "Column Name", regexp_replace("Column Name", '[^\w]+','') FROM table

And: CONVERT("Column Name", 'UTF8', 'LATIN1'), but neither works at all.

For instance, the error I get is: "Regexp encountered an invalid UTF-8 character (...)"

I have seen other solutions, but I cannot use them: I am not an administrator, so I cannot alter the table.

Is there any way to achieve this?

Recommended answer

If the database encoding is UTF8, then all your strings will contain only UTF8 characters. They just happen to be different characters than the ones you want.

First, you have to find out which characters are actually in the strings. In the case you show, � is Unicode code point FFFD (in hexadecimal).

So you could use the replace function in PostgreSQL to replace it with ó (Unicode code point F3) like this:

SELECT replace(mycol, E'\uFFFD', E'\u00f3') FROM mytab;

This uses the Unicode character literal syntax of PostgreSQL; remember to prefix every string that contains such escapes with E, which marks it as an extended string literal.

Chances are that the stored character is not really �, because that is the “REPLACEMENT CHARACTER”, which is often used to display characters that cannot be represented.
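
As a side illustration of how a replacement character can come about, here is a minimal Python sketch. It assumes, purely for the sake of the example, that a Latin-1 encoded string was decoded as UTF-8 by a lenient decoder somewhere upstream; nothing in the question confirms that this is what happened to the data.

```python
# Hypothetical illustration: the byte values are an assumption about how
# the data might have been damaged, not something the question confirms.
latin1_bytes = "capó".encode("latin-1")   # b'cap\xf3': ó is the single byte F3

# Decoding those bytes as UTF-8 fails on F3, and a lenient decoder
# substitutes U+FFFD REPLACEMENT CHARACTER for the undecodable byte.
damaged = latin1_bytes.decode("utf-8", errors="replace")

# The original byte is gone at this point; U+FFFD carries no hint of ó.
utf8_of_damaged = damaged.encode("utf-8")  # the bytes EF BF BD follow "cap"

print(repr(damaged), utf8_of_damaged.hex())
```

If the stored bytes really are EF BF BD, the original letter cannot be recovered from the string itself, which is why it has to be put back by hand with replace.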

In that case, use psql and run a query like this to display the hexadecimal UTF-8 contents of your fields:

SELECT mycol::bytea FROM mytab WHERE id = 12345;

From the UTF-8 encoding of the character you can deduce which character it really is and use that in your call to replace.
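
Reading code points out of a hex dump by eye is error-prone, so a small helper can do the decoding. The sketch below is in Python and assumes that psql prints bytea in the hex format (a leading \x followed by hex digits); the function name describe_bytea_hex and the sample value are made up for illustration.

```python
import unicodedata

def describe_bytea_hex(hex_text):
    """Decode psql's hex bytea output (e.g. '\\x636170efbfbd') as UTF-8 and
    return (character, code point, Unicode name) for each character."""
    digits = hex_text.strip()
    if digits.startswith("\\x"):
        digits = digits[2:]          # drop the \x prefix that psql prints
    text = bytes.fromhex(digits).decode("utf-8")
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text]

# Hypothetical sample value: "cap" followed by the bytes EF BF BD.
for row in describe_bytea_hex(r"\x636170efbfbd"):
    print(row)
```

The code point in the second field of each tuple is what goes into the E'\u....' escape of the replace call.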

If you have several such characters, you will need several calls to replace, one per character, to translate them all.
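
Nesting many replace calls by hand gets tedious, so one option is to generate the expression. The Python sketch below builds such a nested expression from a list of (wrong, right) pairs; the helper names are hypothetical, mycol and mytab are the placeholder names used above, and the second pair in the list is invented for illustration. The column name is pasted in verbatim, so it should only ever come from a trusted source.

```python
def pg_unicode_literal(text):
    """Render text as a PostgreSQL E'' literal using \\uXXXX / \\UXXXXXXXX escapes."""
    parts = []
    for ch in text:
        cp = ord(ch)
        parts.append(f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}")
    return "E'" + "".join(parts) + "'"

def nested_replace_sql(column, fixes):
    """Wrap the column in one replace() call per (wrong, right) pair."""
    expr = column
    for wrong, right in fixes:
        expr = (f"replace({expr}, {pg_unicode_literal(wrong)}, "
                f"{pg_unicode_literal(right)})")
    return expr

# Hypothetical fixes: the second pair is made up for illustration.
fixes = [("\ufffd", "\u00f3"), ("\u00c3\u00b1", "\u00f1")]
print("SELECT " + nested_replace_sql("mycol", fixes) + " FROM mytab;")
```

The replacements are applied innermost first, so the order of the pairs matters whenever one replacement could produce text that a later one matches.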
