Handling Unicode sequences in PostgreSQL

Problem description

I have some JSON data stored in a JSON (not JSONB) column in my PostgreSQL database (9.4.1). Some of these JSON structures contain Unicode sequences in their attribute values. For example:

{"client_id": 1, "device_name": "FooBar\ufffd\u0000\ufffd\u000f\ufffd" }

When I try to query this JSON column (even if I'm not directly trying to access the device_name attribute), I get the following error:

ERROR: unsupported Unicode escape sequence
Detail: \u0000 cannot be converted to text.

You can recreate this error by executing the following command on a PostgreSQL server:

select '{"client_id": 1, "device_name": "FooBar\ufffd\u0000\ufffd\u000f\ufffd" }'::json->>'client_id'

The error makes sense to me - there is simply no way to represent the Unicode sequence NULL in a textual result.

Is there any way for me to query the same JSON data without having to perform "sanitation" on the incoming data? These JSON structures change regularly, so scanning for a specific attribute (device_name in this case) would not be a good solution, since there could easily be other attributes that might hold similar data.

After some more investigation, it seems that this behavior is new as of version 9.4.1, as mentioned in the changelog:

...Therefore \u0000 will now also be rejected in json values when conversion to de-escaped form is required. This change does not break the ability to store \u0000 in json columns so long as no processing is done on the values...

Was this really the intention? Is a downgrade to pre-9.4.1 a viable option here?

As a side note, this property is taken from the name of the client's mobile device - it's the user that entered this text into the device. How on earth did a user insert NULL and REPLACEMENT CHARACTER values?!

Recommended answer

\u0000 is the one Unicode code point that is not valid in a string. I see no other way than to sanitize the string.

Since json is just a string in a specific format, you can use the standard string functions without worrying about the JSON structure. A one-line sanitizer to remove the code point would be:

SELECT (regexp_replace(the_string::text, '\\u0000', '', 'g'))::json;
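
For example, combined with the repro statement from the question, the same extraction should now succeed (a sketch that simply inlines the sample value into the one-liner above):

-- With standard_conforming_strings on (the default), the pattern '\\u0000'
-- matches the literal six-character sequence \u0000 in the stored text,
-- so the offending escape is gone before ->> de-escapes anything.
SELECT (regexp_replace(
         '{"client_id": 1, "device_name": "FooBar\ufffd\u0000\ufffd\u000f\ufffd" }',
         '\\u0000', '', 'g'
       ))::json ->> 'client_id';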

But you can also insert any character of your liking, which would be useful if the zero code point is used as some form of delimiter.
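
A sketch of that variant; the pipe character here is an arbitrary stand-in, not a recommendation:

-- '|' is an arbitrary marker; substitute whatever character your data calls for.
SELECT (regexp_replace(the_string::text, '\\u0000', '|', 'g'))::json;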

Note also the subtle difference between what is stored in the database and how it is presented to the user. You can store the code point in a JSON string, but you have to pre-process it to some other character before processing the value as a json data type.
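
A minimal sketch of that pre-processing as a one-time cleanup, assuming a hypothetical table devices with a json column payload (both names are illustrative):

-- Hypothetical names: "devices" and "payload" are placeholders.
-- Rewrites affected rows once so later queries can process the values as json.
UPDATE devices
SET payload = regexp_replace(payload::text, '\\u0000', '', 'g')::json
WHERE strpos(payload::text, '\u0000') > 0;
-- Caveat (shared with the one-liner above): this also rewrites the tail of a
-- backslash-escaped \\u0000, which denoted a literal backslash plus "u0000".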

