将 json 导入 Postgres 时的编码问题 [英] issue with encoding when importing json into Postgres
问题描述
我正在使用 Pandas,并将数据导出为 json,如下所示:
将pandas导入为pddf = pd.DataFrame({'a': ['Têst']})df.to_json(orient='records',lines=True)>u'{"a":"T\\u00east"}'
这是有道理的,因为我们有一个以 \u
为前缀的 Unicode 字符 00ea
并且在转换为 JSON 时用 \
转义>
但随后我使用 COPY
buffer = cStringIO.StringIO()buffer.write(df.to_json(orient='records',lines=True))缓冲区寻找(0)使用 connection.cursor() 作为游标:cursor.copy_expert(sql="""COPY tmp (json_data) from STDIN WITH NULL AS '' ENCODING 'UTF8';""", 文件=缓冲区)
问题是数据库中的结果最终是
{"a": "Tu00east"}
正如你所看到的那样,双 \\
消失了.
我尝试使用 CSV
作为 COPY 模式,但由于某些数据中有逗号,因此将事情搞砸了,并且尝试将 ESCAPE 字符和 DELIMITER 设置为其他内容似乎总是会导致失败.
表列具有 jsonb
类型.我在文档中读到 PG 不喜欢 \x003f 上的非 ASCII Unicode,除非 DB 编码是 UTF8,就我而言,这应该不是问题.
我想弄清楚为什么这里要删除转义字符,以及如何导入 Postgres 并保存编码.
对 COPY
使用 csv 选项,使用 DELIMITER e'\x01' QUOTE e'\x02'
.我不确定这是否适用于所有可能的有效 JSON,但我从来没有失败过.
$ psql -X testdb -c '创建表 t(d jsonb)'创建表$猫foo.json{一个测试"}$ cat foo.json |psql -X testdb -c "COPY t from stdin csv delimiter e'\x01' quote e'\x02'"复制 1$ psql -X testdb -c 'select * from t';d---------------{一个测试"}(1 行)
I'm using pandas, and exporting data as json like this:
import pandas as pd
df = pd.DataFrame({'a': ['Têst']})
df.to_json(orient='records', lines=True)
> u'{"a":"T\\u00east"}'
This makes sense since we have a Unicode character 00ea
prefixed with \u
and it is escaped with \
when converted to JSON
But then I import the JSON strings into Postgres with COPY
buffer = cStringIO.StringIO()
buffer.write(df.to_json(orient='records', lines=True))
buffer.seek(0)
with connection.cursor() as cursor:
cursor.copy_expert(sql="""
COPY tmp (json_data) FROM STDIN WITH NULL AS '' ENCODING 'UTF8';
""", file=buffer)
The problem is that the result in the database ends up being
{"a": "Tu00east"}
and as you can see the double \\
is gone.
I tried using CSV
as the COPY mode, but it messes things up since there are commas in some of the data, and trying to set ESCAPE character and DELIMITER to something else always seem to cause failures.
The table column has a jsonb
type. I read in the docs that PG doesn't like non-ASCII Unicode over \x003f unless the DB encoding is UTF8, which it is in my case, so that shouldn't be an issue.
I'm trying to figure out why the escaping characters are removed here, and how to import into Postgres and conserve the encoding.
Use the csv option for COPY
, with DELIMITER e'\x01' QUOTE e'\x02'
. I'm not sure whether this works for all possible valid JSON, but I've never had it fail.
$ psql -X testdb -c 'create table t(d jsonb)'
CREATE TABLE
$ cat foo.json
{"a":"Têst"}
$ cat foo.json | psql -X testdb -c "COPY t from stdin csv delimiter e'\x01' quote e'\x02'"
COPY 1
$ psql -X testdb -c 'select * from t';
d
---------------
{"a": "Têst"}
(1 row)
这篇关于将 json 导入 Postgres 时的编码问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!