Issue with encoding when importing JSON into Postgres


Problem description

I'm using pandas, and exporting data as json like this:

import pandas as pd
df = pd.DataFrame({'a': ['Têst']})

df.to_json(orient='records', lines=True)
> u'{"a":"T\\u00east"}'

This makes sense, since we have the Unicode character 00ea prefixed with \u, and the backslash is itself escaped with \ when converted to JSON.

But then I import the JSON strings into Postgres with COPY

import cStringIO

buffer = cStringIO.StringIO()
buffer.write(df.to_json(orient='records', lines=True))
buffer.seek(0)

with connection.cursor() as cursor:
  cursor.copy_expert(sql="""
  COPY tmp (json_data) FROM STDIN WITH NULL AS '' ENCODING 'UTF8';
  """, file=buffer)

The problem is that the result in the database ends up being

{"a": "Tu00east"}

and as you can see the double \\ is gone.

I tried using CSV as the COPY mode, but it messes things up since there are commas in some of the data, and trying to set the ESCAPE character and DELIMITER to something else always seems to cause failures.

The table column has a jsonb type. I read in the docs that PG doesn't like non-ASCII Unicode escapes (above U+007F) unless the DB encoding is UTF8, which it is in my case, so that shouldn't be an issue.

I'm trying to figure out why the escaping characters are removed here, and how to import into Postgres and conserve the encoding.
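
The backslash disappears because COPY's default text format treats the backslash as its own escape character: \u is not one of its recognized sequences, so the backslash is dropped before the value ever reaches the jsonb parser. If you stay with the text format, one workaround is to double every backslash before writing to the buffer; below is a minimal sketch of the same snippet with that single change (the replace call is an addition for illustration, not part of the original code):

import cStringIO

# COPY's text format turns \\ back into a single backslash, so doubling every
# backslash lets the \u00ea escape survive all the way to the jsonb parser.
payload = df.to_json(orient='records', lines=True).replace('\\', '\\\\')

buffer = cStringIO.StringIO()
buffer.write(payload)
buffer.seek(0)

with connection.cursor() as cursor:
  cursor.copy_expert(sql="""
  COPY tmp (json_data) FROM STDIN WITH NULL AS '' ENCODING 'UTF8';
  """, file=buffer)

Since to_json puts each record on one line and \u-escapes non-ASCII and control characters, doubling the backslashes is enough for data like this; values containing literal tabs or newlines would need further escaping for the text format.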

Solution

Use the csv option for COPY, with DELIMITER e'\x01' QUOTE e'\x02'. I'm not sure whether this works for all possible valid JSON, but I've never had it fail.

$ psql -X testdb -c 'create table t(d jsonb)'
CREATE TABLE
$ cat foo.json
{"a":"Têst"}
$ cat foo.json | psql -X testdb -c "COPY t from stdin csv delimiter e'\x01' quote e'\x02'" 
COPY 1
$ psql -X testdb -c 'select * from t';                                                    
       d       
---------------
 {"a": "Têst"}
(1 row)
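
The same options can be applied from Python through copy_expert, mirroring the buffer-based snippet in the question. A sketch assuming the same tmp (json_data jsonb) table and connection object as above:

import cStringIO

buffer = cStringIO.StringIO()
buffer.write(df.to_json(orient='records', lines=True))
buffer.seek(0)

with connection.cursor() as cursor:
  # Raw string so that Postgres, not Python, interprets the \x01 and \x02 escapes.
  cursor.copy_expert(sql=r"""
  COPY tmp (json_data)
  FROM STDIN WITH (FORMAT csv, DELIMITER E'\x01', QUOTE E'\x02');
  """, file=buffer)

Because to_json \u-escapes control characters, neither \x01 nor \x02 can appear in the data, so each line passes through the csv parser untouched and the backslash in \u00ea survives into the jsonb parser.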
