Importing from Oracle using the correct encoding with Python

Question

I apologize for asking a character encoding question, since I know you folks get many every day, but I couldn't figure out my problem, so I asked anyway.

Here is what we are doing:

  1. Take Data from an Oracle DB using Python and cx_Oracle.
  2. Write the data to a file using Python.
  3. Ingest the file into Postgres using Python and psycopg2.

Here are the important Oracle settings:

SQL> select * from NLS_DATABASE_PARAMETERS;

PARAMETER                      VALUE
------------------------------ ----------------------------------------
NLS_LANGUAGE                   AMERICAN
NLS_TERRITORY                  AMERICA
NLS_CURRENCY                   $
NLS_ISO_CURRENCY               AMERICA
NLS_NUMERIC_CHARACTERS         .,
NLS_CHARACTERSET               US7ASCII

According to this NLS_LANG FAQ, you are meant to set NLS_LANG according to the character set your client OS is using.

Running locale gives us: LANG=en_US.UTF-8 (all of the other fields were also en_US.UTF-8).

So, in our Python script, we set it like this:

os.environ["NLS_LANG"] = "AMERICAN_AMERICA.AL32UTF8"

Then we import the data and write it to a file.

rows = cur.fetchall()
fil.write(rows[0][0])  # For this test, I am only writing one row and one field.

We ingest that file into our UTF-8 Postgres DB.

Unfortunately, for some reason, we get this symbol: � in our file and in the subsequent PG table as well. If my understanding is correct, this is the replacement character (U+FFFD). I believe that character is meant to show up if Unicode does not recognize a symbol.

(In some text editors, the symbol shows up as �).

What I don't understand is why this is happening. I thought UTF-8 was backwards compatible with 7-bit ASCII?

And even if we are using regional pages, shouldn't it still work, since the client is using US and the Oracle server is using AMERICAN?

How can I check whether the data was imported correctly and, if it wasn't, how can I fix it so that future imports are correct?

Note: The Oracle field is a CHAR field and not an NCHAR field.

Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.

Thanks for your time, I hope you have a good day.

Recommended answer

Unfortunately, for some reason, we get this symbol: � in our file and in the subsequent PG table as well. If my understanding is correct, this is the replacement character (U+FFFD). I believe that character is meant to show up if Unicode does not recognize a symbol.

Mostly right, but not quite. PostgreSQL refuses to insert text that is not valid UTF-8 when the database uses that encoding (search StackOverflow for "Invalid UTF8 postgresql"). Most likely the character you are seeing is a valid UTF-8 character that your font does not recognize, so the replacement glyph is shown instead. If the symbol is in your Oracle db and is actually the replacement character there, then what do you want to replace it with? If that is the case, the information is already missing.
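To make that distinction concrete, here is a minimal Python 3 sketch (the asker is on Python 2.4, so this is an illustration rather than their code, and the sample byte strings are made up). It shows that the replacement character U+FFFD is itself a valid UTF-8 sequence, which a UTF-8 database would accept, whereas a stray Latin-1 byte is not valid UTF-8.

```python
# U+FFFD REPLACEMENT CHARACTER encodes to three bytes in UTF-8.
REPLACEMENT = "\ufffd"
replacement_bytes = REPLACEMENT.encode("utf-8")
print(replacement_bytes)  # b'\xef\xbf\xbd'


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample values, not taken from the asker's database.
already_replaced = b"caf" + replacement_bytes  # original character already lost
stray_latin1 = b"caf\xe9"                      # 'e' with acute in Latin-1

print(is_valid_utf8(already_replaced))  # True
print(is_valid_utf8(stray_latin1))      # False
```

If the exported file passes a check like this and still shows �, the substitution probably happened before the data reached Postgres.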

What I don't understand is why this is happening. I thought UTF-8 was backwards compatible with 7-bit ASCII?

It is.
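A short Python 3 sketch of what that compatibility does and does not cover. The value `suspect` is hypothetical: it assumes a Windows-1252 byte was stored in a column declared as US7ASCII, which is one way a replacement character can arise, not something the question confirms.

```python
# Pure 7-bit ASCII is byte-for-byte identical in ASCII and UTF-8.
ascii_bytes = b"Hello, Oracle"
assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
assert "Hello, Oracle".encode("utf-8") == ascii_bytes


def is_seven_bit(data):
    """Return True if every byte is inside the 7-bit ASCII range."""
    return all(byte < 0x80 for byte in data)


# Hypothetical stored value: 0x92 is a right single quote in Windows-1252,
# but it has no meaning in 7-bit ASCII.
suspect = b"don\x92t"
print(is_seven_bit(ascii_bytes))  # True
print(is_seven_bit(suspect))      # False

# A lossy conversion substitutes U+FFFD for the byte it cannot map.
lossy = suspect.decode("ascii", errors="replace")
print(ascii(lossy))  # 'don\ufffdt'
```

The compatibility only holds for bytes below 0x80, so any 8-bit data that slipped into the US7ASCII database falls outside it.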

How can I check whether the data was imported correctly and, if it wasn't, how can I fix it so that future imports are correct?

Most likely your problem is upstream of the Oracle db. I would find out what is actually inserting the problem data into the Oracle db and fix it there. If you can check the data in Pg against the data in Oracle, you should be able to determine whether the data is the same character for character (and flag any differences). That's how to check your current import.
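A character-for-character comparison could look roughly like the Python 3 sketch below. The dictionaries stand in for rows fetched from each database and keyed by primary key; the names `oracle_rows`, `pg_rows`, `flag_differences` and `describe` are illustrative, not part of cx_Oracle or psycopg2.

```python
def describe(value):
    """Render each character as a code point so invisible differences show up."""
    return " ".join("U+%04X" % ord(ch) for ch in value)


def flag_differences(oracle_rows, pg_rows):
    """Return (key, oracle_value, pg_value) for every key whose values differ."""
    mismatches = []
    for key in sorted(oracle_rows):
        oracle_value = oracle_rows[key]
        pg_value = pg_rows.get(key)  # None if the row never arrived
        if oracle_value != pg_value:
            mismatches.append((key, oracle_value, pg_value))
    return mismatches


# Hypothetical data standing in for the results of two SELECT statements.
oracle_rows = {1: "plain ascii", 2: "caf\u00e9"}
pg_rows = {1: "plain ascii", 2: "caf\ufffd"}

for key, ora, pg in flag_differences(oracle_rows, pg_rows):
    shown = describe(pg) if pg is not None else "<missing>"
    print(key, describe(ora), "!=", shown)
```

Note that the Oracle client converts characters according to NLS_LANG as it fetches, so a comparison like this is only as trustworthy as the client settings used to read the Oracle side.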

Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.

That's another possibility. Personally, for file transformations I prefer Perl because of its integrated regular expressions and absolutely top-rate PostgreSQL support. However, I recognize that your import routine may not be readily convertible at this point. I am a little more familiar with troubleshooting UTF8 conversion issues in Perl than in Python. I do wonder, however, whether you can check the data that is coming out in binary format for such symbols.
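One way to do that binary check from Python is sketched below (Python 3; `inspect_bytes` and the sample content are hypothetical). It hex-dumps the data, lists the offsets of bytes outside 7-bit ASCII, and counts occurrences of the UTF-8 encoding of U+FFFD.

```python
import binascii

UTF8_REPLACEMENT = b"\xef\xbf\xbd"  # U+FFFD encoded as UTF-8


def inspect_bytes(data):
    """Summarise suspicious content in a chunk of exported data."""
    return {
        "hex": binascii.hexlify(data).decode("ascii"),
        "non_ascii_offsets": [i for i, b in enumerate(data) if b >= 0x80],
        "replacement_count": data.count(UTF8_REPLACEMENT),
    }


# Hypothetical export content; for a real check, read the export file in
# binary mode, e.g. open(path, "rb").read().
sample = b"ab" + UTF8_REPLACEMENT + b"c"
report = inspect_bytes(sample)
print(report["hex"])                # 6162efbfbd63
print(report["non_ascii_offsets"])  # [2, 3, 4]
print(report["replacement_count"])  # 1
```

A non-zero replacement count in the file written in step 2 would mean the character was already substituted by the time Python received it, which points back at the Oracle side or the client conversion rather than at Postgres.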
