Why does Python write one wrongly encoded line every two lines?

I am trying to dump the content of a table column from SQL Server 2K into text files, that I want to later treat with Python and output new text files.

My problem is that I can't get python to use the correct encoding, and while the input files appear fine on my text editor the output ones have broken characters apparently once every two lines.

My python code can be reduced to:

input = open('input', 'r')
string = input.read()
# Do stuff
output = open('output', 'w+')
output.write(string)

Printing this string in the windows shell gives me the expected characters, though separated from each other by one space too many.

But when I open the output file, once every two line everything is broken (though the "added" whitespaces have disappeared)

Some context: To dump the column to files, I'm using this script: spWriteStringTofile which I believe is using the default server encoding.

After some research, it appears that this encoding is SQL_Latin1_General_CP1_CI_AS. I tried adding # -*- coding: latin_1 -*- at the beginning of the script, I tried converting the encoding inside SQL Server to Latin1_General_CI_AS, and I tried string.decode('latin_1').encode('utf8'), but none of it changed a thing (except the last attempt, which output only broken characters).

What can I try ?


EDIT 2: I tried the newFile.write(line.decode('utf-16-be').encode('utf-16-le')) solution, which throws an error on the first line of my file. From the Python debugger (pdb):

(Pdb) print line
ÿþ

(Pdb) print repr(line)
'\xff\xfe\n'
(Pdb) line.decode('utf-16-be').encode('utf-16-le')
*** UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 2: truncated data

Only a newline appears in Sublime Text 2 on this first line...

When I bypass it (try: ... except: pass, quick&dirty), a newline is added between correct and incorrect lines, but the broken characters are still here.
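An aside on that first "line": the ÿþ pair printed by pdb is b'\xff\xfe', the UTF-16 little-endian byte order mark (BOM), followed by a bare \n. A minimal sketch checking it against the stdlib BOM constants:

```python
import codecs

# The first "line" pdb printed: the UTF-16-LE BOM plus a stray newline byte.
first_line = b'\xff\xfe\n'
assert first_line[:2] == codecs.BOM_UTF16_LE
# Its 3-byte length is odd, so decode('utf-16-be') fails with
# "truncated data" exactly as the traceback above shows.
```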


EDIT: I went through the document line by line

newFile = open('newfile', 'a+')
with open('input') as fp:
    for line in fp:
        import pdb
        pdb.set_trace()
        newFile.write(line)

In the pdb, on a faulty line:

(Pdb) print line
                           a s  S o l d D e b i t o r , # <-- Not actual copy paste
(Pdb) print repr(line)
'\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'

However, for some reason I couldn't copy/paste the printed line value: I can copy the individual alphabetic characters, but not when I select the "whitespace" that is between them...
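A small sketch (hypothetical sample, not the original data) of why that happens: in UTF-16-BE, every ASCII character is preceded by a NUL byte, and many terminals render those NULs as the stubborn "gaps" between the letters:

```python
# ASCII text encoded as UTF-16-BE interleaves a NUL byte before each character.
sample = u'as SoldDebitor'.encode('utf-16-be')
print(repr(sample[:8]))  # b'\x00a\x00s\x00 \x00S'
```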


Input:

r <= @Data2 then (case when @Deviza='' or @Deviza=@sMoneda 
    then isnull(Debit,0) else isnull(DevDebit,0) end)
    else 0 end) 
     - Sum(case when DataInr >= @BeginDate and DataInr <= @Data2 
       then  (case when @Deviza='' or @Deviza=@sMoneda 
       then  isnull(Credit,0) else isnull(DevCredit,0) end)
       else 0 end) 
       else 0 end
    as SoldDebitor,

Output:

r <= @Data2 then (case when @Deviza='' or @Deviza=@sMoneda 
            then  isnull(Debit,0) else isnull(DevDebit,0) end)
਍ऀ                       攀氀猀攀   攀渀搀⤀ ഀഀ
      - Sum(case when DataInr >= @BeginDate and DataInr <= @Data2 
            then  (case when @Deviza='' or @Deviza=@sMoneda
            then  isnull(Credit,0) else isnull(DevCredit,0) end)
਍ऀ                       攀氀猀攀   攀渀搀⤀ ഀഀ
        else 0 end
਍ऀ                 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀഀ

Solution

Your corrupted data is UTF-16, using big-endian byte order:

>>> line = '\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'
>>> line.decode('utf-16-be')
u'\t                 as SoldDebitor,\r\n'

but whatever is reading your file again is interpreting the data as UTF-16 in little-endian byte order instead:

>>> print data.decode('utf-16-le')
ऀ                 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀ਀

That's most likely because you didn't include a BOM at the start of the file, or you mangled the input data.

You really should not be reading UTF-16 data in text mode without decoding: newlines encoded in two bytes are almost guaranteed to be mangled, leading to off-by-one byte-order errors, which can make every other line (or nearly every other line) come out broken.
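To see that off-by-one effect concretely, here is a minimal sketch (a made-up two-line sample, not the original dump): splitting UTF-16-LE bytes on the single byte 0x0A, which is effectively what text-mode line iteration does, consumes only half of the two-byte newline, so the next line starts with a leftover NUL and its bytes pair up in the opposite endianness:

```python
text = u'ab\ncd\n'
data = text.encode('utf-16-le')   # b'a\x00b\x00\n\x00c\x00d\x00\n\x00'
chunks = data.split(b'\n')        # naive single-byte line splitting
# The leftover \x00 from the first newline now prefixes the second line...
assert chunks[1] == b'\x00c\x00d\x00'
# ...so its character bytes now decode correctly only as big-endian:
assert chunks[1][:4].decode('utf-16-be') == u'cd'
```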

Use io.open() to read unicode data instead:

import io

with io.open('input', 'r', encoding='utf16') as infh:
    string = infh.read()

# Do stuff

with io.open('output', 'w+', encoding='utf16') as outfh:
    outfh.write(string)

because it appears your input file already has a UTF-16 BOM.
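As a quick illustration of why the bare utf16 codec is the right choice here (a sketch with made-up sample text): when no endianness suffix is given, the codec reads the leading BOM, infers the byte order from it, and strips it from the decoded result:

```python
import codecs

# UTF-16-LE payload with an explicit little-endian BOM prepended.
payload = codecs.BOM_UTF16_LE + u'SoldDebitor'.encode('utf-16-le')
# The endianness-agnostic 'utf-16' codec consumes the BOM and picks LE.
assert payload.decode('utf-16') == u'SoldDebitor'
```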

This does mean the rest of your code needs to be adjusted to handle Unicode strings instead of byte strings as well.
