Reading lines of a text file and getting a charmap decode error


Problem description

I'm using Python 3.3 and a sqlite3 database. I have a large text file, around 270 MB, which I can open with WordPad in Windows 7.

Each line in that file looks as follows:

term\tnumber\n

I want to read every line and save the values in a database. My code looks as follows:

f = open('sorted.de.word.unigrams', "r")
for line in f:

    #code

I was able to read the data into my database, but only up to a certain line, which I would guess is about halfway through the file. Then I get the following error:

  File "C:\projects\databtest.py", line 18, in <module>
    for line in f:
  File "c:\python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 140: character maps to <undefined>

I tried opening the file with encoding='utf-8', but that did not work, and neither did the other codecs I tried. Then I tried to make a copy in WordPad by saving it as a UTF-8 text file, but WordPad crashed.

Where is the problem here? It looks like there is some character in that line that Python can't handle. What can I do to read my file completely? Or is it possible to ignore such errors and just go on with the next line?

You can download the packed file here:

http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=frequency_lists:sorted.de.word.unigrams.7z

Thanks a lot!

Recommended answer

I checked the file, and the root of the problem seems to be that it contains words in at least two encodings: probably cp1252 and cp850. The byte 0x81 is ü in cp850 but undefined in cp1252. You can handle that by catching the exception, but some other German characters are encoded as bytes that are valid in cp1252 yet decode to the wrong character there, so those lines pass silently with incorrect text. If you are happy with such an imperfect solution, here's how you could do it:

with open('sorted.de.word.unigrams', 'rb') as f:  # open in binary mode
    for line in f:
        # try cp1252 first, then fall back to cp850
        for cp in ('cp1252', 'cp850'):
            try:
                s = line.decode(cp)
            except UnicodeDecodeError:
                pass
            else:
                store_to_db(s)  # placeholder for your own database code
                break
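
To make the fallback idea easier to try without the 270 MB file, here is a minimal, self-contained sketch. The helper name decode_with_fallback and the sample bytes are made up for illustration and are not part of the original answer. It also shows one possible way to "ignore" bad bytes, as the question asks: passing errors='replace' to open(), which substitutes U+FFFD for each undecodable byte instead of raising.

```python
import os
import tempfile


def decode_with_fallback(raw, encodings=("cp1252", "cp850")):
    """Return (text, encoding) for the first codec that decodes raw cleanly.

    Hypothetical helper: same idea as the loop in the answer above.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Only reached if every codec in the list rejected the bytes.
    return raw.decode(encodings[0], errors="replace"), encodings[0]


# Made-up sample line: "für", a tab, a count, a newline, with ü stored as 0x81.
sample = b"f\x81r\t12\n"

# cp1252 rejects 0x81, so the helper falls back to cp850.
text, used = decode_with_fallback(sample)
print(repr(text), used)

# Alternative: keep a single encoding and let open() replace bad bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
path = tmp.name
with open(path, encoding="cp1252", errors="replace") as f:
    lenient = f.read()
os.remove(path)
print(repr(lenient))
```

Neither approach is lossless. The fallback still accepts lines whose bytes are valid in cp1252 but were really written in cp850, and errors='replace' discards the original character entirely. Which trade-off is acceptable depends on how the word list will be used.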
