Make a new list in CSV without using pandas: return UnicodeDecodeError

Problem Description


I am trying to make a new list in my existing csv file (not using pandas). Here is my code:

import csv

with open('/Users/Weindependent/Desktop/dataset/albumlist.csv', 'r') as case0:
    reader = csv.DictReader(case0)
    album = []
    for row in reader:
        album.append(row)
print("Number of albums is:", len(album))

The CSV file was downloaded from the Rolling Stone's Top 500 albums data set on data.world.

My logic is to create an empty list named album and have all the records in this list. But the line for row in reader seems to be where the problem occurs.

The error message I received is:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 1040: invalid continuation byte

Can anyone let me know what I did wrong?

Recommended Answer

You need to open the file with the correct codec; UTF-8 is not the correct one. The dataset doesn't specify it, but I have determined that the most likely codec is mac_roman:

with open('/Users/Weindependent/Desktop/dataset/albumlist.csv', 'r', encoding='mac_roman') as case0:
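To see the fix end to end without the real file, here is a minimal, self-contained sketch. The raw bytes are taken from the row analysed later in this answer; the header column names are assumptions and should be checked against the actual albumlist.csv:

```python
import csv
import os
import tempfile

# Hypothetical sample written in mac_roman: 0x89 is 'â', 0xca is a no-break space.
# The column names here are assumed; check them against the real albumlist.csv header.
raw = (b'Number,Year,Album,Artist,Genre,Subgenre\r\n'
       b'359,1972,Honky Ch\x89teau,Elton John,Rock,"Pop Rock,\xcaClassic Rock"\r\n')

path = os.path.join(tempfile.mkdtemp(), 'albumlist.csv')
with open(path, 'wb') as f:
    f.write(raw)

# encoding='mac_roman' is the fix; the default UTF-8 raises UnicodeDecodeError here.
with open(path, 'r', encoding='mac_roman', newline='') as case0:
    reader = csv.DictReader(case0)
    album = list(reader)

print("Number of albums is:", len(album))  # Number of albums is: 1
print(album[0]['Album'])                   # Honky Château
```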

The original Kaggle dataset doesn't bother to document it, and the various kernels that use the set all just clobber the encoding. It's clearly an 8-bit Latin variant (the majority of the data is ASCII, with a few individual 8-bit codepoints).

So I analysed the data, and found there are just two such codepoints, appearing in 9 rows:

>>> import re
>>> eightbit = re.compile(rb'[\x80-\xff]')
>>> with open('albumlist.csv', 'rb') as bindata:
...     nonascii = [l for l in bindata if eightbit.search(l)]
...
>>> len(nonascii)
9
>>> {c for l in nonascii for c in eightbit.findall(l)}
{b'\x89', b'\xca'}

The 0x89 byte appears in just one line:

>>> sum(l.count(b'\x89') for l in nonascii)
1
>>> sum(l.count(b'\xca') for l in nonascii)
22
>>> next(l for l in nonascii if b'\x89' in l)
b'359,1972,Honky Ch\x89teau,Elton John,Rock,"Pop Rock,\xcaClassic Rock"\r\n'

That's clearly Elton John's 1972 Honky Château album, so the 0x89 byte must represent the U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX codepoint.
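As a quick round-trip check, decoding that raw row with mac_roman recovers the accented title:

```python
# Decoding the raw row (from the analysis above) with mac_roman.
line = b'359,1972,Honky Ch\x89teau,Elton John,Rock,"Pop Rock,\xcaClassic Rock"\r\n'
decoded = line.decode('mac_roman')
print('Honky Ch\u00e2teau' in decoded)  # True
```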

The 0xCA bytes all appear to represent an alternative space character; they all appear right after commas in the genre and subgenre columns (with one album exception):

>>> import csv
>>> for row in csv.reader((l.decode('ascii', 'backslashreplace') for l in nonascii)):
...     for col in row:
...         if '\\' in col: print(col)
...
Reggae,\xcaPop,\xcaFolk, World, & Country,\xcaStage & Screen
Reggae,\xcaRoots Reggae,\xcaRocksteady,\xcaContemporary,\xcaSoundtrack
Electronic,\xcaStage & Screen
Soundtrack,\xcaDisco
Rock,\xcaBlues
Blues Rock,\xcaElectric Blues,\xcaHarmonica Blues
Garage Rock,\xcaPsychedelic Rock
Honky Ch\x89teau
Pop Rock,\xcaClassic Rock
Funk / Soul,\xcaFolk, World, & Country
Rock,\xcaPop
Stan Getz\xca/\xcaJoao Gilberto\xcafeaturing\xcaAntonio Carlos Jobim
Bossa Nova,\xcaLatin Jazz
Lo-Fi,\xcaIndie Rock

These 0xCA bytes almost certainly represent the U+00A0 NO-BREAK SPACE codepoint.
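You can confirm this directly, since Python's mac_roman codec maps 0xCA to exactly that codepoint:

```python
import unicodedata

# mac_roman maps byte 0xCA to U+00A0.
ch = b'\xca'.decode('mac_roman')
print(hex(ord(ch)), unicodedata.name(ch))  # 0xa0 NO-BREAK SPACE
```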

With these two mappings, you can try to determine which 8-bit codecs make the same mapping. Rather than manually trying out all of Python's codecs, I used Tripleee's 8-bit codec mapping to see which codecs use these mappings. There are two relevant entries:



  • 0x89 → â (U+00E2): mac_arabic, mac_croatian, mac_farsi, mac_greek, mac_iceland, mac_roman, mac_romanian, mac_turkish

  • 0xca → U+00A0 NO-BREAK SPACE: mac_centeuro, mac_croatian, mac_cyrillic, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish

There are 6 encodings that are listed in both sets:

>>> set1 = set('mac_arabic, mac_croatian, mac_farsi, mac_greek, mac_iceland, mac_roman, mac_romanian, mac_turkish'.split(', '))
>>> set2 = set('mac_centeuro, mac_croatian, mac_cyrillic, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish'.split(', '))
>>> set1 & set2
{'mac_turkish', 'mac_iceland', 'mac_romanian', 'mac_greek', 'mac_croatian', 'mac_roman'}

Of these, the Mac OS Roman codec (mac_roman) is the most likely to have been used, as Microsoft Excel for Mac used Mac Roman to create CSV files for a long time. However, it doesn't really matter; any of those 6 would work here.
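If the mapping analysis above is right, all six surviving codecs should decode both bytes identically; a quick sketch to check:

```python
# The six codecs in the intersection; each should map 0x89 -> U+00E2 (â)
# and 0xca -> U+00A0 (no-break space) per the codec tables above.
candidates = ['mac_croatian', 'mac_greek', 'mac_iceland',
              'mac_roman', 'mac_romanian', 'mac_turkish']
for codec in candidates:
    assert b'\x89'.decode(codec) == '\u00e2'   # â
    assert b'\xca'.decode(codec) == '\u00a0'   # no-break space
print('all 6 codecs agree on both bytes')
```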

You may want to replace those U+00A0 non-breaking spaces if you want to split out the genre and subgenre columns (really the genre and style columns if these were taken from Discogs).
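In fact, once decoded, those no-break spaces are useful: a separator comma is always followed by U+00A0, which distinguishes it from a real comma inside a genre name. A sketch, using a hypothetical genre cell taken from the rows shown earlier:

```python
# Hypothetical genre cell as decoded with mac_roman; the '\u00a0' after the
# separator commas distinguishes them from commas inside a genre name.
genre = 'Funk / Soul,\u00a0Folk, World, & Country'
parts = genre.split(',\u00a0')
print(parts)  # ['Funk / Soul', 'Folk, World, & Country']
```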
