'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte


Problem description

I am trying to read a dataset into a DataFrame called df1, but it does not work:

import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")

df1.head()

The code above produces a long traceback, but this is the most relevant part:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

Solution

The data is indeed not encoded as UTF-8; everything is ASCII except for that single 0x92 byte:

b'Korea, Dem. People\x92s Rep.'

Decode it as Windows codepage 1252 instead, where 0x92 is a fancy quote, ’ (U+2019, the right single quotation mark):

df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
                  sep=";", encoding='cp1252')
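To see why the default codec fails and cp1252 succeeds, the byte string quoted above can be decoded by hand. This is only a minimal sketch using the standard library; the variable names are illustrative and not part of the original answer.

```python
# The offending cell, as raw bytes (copied from the answer above).
raw = b'Korea, Dem. People\x92s Rep.'

# Strict UTF-8 decoding fails: 0x92 is a continuation byte,
# so it cannot start a character.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    error_position = exc.start      # index of the offending byte
    bad_byte = raw[exc.start]       # the byte value itself

# cp1252 maps 0x92 to U+2019, the right single quotation mark.
decoded = raw.decode('cp1252')

print(error_position, hex(bad_byte), ascii(decoded))
# prints: 18 0x92 'Korea, Dem. People\u2019s Rep.'
```

The reported position (18) and byte (0x92) match the error message in the question.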

Demo:

>>> import pandas as pd
>>> df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
...                   sep=";", encoding='cp1252')
>>> df1.head()
                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  \
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..

   2010  2011  2012  2013  Unnamed: 15  2014  2015
0  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3    ..    ..    ..    ..          NaN    ..    ..
4    ..    ..    ..    ..          NaN    ..    ..

Note, however, that Pandas seems to take the HTTP headers at face value too, and produces mojibake when you load the data from a URL. When I save the data directly to disk and then load it with pd.read_csv(), it is decoded correctly, but loading from the URL produces re-coded data:

>>> df1[' '][102]
'Korea, Dem. Peopleâ€™s Rep.'
>>> df1[' '][102].encode('cp1252').decode('utf8')
'Korea, Dem. People’s Rep.'

This is a known bug in Pandas. You can work around it by opening the URL with urllib.request and passing the response object to pd.read_csv() instead:

>>> import urllib.request
>>> with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
...     df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
...
>>> df1[' '][102]
'Korea, Dem. People’s Rep.'
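
If a DataFrame has already been loaded with the re-coded text, the encode/decode round trip shown earlier can be wrapped in a small helper. This is a hedged sketch rather than part of the original answer: the name fix_mojibake is hypothetical, and the helper only handles the specific case of UTF-8 bytes that were mistakenly decoded as cp1252.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as cp1252.

    Hypothetical helper: strings that do not survive the round trip
    are returned unchanged.
    """
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake; leave it alone


# Reproduce the garbling: encode correctly as UTF-8, decode wrongly as cp1252.
garbled = 'Korea, Dem. People\u2019s Rep.'.encode('utf-8').decode('cp1252')
repaired = fix_mojibake(garbled)

print(ascii(garbled))
print(ascii(repaired))
```

Assuming every value in the column is a string, the helper could be applied with something like df1[' '].map(fix_mojibake). Plain ASCII values and correctly decoded values pass through untouched.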
