How to read bytes object from csv?
Question
I have used tweepy to store the text of tweets in a csv file using Python's csv.writer(), but I had to encode the text in utf-8 before storing it, otherwise tweepy throws a weird error.
Now, the text data is stored like this:
"b'Lorem Ipsum\xc2\xa0Assignment '"
I tried to decode this using the following code (there is more data in other columns; the text is in the column at index 3):
import csv

with open('data.csv', 'rt', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print(row[3])
But it doesn't decode the text. I cannot use .decode('utf-8') because the csv reader reads the data as strings, i.e. type(row[3]) is 'str', and I can't seem to convert it into bytes: the data just gets encoded once more!
How can I decode the text data?
Here's a sample line from the csv file:
67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6 | @abcde',52,18
Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.
Recommended answer
If your input file really contains strings with Python-syntax b prefixes on them, one way to work around it (even though that is not really a valid format for csv data to contain) would be to use Python's ast.literal_eval function, which @Ryan mentioned, although I would use it in a slightly different way, as shown below.
This provides a safe way to parse the strings in the file that are prefixed with a b, indicating that they are byte-strings. The rest are passed through unchanged.
import ast
import csv

def _parse_bytes(field):
    """ Convert string represented in Python byte-string literal b'' syntax into
        a decoded character string - otherwise return it unchanged.
    """
    try:
        result = ast.literal_eval(field)
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        return field  # Not a valid Python literal: pass it through unchanged.
    return result.decode() if isinstance(result, bytes) else field

def my_csv_reader(filename, /, **kwargs):
    # newline='' lets the csv module handle line endings itself.
    with open(filename, 'rt', encoding='utf-8', newline='') as file:
        for row in csv.reader(file, **kwargs):
            yield [_parse_bytes(field) for field in row]

reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
    print(row)
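
For illustration, here is a minimal in-memory sketch of the whole round trip. It assumes the file was produced by handing encoded bytes to csv.writer, which stringifies non-string data with str() and so stores the bytes as their b'...' repr. The helper name decode_bytes_literal and the shortened sample text are made up for this example and do not come from the question.

```python
import ast
import csv
import io

def decode_bytes_literal(field):
    """Return decoded text if `field` is a b'...' literal, else `field` unchanged."""
    try:
        value = ast.literal_eval(field)
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        return field  # Not a Python literal at all: leave it alone.
    return value.decode('utf-8') if isinstance(value, bytes) else field

# Reproduce the problem: csv.writer calls str() on the bytes object,
# so the stored field is the text of its repr, not the original characters.
buffer = io.StringIO()
csv.writer(buffer).writerow([3415844, 'Augusta County in\u2026'.encode('utf-8')])
stored = buffer.getvalue()

# Read the row back and decode only the fields that are byte-string literals.
row = next(csv.reader(io.StringIO(stored)))
decoded = [decode_bytes_literal(field) for field in row]
print(decoded)
```

Numeric-looking fields such as '3415844' also parse as literals, but since the result is not a bytes object the original string is returned, so only the b'...' column changes.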