How to read bytes object from csv?


Question

I used tweepy to collect tweets and stored their text in a csv file with Python's csv.writer(). I had to encode the text as UTF-8 before storing it, because otherwise tweepy threw a weird error.

Now the text data is stored like this:

"b'Lorem Ipsum\xc2\xa0Assignment '"

I tried to decode it with this code (there is more data in the other columns; the text is in the column at index 3):

import csv

with open('data.csv','rt',encoding='utf-8') as f:
    reader = csv.reader(f,delimiter=',')
    for row in reader:
        print(row[3])

But this doesn't decode the text. I cannot call .decode('utf-8') because the csv reader returns the data as strings, i.e. type(row[3]) is str, and I can't seem to convert it into bytes: the data just gets encoded once more!
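The behaviour described above can be reproduced with a minimal sketch. It assumes the file was written under Python 3 with the encoded bytes passed straight to csv.writer (the writing code is not shown in the question, so this is a guess). csv.writer stringifies non-string values with str(), so a bytes object ends up in the file as its repr, b prefix and escape sequences included.

```python
import csv
import io

text = "Lorem Ipsum\xa0Assignment "   # \xa0 is a non-breaking space
encoded = text.encode("utf-8")         # b'Lorem Ipsum\xc2\xa0Assignment '

# Write the bytes object to a csv (kept in memory here instead of a file).
buffer = io.StringIO()
csv.writer(buffer).writerow([encoded])

# Read it back: the field is a str holding the *repr* of the bytes object.
buffer.seek(0)
field = next(csv.reader(buffer))[0]
print(type(field).__name__, field)

# Encoding that str does not recover the original bytes; it only encodes
# the literal characters b, ', \, x, c, 2 ... one more time.
reencoded = field.encode("utf-8")
print(reencoded == encoded)
```

If this is what happened, the stored text is a Python bytes literal written out as characters, which is why neither .decode() nor .encode() on the field gives back the tweet.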

How can I decode the text data?

Here's a sample line from the csv file:

67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6  | @abcde',52,18

Note: If the solution lies in the encoding process, please note that I cannot afford to download the entire data again.

Recommended answer

If your input file really contains strings with Python-syntax b prefixes on them, one way to work around it (even though that is not really a valid format for csv data) is to use Python's ast.literal_eval function, which @Ryan mentioned, although I would use it in a slightly different way, as shown below.

This provides a safe way to parse the fields in the file that are prefixed with a b, indicating they are byte strings. All other fields are passed through unchanged.

import ast
import csv


def _parse_bytes(field):
    """ Convert string represented in Python byte-string literal b'' syntax into
        a decoded character string - otherwise return it unchanged.
    """
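    try:
        result = ast.literal_eval(field)
    except (ValueError, TypeError, SyntaxError, MemoryError, RecursionError):
        return field  # Not a Python literal, so leave the field as it is.
    # Only bytes literals are decoded (UTF-8 by default); other fields stay strings.
    return result.decode() if isinstance(result, bytes) else field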


def my_csv_reader(filename, /, **kwargs):
    with open(filename, 'rt', newline='') as file:
        for row in csv.reader(file, **kwargs):
            yield [_parse_bytes(field) for field in row]


reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
    print(row)
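To see the effect on the sample line from the question, here is a small self-contained sketch that parses that line from memory rather than from a file. The helper name parse_bytes_field is hypothetical and simply mirrors the approach above; the input is the sample row copied from the question.

```python
import ast
import csv
import io


def parse_bytes_field(field):
    """Hypothetical helper: decode a field written as a b'...' literal."""
    try:
        value = ast.literal_eval(field)
    except (ValueError, TypeError, SyntaxError, MemoryError, RecursionError):
        return field  # Not a Python literal: keep the field unchanged.
    return value.decode("utf-8") if isinstance(value, bytes) else field


# Raw strings keep the backslash escapes as literal characters,
# exactly as they appear in the csv file.
sample = (
    r"67783591545656656999,3415844,1450443669.0,"
    r"b'Virginia School District Closes After Backlash Over Arabic Assignment: "
    r"The Augusta County school district in\xe2\x80\xa6  | @abcde',52,18"
)

row = next(csv.reader(io.StringIO(sample)))
decoded = [parse_bytes_field(field) for field in row]

# The three escaped bytes \xe2\x80\xa6 become one character (U+2026).
print(decoded[3])
```

The numeric columns also parse as literals, but because they are not bytes they come back as the original strings, so only the tweet text changes.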
