在python中读取Avro文件时出错 [英] Error when reading avro files in python

查看：173 发布时间：2020/9/15 5:12:44 python-3.x avro

本文介绍了在python中读取Avro文件时出错的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我在Python中成功安装了Apache Avro.然后按照以下说明尝试将Avro文件读入Python.

https://avro.apache.org/docs/1.8.1/gettingstartedpython.html

我在目录中有一堆Avros，该目录已经在Python中设置为正确的路径.这是我的代码:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
   print (user)
reader.close()

但是它返回此错误:

 Traceback (most recent call last):
  File "I:\DJ data\read avro.py", line 5, in <module>
    reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 349, in __init__
    self._read_header()
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 459, in _read_header
    META_SCHEMA, META_SCHEMA, self.raw_decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
    return self.read_record(writer_schema, reader_schema, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg \avro\io.py", line 725, in read_record
    field_val = self.read_data(field.type, readers_field.type, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 515, in read_data
    return self.read_fixed(writer_schema, reader_schema, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 568, in read_fixed
    return decoder.read(writer_schema.size)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 170, in read
    input_bytes = self.reader.read(n)
  File "I:\Program Files\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 863: character maps to <undefined>

我确实知道，在指令中的示例中，首先创建了一个架构.但是什么是avsc文件?在我的情况下，应如何创建它和相应的架构? 理想情况下，我想将Avro文件读入Python，并以Python的磁盘或数据帧/列表类型将其保存为csv格式，以供进一步分析.我正在Windows 7上使用Python 3. >

已编辑 我尝试了Stephane的代码，并返回了新错误

 Traceback (most recent call last):
  File "I:\DJ data\read avro.py", line 5, in <module>
    reader = DataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 352, in __init__
    self.codec = self.GetMeta('avro.codec').decode('utf-8')
AttributeError: 'NoneType' object has no attribute 'decode'

EDITED2 :在大多数情况下，Stephane的代码都可以运行，但有时会引发这样的AssertionError

Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 42, in <module>
for user in reader:
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 522, in __next__
datum = self.datum_reader.read(self.datum_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 480, in read
return self.read_data(self.writer_schema, self.reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 523, in read_data
return self.read_union(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 689, in read_union
return self.read_data(selected_writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 493, in read_data
return self.read_data(writer_schema, s, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 503, in read_data
return decoder.read_utf8()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 248, in read_utf8
input_bytes = self.read_bytes()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 241, in read_bytes
return self.read(nbytes)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 171, in read
assert (len(input_bytes) == n), input_bytes
AssertionError: b'BlackRock Group\n\n17 December 2015\n\nFORM 8.3\n\nPUBLIC OPENING POSITION DISCLOSURE/DEALING DISCLOSURE BY\n\nA PERSON WITH INTERESTS IN RELEVANT SECURITIES REPRESENTING 1% OR MORE\n\nRule 8.3 of the Takeover Code (the "Code") \n\n\n   1.         KEY INFORMATION \n \n (a) Full name of discloser:                                                                        BlackRock, Inc. \n-------------------------------------------------------------------------------------------------  ----------------- \n (b) Owner or controller of interests and short positions disclosed, if diffe

解决方案

您正在使用Windows和Python 3.

open以文本模式打开文件.这意味着当进一步的读取操作发生时，Python将尝试将文件的内容从某些字符集解码为unicode.
您没有指定默认字符集，因此Python尝试对内容进行解码，就好像这些内容是使用charmap编码(Windows上默认为)一样.
很明显，您的avro文件未在charmap中进行编码，并且解码失败并出现异常
反而是avro标头是二进制内容...不是文本(对此不确定).所以也许首先您不应该尝试用open来解码文件:

reader = DataFileReader(open("part-00000-of-01733.avro", 'rb'), DatumReader())

(注意'rb'，二进制模式)

对于下一个问题(AttributeError)，您遇到了一个已知错误，该错误未在1.8.1中修复.在下一个版本发布之前，您可以执行以下操作:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter, VALID_CODECS, SCHEMA_KEY
from avro.io import DatumReader, DatumWriter
from avro import io as avro_io


class MyDataFileReader(DataFileReader):
    def __init__(self, reader, datum_reader):
        """Initializes a new data file reader.

        Args:
          reader: Open file to read from.
          datum_reader: Avro datum reader.
        """
        self._reader = reader
        self._raw_decoder = avro_io.BinaryDecoder(reader)
        self._datum_decoder = None  # Maybe reset at every block.
        self._datum_reader = datum_reader

        # read the header: magic, meta, sync
        self._read_header()

        # ensure codec is valid
        avro_codec_raw = self.GetMeta('avro.codec')
        if avro_codec_raw is None:
            self.codec = "null"
        else:
            self.codec = avro_codec_raw.decode('utf-8')
        if self.codec not in VALID_CODECS:
            raise DataFileException('Unknown codec: %s.' % self.codec)

        self._file_length = self._GetInputFileLength()

        # get ready to read
        self._block_count = 0
        self.datum_reader.writer_schema = (
            schema.Parse(self.GetMeta(SCHEMA_KEY).decode('utf-8')))


reader = MyDataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
    print (user)
reader.close()

尽管如此愚蠢的bug可能会发布，这很奇怪，但这并不意味着代码已经成熟！

I installed Apache Avro successfully in Python. Then I try to read Avro files into Python following the instruction below.

https://avro.apache.org/docs/1.8.1/gettingstartedpython.html

I have a bunch of Avros in a directory which has already been set as the right path in Python. Here is my code:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
   print (user)
reader.close()

However it returns this error:

Traceback (most recent call last):
  File "I:\DJ data\read avro.py", line 5, in <module>
    reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 349, in __init__
    self._read_header()
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 459, in _read_header
    META_SCHEMA, META_SCHEMA, self.raw_decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
    return self.read_record(writer_schema, reader_schema, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg \avro\io.py", line 725, in read_record
    field_val = self.read_data(field.type, readers_field.type, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 515, in read_data
    return self.read_fixed(writer_schema, reader_schema, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 568, in read_fixed
    return decoder.read(writer_schema.size)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 170, in read
    input_bytes = self.reader.read(n)
  File "I:\Program Files\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 863: character maps to <undefined>

I am indeed aware that in the example in the instruction, a schema is created first. But what is a avsc file? How shall I create it and the corresponding schema in my case? Ideally, I would like to read Avro files into Python and save it into csv format in the disk or dataframe/list type in Python for further analysis. I'm using Python 3 on Windows 7.

EDITED I tried Stephane's code, and it returns a new error

Traceback (most recent call last):
  File "I:\DJ data\read avro.py", line 5, in <module>
    reader = DataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 352, in __init__
    self.codec = self.GetMeta('avro.codec').decode('utf-8')
AttributeError: 'NoneType' object has no attribute 'decode'

EDITED2: Stephane's code works in most cases, but sometimes it incurs AssertionError like this

Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 42, in <module>
for user in reader:
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 522, in __next__
datum = self.datum_reader.read(self.datum_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 480, in read
return self.read_data(self.writer_schema, self.reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 523, in read_data
return self.read_union(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 689, in read_union
return self.read_data(selected_writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 493, in read_data
return self.read_data(writer_schema, s, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 503, in read_data
return decoder.read_utf8()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 248, in read_utf8
input_bytes = self.read_bytes()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 241, in read_bytes
return self.read(nbytes)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 171, in read
assert (len(input_bytes) == n), input_bytes
AssertionError: b'BlackRock Group\n\n17 December 2015\n\nFORM 8.3\n\nPUBLIC OPENING POSITION DISCLOSURE/DEALING DISCLOSURE BY\n\nA PERSON WITH INTERESTS IN RELEVANT SECURITIES REPRESENTING 1% OR MORE\n\nRule 8.3 of the Takeover Code (the "Code") \n\n\n   1.         KEY INFORMATION \n \n (a) Full name of discloser:                                                                        BlackRock, Inc. \n-------------------------------------------------------------------------------------------------  ----------------- \n (b) Owner or controller of interests and short positions disclosed, if diffe

解决方案

You're using windows and Python 3.

in Python 3 by default open opens files in text mode. It means that when further read operations happen, Python will try to decode the content of the file from some charset to unicode.
you did not specify a default charset, so Python tries to decode the content as if such content was encoded using charmap (by default on windows).
obviously your avro file is not encoded in charmap, and the decoded fails with an exception
as far as i remember, avro headers anyway are binary content... not textual (not sure about that). so maybe first you should try NOT to decode the file with open:

reader = DataFileReader(open("part-00000-of-01733.avro", 'rb'), DatumReader())

(notice 'rb', binary mode)

EDIT: For the next problem (AttributeError), you've been hit by a known bug that's not fixed in 1.8.1. Until next version is out, you could just do something like:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter, VALID_CODECS, SCHEMA_KEY
from avro.io import DatumReader, DatumWriter
from avro import io as avro_io


class MyDataFileReader(DataFileReader):
    def __init__(self, reader, datum_reader):
        """Initializes a new data file reader.

        Args:
          reader: Open file to read from.
          datum_reader: Avro datum reader.
        """
        self._reader = reader
        self._raw_decoder = avro_io.BinaryDecoder(reader)
        self._datum_decoder = None  # Maybe reset at every block.
        self._datum_reader = datum_reader

        # read the header: magic, meta, sync
        self._read_header()

        # ensure codec is valid
        avro_codec_raw = self.GetMeta('avro.codec')
        if avro_codec_raw is None:
            self.codec = "null"
        else:
            self.codec = avro_codec_raw.decode('utf-8')
        if self.codec not in VALID_CODECS:
            raise DataFileException('Unknown codec: %s.' % self.codec)

        self._file_length = self._GetInputFileLength()

        # get ready to read
        self._block_count = 0
        self.datum_reader.writer_schema = (
            schema.Parse(self.GetMeta(SCHEMA_KEY).decode('utf-8')))


reader = MyDataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
    print (user)
reader.close()

It is very strange that such stupid bug could go to releases though, and that's not a sign a code maturity!

这篇关于在python中读取Avro文件时出错的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

在python中读取Avro文件时出错 [英] Error when reading avro files in python

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

在python中读取Avro文件时出错 [英] Error when reading avro files in python

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭