Error when reading Avro files in Python
I installed Apache Avro for Python successfully, then tried to read Avro files into Python by following the instructions below.
https://avro.apache.org/docs/1.8.1/gettingstartedpython.html
I have a bunch of Avro files in a directory, and that directory is already set as the working path in Python. Here is my code:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
    print(user)
reader.close()
However, it returns this error:
Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 5, in <module>
reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 349, in __init__
self._read_header()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 459, in _read_header
META_SCHEMA, META_SCHEMA, self.raw_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 515, in read_data
return self.read_fixed(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 568, in read_fixed
return decoder.read(writer_schema.size)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 170, in read
input_bytes = self.reader.read(n)
File "I:\Program Files\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 863: character maps to <undefined>
I am aware that the example in the instructions creates a schema first. But what is an .avsc file? How should I create it, and the corresponding schema, in my case? Ideally, I would like to read the Avro files into Python and either save them to disk in CSV format or load them into a dataframe or list in Python for further analysis. I'm using Python 3 on Windows 7.
EDIT: I tried Stephane's code, and it returns a new error:
Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 5, in <module>
reader = DataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 352, in __init__
self.codec = self.GetMeta('avro.codec').decode('utf-8')
AttributeError: 'NoneType' object has no attribute 'decode'
EDIT 2: Stephane's code works in most cases, but sometimes it raises an AssertionError like this:
Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 42, in <module>
for user in reader:
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 522, in __next__
datum = self.datum_reader.read(self.datum_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 480, in read
return self.read_data(self.writer_schema, self.reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 523, in read_data
return self.read_union(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 689, in read_union
return self.read_data(selected_writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 493, in read_data
return self.read_data(writer_schema, s, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 503, in read_data
return decoder.read_utf8()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 248, in read_utf8
input_bytes = self.read_bytes()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 241, in read_bytes
return self.read(nbytes)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 171, in read
assert (len(input_bytes) == n), input_bytes
AssertionError: b'BlackRock Group\n\n17 December 2015\n\nFORM 8.3\n\nPUBLIC OPENING POSITION DISCLOSURE/DEALING DISCLOSURE BY\n\nA PERSON WITH INTERESTS IN RELEVANT SECURITIES REPRESENTING 1% OR MORE\n\nRule 8.3 of the Takeover Code (the "Code") \n\n\n 1. KEY INFORMATION \n \n (a) Full name of discloser: BlackRock, Inc. \n------------------------------------------------------------------------------------------------- ----------------- \n (b) Owner or controller of interests and short positions disclosed, if diffe
You're using Windows and Python 3.
In Python 3, open opens files in text mode by default. That means that on every read, Python tries to decode the file's bytes into str using some character encoding. You did not specify an encoding, so Python falls back to the platform default, which on your Windows machine is cp1252 (the 'charmap' codec named in the traceback). Your Avro file is not cp1252-encoded text, so the decoding fails with an exception.
An Avro data file is binary content, not text: the header begins with the magic bytes 'Obj' followed by the byte 1. So you should not let open decode the file at all; open it in binary mode instead:
reader = DataFileReader(open("part-00000-of-01733.avro", 'rb'), DatumReader())
(Note the 'rb': binary mode.)
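The difference between the two modes can be reproduced without Avro at all. The sketch below is only an illustration: the temporary file and the `sample` bytes are made up, and `encoding="cp1252"` is passed explicitly to mimic the Windows default seen in the traceback.

```python
import os
import tempfile

# Hypothetical sample: the Avro magic bytes followed by 0x90,
# a byte that cp1252 has no character for.
sample = b"Obj\x01\x90"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Text mode: Python decodes the bytes and fails on 0x90.
    try:
        with open(path, "r", encoding="cp1252") as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode: the bytes come back untouched.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(type(text_mode_error).__name__)
print(raw[:4] == b"Obj\x01")
```

Anything that parses a binary format, Avro's DataFileReader included, needs the second kind of file object.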
EDIT: For the next problem (the AttributeError), you've hit a known bug that is not fixed in 1.8.1. Until the next version is out, you could work around it with something like:
import avro.schema
from avro.datafile import DataFileReader, DataFileException, VALID_CODECS, SCHEMA_KEY
from avro.io import DatumReader
from avro import io as avro_io

class MyDataFileReader(DataFileReader):
    def __init__(self, reader, datum_reader):
        """Initializes a new data file reader.

        Args:
          reader: Open file to read from (must be opened in binary mode).
          datum_reader: Avro datum reader.
        """
        self._reader = reader
        self._raw_decoder = avro_io.BinaryDecoder(reader)
        self._datum_decoder = None  # Maybe reset at every block.
        self._datum_reader = datum_reader
        # read the header: magic, meta, sync
        self._read_header()
        # ensure codec is valid; a missing 'avro.codec' entry means "null"
        avro_codec_raw = self.GetMeta('avro.codec')
        if avro_codec_raw is None:
            self.codec = "null"
        else:
            self.codec = avro_codec_raw.decode('utf-8')
        if self.codec not in VALID_CODECS:
            raise DataFileException('Unknown codec: %s.' % self.codec)
        self._file_length = self._GetInputFileLength()
        # get ready to read
        self._block_count = 0
        self.datum_reader.writer_schema = (
            avro.schema.Parse(self.GetMeta(SCHEMA_KEY).decode('utf-8')))

# Open the file in binary mode ('rb'), not text mode.
reader = MyDataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
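To see why GetMeta('avro.codec') can return None, it helps to look at the header directly. Per the Avro specification, an object container file starts with four magic bytes, then a metadata map, then a 16-byte sync marker, and the avro.codec entry is optional (absent means "null"). What follows is a standalone sketch using only the standard library, not the avro package's own code; the names `read_long`, `read_bytes` and `read_header_metadata`, and the tiny in-memory header, are made up for illustration.

```python
import io

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long (zig-zag, variable-length) from a binary stream."""
    shift = 0
    accum = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("unexpected end of stream while reading a long")
        byte = raw[0]
        accum |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            break
        shift += 7
    # Undo the zig-zag encoding.
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    n = read_long(stream)
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("unexpected end of stream while reading bytes")
    return data


def read_header_metadata(stream):
    """Return the header metadata map as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a block byte size follows
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


# Hypothetical header holding only 'avro.schema', with no 'avro.codec' entry.
header = (MAGIC + bytes([2, 22]) + b"avro.schema"
          + bytes([16]) + b'"string"' + bytes([0]) + b"\x00" * 16)
meta = read_header_metadata(io.BytesIO(header))
print(sorted(meta))
print(meta.get("avro.codec"))
```

On a real file, the same function could be called on `open("part-00000-of-01733.avro", "rb")`; if `avro.codec` is missing from the result, the unpatched 1.8.1 reader is the one calling `.decode` on None.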
It is strange that such a basic bug made it into a release, though; that is not a sign of code maturity.
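As for saving the records as CSV, here is a minimal sketch using only the standard library. It assumes each record the reader yields is a flat dict (nested records, arrays or unions would need flattening first), and `records_to_csv` is a hypothetical helper name, not part of the avro package.

```python
import csv
import io


def records_to_csv(records, out):
    """Write an iterable of flat dicts to an open text stream as CSV.

    The column order is taken from the first record. Returns the row count.
    """
    records = list(records)
    if not records:
        return 0
    fieldnames = list(records[0].keys())
    # extrasaction="ignore" skips keys that the first record did not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return len(records)


# Made-up records standing in for what the Avro reader would yield.
buffer = io.StringIO()
written = records_to_csv([{"name": "a", "n": 1}, {"name": "b", "n": 2}], buffer)
print(written)
print(buffer.getvalue())
```

With a real file, the output stream would be something like `open("out.csv", "w", newline="")`, and the reader object itself could be passed as `records`, since it is iterable.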