Apache Beam 2.7.0 crashes in UTF-8 decoding of French characters


Question

I am trying to write a CSV file from a Google Cloud Platform storage bucket into Datastore. The file contains French characters (accents), and I get an error message about decoding.

After trying to encode and decode from "latin-1" to "utf-8" without success (using unicode, unicodedata and codecs), I tried to change things manually...

The OS I am using has "ascii" as its default encoding, and I manually changed it to utf-8 in "Anaconda3/envs/py27/lib/site.py".

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8" # Default value set by _PyUnicode_Init()
    sys.setdefaultencoding("utf-8")

I tried this locally with a test file, by printing a string with accents and then writing it into a file, and it worked!

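# -*- coding: utf-8 -*-
import codecs

# Python 2: a plain string literal is a byte string, so it can be decoded.
string = 'naïve café'
test_decode = codecs.utf_8_decode(string, "strict", True)[0]
print(test_decode)

# Writing unicode here relies on the utf-8 default encoding set in site.py.
with open('./test.txt', 'w') as outfile:
    outfile.write(test_decode)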
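As a side note, the same local check can be done without editing site.py by passing the encoding explicitly when opening the file. This is a minimal sketch, assuming the text is held as a unicode string; the scratch file path is hypothetical and only there for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u"na\xefve caf\xe9"  # u"naïve café", written with escapes

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# io.open accepts an explicit encoding on both Python 2 and Python 3,
# so the interpreter's default encoding does not matter here.
with io.open(path, "w", encoding="utf-8") as outfile:
    outfile.write(text)

with io.open(path, "r", encoding="utf-8") as infile:
    round_trip = infile.read()

with io.open(path, "rb") as infile:
    raw_bytes = infile.read()
```

Because the encoding is stated at the call site, the result does not depend on how a given machine is configured.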

But no luck with apache_beam...

Then I tried to manually change "/usr/lib/python2.7/encodings/utf_8.py", putting "ignore" instead of "strict" in codecs.utf_8_decode:

def decode(input, errors='ignore'):
    return codecs.utf_8_decode(input, errors, True)

But I realized that apache_beam does not use this file, or at least does not take any of the changes into account.

Any ideas how to deal with it?

Please find the error message below:

Traceback (most recent call last):
  File "etablissementsFiness.py", line 146, in <module>
    dataflow(run_locally)
  File "etablissementsFiness.py", line 140, in dataflow
    | 'Write entities into Datastore' >> WriteToDatastore(PROJECT)
  File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\pipeline.py", line 414, in __exit__
    self.run().wait_until_finish()
  File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 1148, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 156, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
    def start(self):
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.spec.source.reader() as reader:
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
    for value in reader:
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/textio.py", line 201, in read_records
    yield self._coder.decode(record)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/coders/coders.py", line 307, in decode
    return value.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 190: invalid continuation byte
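
The last line of the traceback hints at a likely cause. Byte 0xe9 is "é" in Latin-1 (and Windows-1252), whereas in UTF-8 it opens a three-byte sequence, so an ordinary character right after it counts as an invalid continuation byte. The sketch below reproduces the message with made-up sample bytes; it assumes the CSV was saved as Latin-1, which the question does not confirm.

```python
# Made-up sample: u"café au lait" saved as Latin-1 (not taken from the real CSV).
raw = b"caf\xe9 au lait"

# Latin-1 maps the single byte 0xe9 to the character "é".
as_latin1 = raw.decode("latin-1")

# Strict UTF-8 decoding rejects the same bytes.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason   # the reason text shown at the end of a traceback
    error_position = exc.start
```

If that assumption holds, the file itself is fine and only the codec chosen for reading it is wrong.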

Recommended answer

Try writing a CustomCoder class that ignores any errors while decoding:

from apache_beam.coders.coders import Coder

class CustomCoder(Coder):
    """A custom coder used for reading and writing strings as UTF-8."""

    def encode(self, value):
        return value.encode("utf-8", "replace")

    def decode(self, value):
        return value.decode("utf-8", "ignore")

    def is_deterministic(self):
        return True

Then read and write the files using coder=CustomCoder():

lines = p | "Read" >> ReadFromText("files/path/*.txt", coder=CustomCoder())

# More processing code here...

output | WriteToText("output/file/path", file_name_suffix=".txt", coder=CustomCoder())
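
Note that decoding with "ignore" silently drops the bytes it cannot decode, so accented characters may vanish from the output. If the input is actually Latin-1 (an assumption, suggested by byte 0xe9 in the error), a variant that falls back to Latin-1 keeps them. The class name Utf8OrLatin1Coder is hypothetical, and the import fallback is only there so the sketch runs where Beam is not installed.

```python
try:
    from apache_beam.coders.coders import Coder
except ImportError:
    Coder = object  # stand-in so the sketch runs without Beam installed


class Utf8OrLatin1Coder(Coder):
    """Hypothetical coder: decode as UTF-8, fall back to Latin-1."""

    def encode(self, value):
        return value.encode("utf-8")

    def decode(self, value):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a character, so this cannot fail.
            return value.decode("latin-1")

    def is_deterministic(self):
        return True


coder = Utf8OrLatin1Coder()
kept = coder.decode(b"caf\xe9")                  # accent preserved
dropped = b"caf\xe9".decode("utf-8", "ignore")   # accent silently lost
```

It would be passed the same way as above, with coder=Utf8OrLatin1Coder(). The trade-off is that a Latin-1 record whose bytes happen to form valid UTF-8 would be decoded as UTF-8, so a plain Latin-1 decode is the safer choice when the file's encoding is known.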
