Apache Beam 2.7.0 crashes when UTF-8 decoding French characters


Problem description

I am trying to write a CSV file from a Google Cloud Platform bucket into Datastore. The file contains French characters/accents, and I get an error message about decoding.

After trying to encode and decode from "latin-1" to "utf-8" without success (using unicode, unicodedata and codecs), I tried changing things manually.

The OS I am using has "ascii" as its default encoding, so I manually changed it to utf-8 in "Anaconda3/envs/py27/lib/site.py":

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8" # Default value set by _PyUnicode_Init()
    sys.setdefaultencoding("utf-8")

I tried this locally with a test file, printing a string with accents and then writing it to a file, and it worked:

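# -*- coding: utf-8 -*-
import codecs

string = 'naïve café'  # a UTF-8 byte string under Python 2
test_decode = codecs.utf_8_decode(string, "strict", True)[0]
print(test_decode)

# Writing unicode here relies on the default encoding having been changed to utf-8
with open('./test.txt', 'w') as outfile:
    outfile.write(test_decode)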

But no luck with apache_beam.

Then I tried manually changing "/usr/lib/python2.7/encodings/utf_8.py" to pass "ignore" instead of "strict" to codecs.utf_8_decode:

def decode(input, errors='ignore'):
    return codecs.utf_8_decode(input, errors, True)

But I realized that apache_beam does not use this file, or at least does not take any changes to it into account.

Is there any way to handle this?

Please find the error message below:

Traceback (most recent call last):
  File "etablissementsFiness.py", line 146, in <module>
    dataflow(run_locally)
  File "etablissementsFiness.py", line 140, in dataflow
    | 'Write entities into Datastore' >> WriteToDatastore(PROJECT)
  File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\pipeline.py", line 414, in __exit__
    self.run().wait_until_finish()
  File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 1148, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 156, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
    def start(self):
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.spec.source.reader() as reader:
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
    for value in reader:
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/textio.py", line 201, in read_records
    yield self._coder.decode(record)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/coders/coders.py", line 307, in decode
    return value.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 190: invalid continuation byte
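
The last line of the traceback names byte 0xe9. In Latin-1 (ISO-8859-1) that byte is "é", whereas in UTF-8 it opens a three-byte sequence, so an ordinary character right after it produces "invalid continuation byte". The snippet below is a minimal sketch of that mismatch. It assumes the CSV was saved as Latin-1, which the question does not confirm, and the sample text is made up for illustration.

```python
# -*- coding: utf-8 -*-
# Assumption: the CSV was saved as Latin-1; the sample text is illustrative only.
raw = u"café ouvert".encode("latin-1")  # "é" is stored as the single byte 0xe9

try:
    raw.decode("utf-8")  # the same call that fails in coders.py
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc  # raised because 0xe9 is followed by a space, not a continuation byte

# Decoding with the encoding the bytes were written in recovers the text.
as_latin1 = raw.decode("latin-1")

print(utf8_error)
print(as_latin1)
```

If the file really is Latin-1, this suggests the problem lies in the data's encoding rather than in the local Python configuration.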

Recommended answer

Try writing a CustomCoder class that "ignores" any errors while decoding:

from apache_beam.coders.coders import Coder

class CustomCoder(Coder):
    """A custom coder used for reading and writing strings as UTF-8."""

    def encode(self, value):
        return value.encode("utf-8", "replace")

    def decode(self, value):
        return value.decode("utf-8", "ignore")

    def is_deterministic(self):
        return True

Then use coder=CustomCoder() when reading and writing files:

lines = p | "Read" >> ReadFromText("files/path/*.txt", coder=CustomCoder())

# More processing code here...

output | WriteToText("output/file/path", file_name_suffix=".txt", coder=CustomCoder())
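
One consequence of the "ignore" handler is that bytes which are not valid UTF-8 are dropped silently, so accented letters stored as Latin-1 disappear from the output. The sketch below compares that behaviour with a hypothetical alternative that falls back to Latin-1. It is plain Python rather than a Beam coder. The function names are made up for illustration, and the Latin-1 encoding of the input is an assumption.

```python
# -*- coding: utf-8 -*-

def decode_ignoring_errors(value):
    """Same call as CustomCoder.decode: invalid bytes are dropped."""
    return value.decode("utf-8", "ignore")


def decode_with_latin1_fallback(value):
    """Hypothetical alternative: try UTF-8 first, then fall back to Latin-1."""
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this decode cannot fail.
        return value.decode("latin-1")


# Assumption: the record was written as Latin-1; the text is illustrative only.
record = u"Hôpital privé".encode("latin-1")

dropped = decode_ignoring_errors(record)    # accented letters are lost
kept = decode_with_latin1_fallback(record)  # accented letters survive

print(dropped)
print(kept)
```

The same try/except could be placed inside the decode method of a coder like CustomCoder if keeping the accents matters more than tolerating arbitrary bytes. Note that the fallback guesses the encoding, so it is only appropriate when the input is known to be either UTF-8 or Latin-1.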
