Chinese Text for H2O DataFrame in Python


Question

I have a UTF-8 encoded CSV file containing Chinese text. When I try to import it as an H2O dataframe, the data is displayed as gibberish.

 dataframe = h2o.import_file('test.csv')

In the resulting dataframe the column names are correct, but instead of Chinese text the cells display text like this:

 在ç�¡è¦ºäº†ä½ 知é�
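The gibberish above has the shape you get when UTF-8 bytes are decoded with a single-byte Windows code page. The following is a minimal sketch of that mechanism only, assuming the wrong codec is cp1252 (the codec named in the traceback further down); it does not call H2O and does not confirm what import_file does internally. The sample string is an illustrative choice.

```python
# Sketch: decode UTF-8 bytes with the wrong codec to reproduce mojibake.
# Assumption: the wrong codec is cp1252; H2O itself is not involved here.
text = "睡覺了"
utf8_bytes = text.encode("utf-8")

# cp1252 leaves a few byte values undefined (0x9D is one of them), so
# errors="replace" turns those bytes into U+FFFD, the replacement character.
garbled = utf8_bytes.decode("cp1252", errors="replace")

# Decoding the same bytes with the right codec gives the original text back.
correct = utf8_bytes.decode("utf-8")

print(garbled)
print(correct)
```

Each Chinese character becomes three unrelated characters because it occupies three bytes in UTF-8 and cp1252 maps every byte to its own character.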

I looked through the h2o documentation, and import_file does not seem to offer an encoding option the way pandas does. Further, when running the following:

testing = ['你','好','嗎']
h2o.H2OFrame(testing)

it gives this error:

--------------------------------------------------------------------------
 UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-2-5f4b3eb49a84> in <module>
      1 testing = ['你','好','嗎']
----> 2 h2o.H2OFrame(testing)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\frame.py in __init__(self, python_obj, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns)
    104         if python_obj is not None:
    105             self._upload_python_object(python_obj, destination_frame, header, separator,
--> 106                                        column_names, column_types, na_strings, skipped_columns)
    107 
    108     @staticmethod

~\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\frame.py in _upload_python_object(self, python_obj, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns)
    143             csv_writer.writerow([row.get(k, None) for k in col_header])
    144         else:
--> 145             csv_writer.writerows(data_to_write)
    146         tmp_file.close()  # close the streams
    147         self._upload_parse(tmp_path, destination_frame, 1, separator, column_names, column_types, na_strings, skipped_columns)

~\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\u4f60' in position 1: character maps to <undefined>

Based on this error, it seems that h2o is using cp1252 encoding. Can someone help me get h2o to import the CSV file with Chinese text as UTF-8? Thank you.
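The traceback ends inside a csv writer whose output stream encodes with cp1252, which has no mapping for Chinese characters. The sketch below reproduces that failure with the standard library alone and shows that the same rows write cleanly once the encoding is set to UTF-8. The helper write_rows is hypothetical and is not part of h2o; this illustrates the mechanism and is not a confirmed workaround for the affected h2o version.

```python
import csv
import os
import tempfile

rows = [["你"], ["好"], ["嗎"]]


def write_rows(encoding):
    """Write rows to a temporary CSV file and return the raw bytes written.

    Hypothetical helper for illustration; not an h2o function.
    """
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        # newline="" lets the csv module control line endings itself.
        with os.fdopen(fd, "w", encoding=encoding, newline="") as handle:
            csv.writer(handle).writerows(rows)
        with open(path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(path)


# cp1252 cannot represent these characters, so the write fails.
try:
    write_rows("cp1252")
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

# With an explicit UTF-8 encoding the same rows are written without error.
utf8_bytes = write_rows("utf-8")

print(cp1252_failed)
print(utf8_bytes.decode("utf-8"))
```

On Windows, a text file opened without an explicit encoding typically falls back to the locale code page, which would explain why cp1252 appears in the traceback even though the input is valid Unicode.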

Recommended answer

The jira ticket in the comments has been resolved, and this parsing issue no longer occurs in newer versions of H2O. My recommendation is to upgrade: with the latest version of H2O you shouldn't have any issues.

I tested your example with version 3.22.0.2 and got:

In [6]: h2o.H2OFrame(testing)
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
Out[6]:
C1
----
你
好
嗎

[3 rows x 1 column]
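To check whether an installed release is at least as new as the one tested above, a small comparison like the following may help. It is only a sketch: 3.22.0.2 is the version the answer reports testing, not necessarily the first release containing the fix, and the helper names are hypothetical. It also assumes a purely numeric dotted version string, such as the one in h2o.__version__.

```python
# Version reported as working in the answer; not necessarily the first fixed release.
TESTED_VERSION = (3, 22, 0, 2)


def parse_version(version):
    """Turn a dotted numeric version string such as '3.22.0.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def at_least_tested(version):
    """Return True if the given version is the tested one or newer."""
    return parse_version(version) >= TESTED_VERSION


# Example inputs; with h2o installed, pass h2o.__version__ instead.
print(at_least_tested("3.22.0.2"))
print(at_least_tested("3.20.0.8"))
```

If the check returns False, upgrading with pip install --upgrade h2o is the route the answer recommends.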
