Chinese Text for H2O DataFrame in Python
Question
I have a UTF-8 encoded CSV file containing Chinese text. When I import it as an H2O dataframe, the text is displayed as gibberish.
dataframe = h2o.import_file('test.csv')
In the resulting dataframe the column names are correct, but instead of the Chinese text it displays text like this:
在ç�¡è¦ºäº†ä½ 知é�
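Gibberish of this shape is what UTF-8 bytes look like when they are decoded with a single-byte Windows code page. The sketch below uses only the Python standard library, not h2o, and assumes that cp1252 is the codec doing the misreading; the sample string is chosen for illustration.

```python
# UTF-8 bytes of a short Chinese string.
original = "睡覺了"
utf8_bytes = original.encode("utf-8")

# Assumption: the reader decodes these bytes as cp1252 instead of UTF-8.
# Bytes that cp1252 leaves undefined (0x9D here) become U+FFFD.
garbled = utf8_bytes.decode("cp1252", errors="replace")
print(garbled)

# Decoding the very same bytes as UTF-8 recovers the text.
restored = utf8_bytes.decode("utf-8")
print(restored)
```

Each Chinese character occupies three bytes in UTF-8, so each one turns into three unrelated Latin characters or replacement marks.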
I looked through the H2O documentation, and there does not seem to be any way to set an encoding option for import_file as there is in pandas. Furthermore, when I run the following:
testing = ['你','好','嗎']
h2o.H2OFrame(testing)
it gives this error:
--------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-2-5f4b3eb49a84> in <module>
1 testing = ['你','好','嗎']
----> 2 h2o.H2OFrame(testing)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\frame.py in __init__(self, python_obj, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns)
104 if python_obj is not None:
105 self._upload_python_object(python_obj,
destination_frame, header, separator,
--> 106 column_names,
column_types, na_strings, skipped_columns)
107
108 @staticmethod
~\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\frame.py in _upload_python_object(self, python_obj, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns)
143 csv_writer.writerow([row.get(k, None) for k in col_header])
144 else:
--> 145 csv_writer.writerows(data_to_write)
146 tmp_file.close() # close the streams
147 self._upload_parse(tmp_path, destination_frame, 1,
separator, column_names, column_types, na_strings, skipped_columns)
~\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):
UnicodeEncodeError: 'charmap' codec can't encode character '\u4f60' in position 1: character maps to <undefined>
Based on this error, it seems that h2o is using cp1252 encoding. Can someone help me get h2o to import the CSV file with Chinese text as UTF-8? Thank you.
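The last frame of the traceback is in cp1252.py, which is consistent with the temporary CSV being written through a text stream that uses the platform's default encoding. Below is a minimal standard-library sketch of that failure mode. It does not use h2o and is not h2o's actual code; the helper write_rows is hypothetical.

```python
import csv
import io

rows = [["你"], ["好"], ["嗎"]]

def write_rows(encoding):
    """Write rows as CSV through a text stream that uses the given encoding."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding=encoding, newline="")
    csv.writer(text).writerows(rows)
    text.flush()
    return raw.getvalue()

# cp1252 has no mapping for Chinese characters, so the write fails.
try:
    write_rows("cp1252")
    cp1252_failed = False
except UnicodeEncodeError as err:
    cp1252_failed = True
    print("cp1252 failed:", err)

# The same rows are written without error when the stream is UTF-8.
utf8_bytes = write_rows("utf-8")
print(utf8_bytes.decode("utf-8"))
```

The data is the same in both calls; only the encoding of the stream decides whether the write succeeds.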
Recommended answer
The Jira ticket in the comments has been resolved, and this parsing issue no longer occurs in newer versions of H2O. My recommendation is to upgrade: with the latest version of H2O you shouldn't have any issues.
I tested your example with version 3.22.0.2 and got:
In [6]: h2o.H2OFrame(testing)
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
Out[6]:
C1
----
你
好
嗎
[3 rows x 1 column]