Converting a pandas DataFrame to an H2O frame efficiently


Question

I have a pandas DataFrame whose source data is encoded as latin-1 and delimited by semicolons (;). The DataFrame is very large, almost 350000 x 3800 in size. I initially wanted to use sklearn, but my data has missing values (NaN), so I could not use sklearn's random forests or GBM. I therefore had to use H2O's Distributed Random Forest to train on the dataset. The main problem is that the conversion is not efficient when I call h2o.H2OFrame(data). I looked for a way to specify the encoding, but found nothing in the documentation.

Does anyone have an idea about this? Any leads would help. I would also like to know whether there are other libraries like H2O that handle NaN values efficiently. I know the columns could be imputed, but I should not do that for my dataset: the columns are values from different sensors, and a missing value means the sensor is not present. I can only use Python.

Recommended answer

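import h2o
import pandas as pd

# Start (or connect to) a local H2O cluster before creating any frames
h2o.init()

# Build a small frame containing non-ASCII strings and convert it directly
df = pd.DataFrame({'col1': [1,1,2], 'col2': ['César Chávez Day', 'César Chávez Day', 'César Chávez Day']})
hf = h2o.H2OFrame(df)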
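
For a dataset the size of the one in the question, an alternative worth trying is to skip the in-memory pandas conversion and let H2O parse the file itself. The sketch below is not part of the original answer. It assumes the data still exists on disk as a latin-1, semicolon-delimited file, and that H2O's parser is better served by UTF-8 input (an assumption on my part, not something this thread confirms). reencode_to_utf8 and load_with_h2o are hypothetical helper names; h2o.init and h2o.import_file with its sep argument are the library calls used.

```python
import shutil


def reencode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Stream a text file from src_encoding to UTF-8 without loading it all in memory."""
    # newline="" keeps the original line endings untouched on both sides
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        shutil.copyfileobj(src, dst)  # copies in fixed-size chunks
    return dst_path


def load_with_h2o(csv_path, sep=";"):
    """Hypothetical helper: parse the CSV inside the H2O cluster instead of via pandas."""
    import h2o  # imported lazily so the re-encoding step works without H2O installed

    h2o.init()
    return h2o.import_file(csv_path, sep=sep)
```

Usage might look like load_with_h2o(reencode_to_utf8("data.csv", "data_utf8.csv")), where both file names are placeholders and the path must be readable by the H2O cluster.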

Since the problem you are facing comes from the large number of NaNs in the dataset, deal with those first. There are two ways to do so.

  1. Replace NaN with a single, obviously out-of-range value. For example, if a feature varies between 0 and 1, replace all NaN in that feature with -1.

  2. Use scikit-learn's Imputer class (replaced by SimpleImputer in current scikit-learn releases) to handle NaN values. It replaces NaN with the mean, median, or mode of that feature.
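
The two options above can be sketched as follows. This is a minimal illustration on a made-up three-row frame, not the original poster's data. It uses pandas' fillna for the out-of-range replacement and scikit-learn's SimpleImputer (the successor to the Imputer class named above) for imputation. The column names and the -1 sentinel are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy sensor readings in the range 0-1; NaN marks a missing reading
df = pd.DataFrame({
    "sensor_a": [0.2, np.nan, 0.8],
    "sensor_b": [np.nan, 0.5, 0.1],
})

# Option 1: replace NaN with a value outside the valid range
sentinel_filled = df.fillna(-1)

# Option 2: replace NaN with a per-column statistic (median here;
# "mean" and "most_frequent" are the other strategies)
imputer = SimpleImputer(strategy="median")
median_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Given the question's constraint that a missing value means the sensor is absent, the first option keeps those rows distinguishable from real readings, whereas imputation makes them look like ordinary measurements.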
