How to avoid storing null values in HBase when writing from pandas (Python)?

Problem description

I have some sample data, shown below:

    test_a      test_b   test_c   test_d   test_date
    -------------------------------------------------
1   a           500      0.1      111      20191101
2   a           NaN      0.2      NaN      20191102
3   a           200      0.1      111      20191103
4   a           400      NaN      222      20191104
5   a           NaN      0.2      333      20191105

I want to store this data in HBase, and I use the code below to do so.

from test.db import impala, hbasecon, HiveClient
import pandas as pd

sql = """
    SELECT test_a
            ,test_b
            ,test_c
            ,test_d
            ,test_date
    FROM table_test
    """

conn_impa = HiveClient().getcon()
all_df = pd.read_sql(sql=sql, con=conn_impa, chunksize=50000)

num = 0

for df in all_df:
    df = df.fillna('')
    df["s"] = df["test_d"] + df["test_date"]
    tmp_num = len(df)
    if len(df) > 0:
        with hintltable.batch(batch_size=1000) as b:
            df.apply(lambda row: b.put(row["k"], {
                'test:test_a': str(row["test_a"]),
                'test:test_b': str(row["test_b"]),
                'test:test_c': str(row["test_c"]),
            }), axis=1)

            num += len(df)

When I query HBase with get 'test', 'a201911012', I get the following result:

COLUMN                           CELL                                                                                         
 test:test_a                      timestamp=1578389750838, value=a                                                              
 test:test_b                      timestamp=1578389788675, value=                                                              
 test:test_c                      timestamp=1578389775471, value=0.2                                                              
 test:test_d                      timestamp=1578449081388, value=                                                           

How can I make sure null values are not stored in HBase when writing from pandas? We don't want null or empty-string cells; the expected result is:

COLUMN                           CELL                                                                                         
 test:test_a                      timestamp=1578389750838, value=a                                                                                                                       
 test:test_c                      timestamp=1578389775471, value=0.2                                                              

Recommended answer

You should be able to do this by writing a custom function and calling it from your lambda. For example, you could have a function like this:

def makeEntry(a, b, c):
    entrydict = {}
    # NaN == NaN is False, so `x == x` filters out NaN. The truthiness test
    # (`and x`) filters out empty strings and None, but note that it also
    # skips 0 and 0.0.
    if a == a and a:
        entrydict["test:test_a"] = str(a)
    if b == b and b:
        entrydict["test:test_b"] = str(b)
    if c == c and c:
        entrydict["test:test_c"] = str(c)
    return entrydict
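
One caveat with a plain truthiness test is that it also drops legitimate zeros such as 0 or 0.0. If zeros should be stored, a variant along the following lines may help. This is only a sketch: the name `make_entry_keep_zeros` and its `columns` and `family` parameters are made up for illustration, and it relies on `pd.isna` to detect NaN and None.

```python
import pandas as pd

def make_entry_keep_zeros(row, columns, family="test"):
    """Build a put dict from one row, skipping missing values but keeping zeros.

    Hypothetical helper: `columns` and `family` are illustrative parameters.
    """
    entry = {}
    for col in columns:
        value = row[col]
        # pd.isna covers NaN and None; the '' check covers fillna('') output.
        if pd.isna(value) or value == "":
            continue
        entry[f"{family}:{col}"] = str(value)
    return entry

# Example row with one missing value and one legitimate zero.
row = pd.Series({"test_a": "a", "test_b": float("nan"), "test_c": 0.0})
print(make_entry_keep_zeros(row, ["test_a", "test_b", "test_c"]))
# {'test:test_a': 'a', 'test:test_c': '0.0'}
```
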

You can then change your apply call to:

df.apply(lambda row: b.put(row["k"],
makeEntry(row["test_a"],row["test_b"],row["test_c"])), axis=1)

This way you only write the values that are not NaN (or empty), rather than every value.
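
The same idea can also be expressed with `Series.dropna()` on each row, provided `fillna('')` is not called first (otherwise the missing cells are already empty strings and will not be dropped). The sketch below is an assumption-laden illustration, not the asker's actual setup: `RecordingBatch` is a hypothetical stand-in that only records `put()` calls so the logic can be checked without an HBase connection, and the row keys in the sample frame are invented.

```python
import pandas as pd

class RecordingBatch:
    """Hypothetical stand-in for an HBase batch; it only records put() calls."""
    def __init__(self):
        self.puts = []

    def put(self, row_key, data):
        self.puts.append((row_key, data))

def write_non_null(df, batch, key_col, value_cols, family="test"):
    """Call batch.put once per row, including only the non-missing columns."""
    for _, row in df.iterrows():
        # Series.dropna() removes NaN/None cells from this row.
        present = row[value_cols].dropna()
        data = {f"{family}:{col}": str(val) for col, val in present.items()}
        if data:  # skip rows where every value is missing
            batch.put(row[key_col], data)

# Invented sample data; do NOT call fillna('') before this step.
df = pd.DataFrame({
    "k": ["a20191101", "a20191102"],
    "test_a": ["a", "a"],
    "test_b": [500, None],
    "test_c": [0.1, 0.2],
})
batch = RecordingBatch()
write_non_null(df, batch, "k", ["test_a", "test_b", "test_c"])
for row_key, data in batch.puts:
    print(row_key, data)
```

With a real connection, the `batch` object from the `with ... .batch(...)` block in the question would take the place of `RecordingBatch`.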
