Left justify string values in a pandas DataFrame

Question

I have a DataFrame with 180000+ values, and I need to (1) replace duplicate values and certain other values in cells, row by row, and (2) rearrange what remains. Here is my DataFrame, df:

    key   sellyr  brand  makrc  item1  item2  item3  item4  item5  item6
0   da12  2013    imp    apt    furi   apt    nan    nan    nan    nan
1   da32  2013    sa     rye    rye    app    nan    nan    nan    nan 
2   da14  2013    sa     pro    not    pro    pan    fan    nan    nan
........

The nan values represent np.nan, and the forbidden string is 'not'.

What I need to do is check columns item1~6 and replace any string that also appears in the makrc column with nan. I also want to replace every 'not' with nan. After replacing those strings with np.nan, I need to rearrange item1~6 so that the non-nan data is left justified into the leftmost empty cells, as shown below (expected output):

    key   sellyr  brand  makrc  item1  item2  item3  item4  item5  item6
0   da12  2013    imp    apt    furi   nan    nan    nan    nan    nan
1   da32  2013    sa     rye    app    nan    nan    nan    nan    nan 
2   da14  2013    sa     pro    pan    fan    nan    nan    nan    nan
........

As you can see, at index 0 I removed the apt string in item2 and changed it to np.nan because the same string is in the makrc column. At index 1, I removed rye and replaced it with np.nan, but this time I also moved the 'app' string from item2 to item1, because np.nan values should come after the real values. At index 2, I replaced both pro and not, since every 'not' string in the item columns must become np.nan, and I rearranged the items as well.

I've tried combining all item columns into a list and doing the replacement there, but a few rows contain only np.nan items. Can you recommend a good way to solve this? Thank you so much.

Recommended answer

First, extract the slice of columns whose names begin with item -

# Boolean mask selecting the columns whose names start with 'item'
m = df.columns.str.startswith('item')
i = df.loc[:, m]

Mask all values that meet your criteria. Use isin -

j = i[~i.isin(df.makrc.tolist() + ['not'])]
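
Note that isin, as used above, compares every item against the makrc values of all rows, not only the row the item sits in. If the replacement is meant to be strictly row-wise (an assumption about the question's intent), a sketch along these lines compares each item with the makrc value of its own row via DataFrame.eq with axis=0. The tiny sample frame and the names row_mask and j_rowwise are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 'rye' in row 0 matches the makrc of row 1 only.
df = pd.DataFrame({
    'makrc': ['apt', 'rye'],
    'item1': ['rye', 'rye'],
    'item2': ['apt', 'not'],
})
items = df[['item1', 'item2']]

# True where an item equals the makrc value of the SAME row, or is 'not'.
row_mask = items.eq(df['makrc'], axis=0) | items.eq('not')

# Masked cells become NaN; everything else is kept.
j_rowwise = items.mask(row_mask)
```

With this variant, 'rye' in row 0 survives because row 0's makrc is 'apt'; the isin version would have removed it.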

Now sort the values in each row so that NaNs come last, and assign back -

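# result_type='broadcast' keeps the frame's shape; by default, recent pandas returns a Series of lists here
df.loc[:, m] = j.apply(sorted, axis=1, result_type='broadcast', key=pd.isnull)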
df

    key  sellyr brand makrc item1 item2  item3  item4  item5  item6
0  da12    2013   imp   apt  furi   NaN    NaN    NaN    NaN    NaN
1  da32    2013    sa   rye   app   NaN    NaN    NaN    NaN    NaN
2  da14    2013    sa   pro   pan   fan    NaN    NaN    NaN    NaN
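
For reference, the three steps can be run end to end on the sample data. This is only a sketch: the frame is rebuilt by hand from the three rows shown in the question, the name cols is introduced here, and the row sort uses a plain list comprehension rather than apply so that the result does not depend on how a given pandas version expands list results.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'key': ['da12', 'da32', 'da14'],
    'sellyr': [2013, 2013, 2013],
    'brand': ['imp', 'sa', 'sa'],
    'makrc': ['apt', 'rye', 'pro'],
    'item1': ['furi', 'rye', 'not'],
    'item2': ['apt', 'app', 'pro'],
    'item3': [np.nan, np.nan, 'pan'],
    'item4': [np.nan, np.nan, 'fan'],
    'item5': [np.nan] * 3,
    'item6': [np.nan] * 3,
})

cols = [c for c in df.columns if c.startswith('item')]
i = df[cols]

# Blank out anything found in makrc (any row) and the forbidden string 'not'.
j = i[~i.isin(df['makrc'].tolist() + ['not'])]

# sorted() is stable and False < True, so non-null values keep
# their relative order and move to the left of the NaNs.
rows = [sorted(row, key=pd.isnull) for row in j.to_numpy()]
df[cols] = pd.DataFrame(rows, index=df.index, columns=cols)
```

Because the sort key is only the null flag, the surviving strings are not reordered alphabetically; they are simply shifted left.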

Details

i

  item1 item2 item3 item4  item5  item6
0  furi   apt   NaN   NaN    NaN    NaN
1   rye   app   NaN   NaN    NaN    NaN
2   not   pro   pan   fan    NaN    NaN

j

  item1 item2 item3 item4  item5  item6
0  furi   NaN   NaN   NaN    NaN    NaN
1   NaN   app   NaN   NaN    NaN    NaN
2   NaN   NaN   pan   fan    NaN    NaN

Better performance

You could make use of a modified version of Divakar's justify function that works on object arrays -

import numpy as np
import pandas as pd

def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value treated as empty; pass np.nan to treat nulls as empty
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = pd.notnull(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object) 
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

df.loc[:, m] = justify(j.values, invalid_val=np.nan, axis=1, side='left')
df

    key  sellyr brand makrc item1 item2  item3  item4  item5  item6
0  da12    2013   imp   apt  furi   NaN    NaN    NaN    NaN    NaN
1  da32    2013    sa   rye   app   NaN    NaN    NaN    NaN    NaN
2  da14    2013    sa   pro   pan   fan    NaN    NaN    NaN    NaN

This should (hopefully) be faster than calling apply, since it works on the whole array at once instead of sorting row by row. You'll especially see speed gains using the original version of the function, which is optimised for numeric data.
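
As a rough illustration of the numeric case, the sketch below adapts the function above to a float array. It is an assumption about what a numeric-only variant looks like, not a copy of the original: the name justify_numeric is made up here, np.isnan replaces pd.notnull, and the output keeps the input dtype instead of object.

```python
import numpy as np

def justify_numeric(a, invalid_val=0, axis=1, side='left'):
    """Justify a 2D numeric array (sketch; expects a float array when invalid_val is NaN)."""
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # Sorting booleans puts False first and True last along the axis.
    justified_mask = np.sort(mask, axis=axis)
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=a.dtype)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

a = np.array([[np.nan, 1.0, np.nan, 2.0],
              [3.0, np.nan, np.nan, 4.0]])
left = justify_numeric(a, invalid_val=np.nan, side='left')
right = justify_numeric(a, invalid_val=np.nan, side='right')
```

Here left pushes the valid numbers in each row to the front and right pushes them to the end, keeping their order within the row.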
