Left justify string values in a pandas DataFrame
Problem description
I have a DataFrame with 180000+ values and I need to (1) replace duplicate values and certain other values in cells, row by row, and (2) rearrange what is left. Here is my DataFrame, df:
    key  sellyr brand makrc item1 item2 item3 item4 item5 item6
0  da12    2013   imp   apt  furi   apt   nan   nan   nan   nan
1  da32    2013    sa   rye   rye   app   nan   nan   nan   nan
2  da14    2013    sa   pro   not   pro   pan   fan   nan   nan
........
The nan values represent np.nan, and the forbidden string is 'not'.
What I need to do is check columns item1~6 and replace any string that also appears in the makrc column with nan. I also want to replace every 'not' with nan. After replacing those strings with np.nan, I need to rearrange item1~6 so that the non-nan data is left justified into the leftmost empty cells, as shown below (expected output):
    key  sellyr brand makrc item1 item2 item3 item4 item5 item6
0  da12    2013   imp   apt  furi   nan   nan   nan   nan   nan
1  da32    2013    sa   rye   app   nan   nan   nan   nan   nan
2  da14    2013    sa   pro   pan   fan   nan   nan   nan   nan
........
As you can see, in index 0 I removed the apt string in item2 and changed it to np.nan because the same string is in the makrc column. In index 1, I removed rye and replaced it with np.nan; this time I also moved the 'app' string from item2 to item1, because np.nan values should come after the real values. In index 2, I replaced both pro and not, since every 'not' string in the item columns has to become np.nan, and I rearranged the items as well.
I've tried combining all the item columns into a list and replacing values there, but a few rows contain only np.nan items. Can you recommend a good way to solve my problem? Thank you so much.
Recommended answer
First, extract the slice of columns whose names contain 'item':
m = df.columns.str.contains('item')
i = df.iloc[:, m]
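
To follow along, the sample frame from the question can be rebuilt as in the sketch below. The values are copied from the question's table. Using str.startswith in place of str.contains is an assumption made here: it is slightly stricter, since str.contains would also match a column named, say, 'lineitem'.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "key": ["da12", "da32", "da14"],
    "sellyr": [2013, 2013, 2013],
    "brand": ["imp", "sa", "sa"],
    "makrc": ["apt", "rye", "pro"],
    "item1": ["furi", "rye", "not"],
    "item2": ["apt", "app", "pro"],
    "item3": [np.nan, np.nan, "pan"],
    "item4": [np.nan, np.nan, "fan"],
    "item5": [np.nan] * 3,
    "item6": [np.nan] * 3,
})

# Boolean mask over the column labels (assumption: startswith, not contains).
m = df.columns.str.startswith("item")
i = df.loc[:, m]
print(list(i.columns))
```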
Mask all values that meet your criteria, using isin:
j = i[~i.isin(df.makrc.tolist() + ['not'])]
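
Note that isin with the full makrc list masks an item whenever it matches the makrc value of any row, not only its own row. If a strictly row-wise comparison is what is wanted, one possible variant (a sketch on a small made-up frame, not part of the original answer) compares each row against its own makrc value with DataFrame.eq:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame: 'apt' in row 1 matches row 0's makrc, not its own.
df = pd.DataFrame({
    "makrc": ["apt", "rye"],
    "item1": ["furi", "apt"],
    "item2": ["apt", "not"],
})
i = df[["item1", "item2"]]

# The answer's approach: membership in the whole makrc column.
j_global = i[~i.isin(df["makrc"].tolist() + ["not"])]

# Row-wise variant: compare each row only with that row's makrc value.
row_hit = i.eq(df["makrc"], axis=0) | i.eq("not")
j_rowwise = i.mask(row_hit)

print(j_global)
print(j_rowwise)
```

In the global version the 'apt' in row 1 is masked as well; in the row-wise version it is kept.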
Now sort the values in each row so that NaNs come last, and assign the result back:
df.loc[:, m] = j.apply(sorted, key=pd.isnull, axis=1)
df
    key  sellyr brand makrc item1 item2 item3 item4 item5 item6
0  da12    2013   imp   apt  furi   NaN   NaN   NaN   NaN   NaN
1  da32    2013    sa   rye   app   NaN   NaN   NaN   NaN   NaN
2  da14    2013    sa   pro   pan   fan   NaN   NaN   NaN   NaN
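
Depending on the pandas version, apply with a function that returns a list may give back a Series of lists rather than a DataFrame, in which case the assignment above does not behave as shown. A version-independent sketch of the same idea (sort each row with pd.isnull as the key, relying on Python's stable sort) builds the frame explicitly; the small frame j here is a stand-in for the masked item columns:

```python
import numpy as np
import pandas as pd

# Stand-in for the masked item columns (j in the answer).
j = pd.DataFrame({
    "item1": ["furi", np.nan, np.nan],
    "item2": [np.nan, "app", "pan"],
    "item3": [np.nan, np.nan, "fan"],
})

# sorted() is stable, so non-null values keep their order and NaNs move right.
rows = [sorted(row, key=pd.isnull) for row in j.to_numpy()]
left = pd.DataFrame(rows, index=j.index, columns=j.columns)
print(left)
```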
Details
i
  item1 item2 item3 item4 item5 item6
0  furi   apt   NaN   NaN   NaN   NaN
1   rye   app   NaN   NaN   NaN   NaN
2   not   pro   pan   fan   NaN   NaN
j
  item1 item2 item3 item4 item5 item6
0  furi   NaN   NaN   NaN   NaN   NaN
1   NaN   app   NaN   NaN   NaN   NaN
2   NaN   NaN   pan   fan   NaN   NaN
For better performance
You could make use of a modified version of Divakar's justify function that works on object arrays:
import numpy as np
import pandas as pd

def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value treated as empty (pass np.nan for missing values)
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    # True where the cell holds a valid (non-empty) value
    if invalid_val is np.nan:
        mask = pd.notnull(a)
    else:
        mask = a != invalid_val
    # Sorting the booleans pushes True to the end; flip for 'left'/'up'
    justified_mask = np.sort(mask, axis=axis)
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
df.loc[:, m] = justify(j.values, invalid_val=np.nan, axis=1, side='left')
df
    key  sellyr brand makrc item1 item2 item3 item4 item5 item6
0  da12    2013   imp   apt  furi   NaN   NaN   NaN   NaN   NaN
1  da32    2013    sa   rye   app   NaN   NaN   NaN   NaN   NaN
2  da14    2013    sa   pro   pan   fan   NaN   NaN   NaN   NaN
This should (hopefully) be faster than calling apply. You'll especially see speed gains using the original version of the function, which is optimised for numeric data.
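
For comparison, here is a different vectorised sketch that is not Divakar's function: it left justifies by stably argsorting the null mask of each row and reordering with np.take_along_axis. The helper name left_justify_nan is hypothetical, and the approach assumes NaN is the only "empty" marker.

```python
import numpy as np
import pandas as pd

def left_justify_nan(a):
    """Move the non-null entries of each row of a 2D array to the left."""
    a = np.asarray(a, dtype=object)
    null = pd.isnull(a)                      # True where the cell is NaN/None
    # Stable sort on the boolean mask: False (valid) first, original order kept.
    order = np.argsort(null, axis=1, kind="stable")
    return np.take_along_axis(a, order, axis=1)

a = np.array([
    ["furi", np.nan, np.nan],
    [np.nan, "app", np.nan],
    [np.nan, "pan", "fan"],
], dtype=object)
out = left_justify_nan(a)
print(out)
```

The result can be assigned back in the same way as the output of justify, for example df.loc[:, m] = left_justify_nan(j.values).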