How to standardize strings between rows in a pandas DataFrame?
Question
I have the following pandas DataFrame in Python 3.x:
import pandas as pd
dict1 = {
    'ID': ['first', 'second', 'third', 'fourth', 'fifth'],
    'pattern': ['AAABCDEE', 'ABBBBD', 'CCCDE', 'AA', 'ABCDE']
}
df = pd.DataFrame(dict1)
>>> df
       ID   pattern
0   first  AAABCDEE
1  second    ABBBBD
2   third     CCCDE
3  fourth        AA
4   fifth     ABCDE
There are two columns, ID and pattern. The longest string in pattern is in the first row: len('AAABCDEE') is 8.
My goal is to standardize the strings so that they all have the same length, with the trailing positions filled with ?.
The output would look like this:
>>> df
       ID   pattern
0   first  AAABCDEE
1  second  ABBBBD??
2   third  CCCDE???
3  fourth  AA??????
4   fifth  ABCDE???
If I were able to make the trailing spaces NaN, then I could try something like:
df = df.applymap(lambda x: int(x) if pd.notnull(x) else str("?"))
But I'm not sure how one would efficiently (1) find the longest string in pattern and (2) then add NaN at the end of the strings up to this length. This may be a convoluted approach...
Recommended answer
You can use Series.str.ljust for this, after getting the maximum string length in the column.
df.pattern.str.ljust(df.pattern.str.len().max(), '?')
# 0 AAABCDEE
# 1 ABBBBD??
# 2 CCCDE???
# 3 AA??????
# 4 ABCDE???
# Name: pattern, dtype: object
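Note that the expression above returns a new Series and leaves df unchanged. A minimal, self-contained sketch of writing the padded values back into the column might look like the following; the names max_len and padded are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['first', 'second', 'third', 'fourth', 'fifth'],
    'pattern': ['AAABCDEE', 'ABBBBD', 'CCCDE', 'AA', 'ABCDE'],
})

# Length of the longest string in the column (8 for this data).
max_len = df['pattern'].str.len().max()

# str.ljust returns a new Series, so assign it back to keep the padded values.
df['pattern'] = df['pattern'].str.ljust(max_len, '?')

# Collect the padded strings as a plain list for inspection.
padded = df['pattern'].tolist()
print(padded)
```

Every string in pattern then has length max_len, with ? filling the positions to the right of the original text.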
In the source for pandas 0.22.0, it can be seen that ljust is entirely equivalent to pad with side='right', so pick whichever you find clearer.
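As a quick check of that equivalence, the sketch below pads the same data both ways and compares the results. The variable names (patterns, width, with_ljust, with_pad, same) are illustrative only.

```python
import pandas as pd

patterns = pd.Series(['AAABCDEE', 'ABBBBD', 'CCCDE', 'AA', 'ABCDE'])

# Target width: the length of the longest string in the Series.
width = patterns.str.len().max()

# Pad on the right with '?' using ljust ...
with_ljust = patterns.str.ljust(width, '?')

# ... and using pad with side='right'.
with_pad = patterns.str.pad(width, side='right', fillchar='?')

# Series.equals compares the two results element by element.
same = with_ljust.equals(with_pad)
print(same)
```

Here pad spells out the fill side explicitly, which may read more clearly if the same code also pads on the left or on both sides elsewhere.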