Comparing a dataframe on string lengths for different columns


Problem description

I am trying to get the string lengths for different columns. Seems quite straightforward with:

df['a'].str.len()

But I need to apply it to multiple columns and then get the minimum.

Something like:

df[['a','b','c']].str.len().min

I know the above doesn't work, but hopefully you get the idea. Column a, b, c all contain names and I want to retrieve the shortest name.

Also, because the data is huge, I am avoiding creating extra columns to save on size.

Solution

I think you need a list comprehension, because the .str string functions work only with a Series (a single column):

print ([df[col].str.len().min() for col in ['a','b','c']])

Another solution with apply:

print ([df[col].apply(len).min() for col in ['a','b','c']])

Sample:

import pandas as pd

df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':['st','dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})

print (df)

     a     b      c  d
0    h    st  fffff  1
1   gg  dsws         3
2  yyy    sw     rr  5

print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2, 0]
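
If the column labels should stay attached to the results, the same list comprehension can be written as a dict comprehension. This is only a minimal sketch on the sample data; the names cols and min_len are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': ['st', 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})

cols = ['a', 'b', 'c']  # illustrative name for the string columns

# One minimum length per column, keyed by column label.
# No extra column is added to df; only a temporary Series per column is created.
min_len = {col: int(df[col].str.len().min()) for col in cols}
print(min_len)                # {'a': 1, 'b': 2, 'c': 0}

# Smallest length across all three columns
print(min(min_len.values()))  # 0
```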

Timings:

#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)

In [17]: %timeit ([df[col].apply(len).min() for col in ['a','b','c']])
100 loops, best of 3: 2.63 ms per loop

In [18]: %timeit ([df[col].str.len().min() for col in ['a','b','c']])
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.88 ms per loop
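
The %timeit lines above only run inside IPython. Outside IPython, a roughly equivalent measurement can be sketched with the standard timeit module. The helper names with_apply and with_str_len are made up for this example, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': ['st', 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})
# [3000 rows x 4 columns], as in the timings above
df = pd.concat([df] * 1000).reset_index(drop=True)
cols = ['a', 'b', 'c']


def with_apply():
    # Python-level len() called on every value
    return [df[col].apply(len).min() for col in cols]


def with_str_len():
    # pandas string accessor
    return [df[col].str.len().min() for col in cols]


n = 20  # calls per measurement
t_apply = min(timeit.repeat(with_apply, number=n, repeat=3)) / n
t_str = min(timeit.repeat(with_str_len, number=n, repeat=3)) / n
print(f"apply(len): {t_apply * 1e3:.2f} ms per loop")
print(f".str.len(): {t_str * 1e3:.2f} ms per loop")
```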

Conclusion:

apply is slightly faster in this test (2.63 ms vs. 2.88 ms per loop), but it does not work with None:

df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':[None,'dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})


print (df)
     a     b      c  d
0    h  None  fffff  1
1   gg  dsws         3
2  yyy    sw     rr  5

print ([df[col].apply(len).min() for col in ['a','b','c']])

TypeError: object of type 'NoneType' has no len()

print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2.0, 0]
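
The 2.0 for column b appears because .str.len() returns a missing value for None instead of raising, and min() then skips it. The sketch below illustrates this on a single Series, and also shows one possible way to keep using apply(len) by dropping missing values first (that workaround is an assumption of this example, not part of the original answer).

```python
import pandas as pd

s = pd.Series([None, 'dsws', 'sw'])

# The None entry becomes a missing value in the result rather than an error
lengths = s.str.len()
print(lengths.isna().tolist())  # [True, False, False]

# min() skips missing values by default
print(lengths.min())

# apply(len) raises on None, so drop missing values before applying it
min_with_apply = s.dropna().apply(len).min()
print(min_with_apply)
```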

EDIT (in response to a comment), taking the minimum per row instead of per column:

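# fails with None (the output below is for the first sample, without None)
# note: applymap is deprecated since pandas 2.1 in favor of DataFrame.map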
print (df[['a','b','c']].applymap(len).min(axis=1))
0    1
1    0
2    2
dtype: int64


# works with None (the output below is for the first sample, without None)
print (df[['a','b','c']].apply(lambda x: x.str.len().min(), axis=1))
0    1
1    0
2    2
dtype: int64
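
The snippets above return lengths, while the question asks for the shortest name itself. One way to get the string, sketched here under the assumption that missing values should simply be ignored, is Python's built-in min with key=len. The names shortest_per_col and shortest_overall are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': [None, 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})

cols = ['a', 'b', 'c']

# Shortest string in each column; dropna() removes None before len() is called
shortest_per_col = {col: min(df[col].dropna(), key=len) for col in cols}
print(shortest_per_col)        # {'a': 'h', 'b': 'sw', 'c': ''}

# Shortest string across all three columns. The generator avoids building
# a combined column or a copy of the frame.
shortest_overall = min((v for col in cols for v in df[col].dropna()), key=len)
print(repr(shortest_overall))  # ''
```

When several strings share the minimum length, min returns the first one it encounters.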
