Comparing a dataframe on string lengths for different columns

Views: 178 | Posted: 2017/3/26 3:24:55 | Tags: python pandas dataframe min string-length


Problem description

I am trying to get the string lengths for different columns. Seems quite straightforward with:
df['a'].str.len()
But I need to apply it to multiple columns. And then get the minimum on it.

Something like:
df[['a','b','c']].str.len().min
I know the above doesn't work, but hopefully you get the idea. Column a, b, c all contain names and I want to retrieve the shortest name.

Also, because the data is huge, I want to avoid creating extra columns, to save on size.
Solution
I think you need a list comprehension, because the .str string functions work only with a Series (a single column), not with a DataFrame:
print ([df[col].str.len().min() for col in ['a','b','c']])
Another solution with apply:
print ([df[col].apply(len).min() for col in ['a','b','c']])
Sample:
import pandas as pd

df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':['st','dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})

print (df)
     a     b      c  d
0    h    st  fffff  1
1   gg  dsws         3
2  yyy    sw     rr  5

print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2, 0]
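
The question also asks for a single minimum across all the columns. A minimal sketch of that last step, assuming the sample frame above (the names cols, per_column and overall are only illustrative): a generator expression feeds the per-column minima to the built-in min, so no extra columns are stored on the frame.

```python
import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': ['st', 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})
cols = ['a', 'b', 'c']  # illustrative name for the columns of interest

# Minimum string length per column, converted to plain Python ints
per_column = {col: int(df[col].str.len().min()) for col in cols}

# Overall minimum across the three columns; nothing is added to df
overall = min(df[col].str.len().min() for col in cols)

print(per_column)  # {'a': 1, 'b': 2, 'c': 0}
print(overall)     # 0
```

Each df[col].str.len() still builds a temporary Series of lengths, but it is discarded as soon as its minimum has been taken.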
Timings:
#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)

In [17]: %timeit ([df[col].apply(len).min() for col in ['a','b','c']])
100 loops, best of 3: 2.63 ms per loop

In [18]: %timeit ([df[col].str.len().min() for col in ['a','b','c']])
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.88 ms per loop
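
The %timeit magic only exists inside IPython. As a rough sketch of how the same comparison could be run from a plain script, the standard-library timeit module can be used instead. The function names with_apply and with_str_len are made up for this example, and the numbers it prints will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': ['st', 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})
# Same 3000-row frame as in the timings above
df = pd.concat([df] * 1000).reset_index(drop=True)
cols = ['a', 'b', 'c']


def with_apply():
    # Calls Python's len on every element
    return [df[col].apply(len).min() for col in cols]


def with_str_len():
    # Uses the pandas string accessor
    return [df[col].str.len().min() for col in cols]


# Check that both approaches agree before comparing their speed
assert with_apply() == with_str_len()

for func in (with_apply, with_str_len):
    # Best of 3 runs, each run calling the function 20 times
    best = min(timeit.repeat(func, number=20, repeat=3)) / 20
    print(f"{func.__name__}: {best * 1e3:.2f} ms per loop")
```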
Conclusion:

apply is slightly faster here (2.63 ms vs 2.88 ms), but it does not work if a column contains None:
df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':[None,'dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})

print (df)
     a     b      c  d
0    h  None  fffff  1
1   gg  dsws         3
2  yyy    sw     rr  5

print ([df[col].apply(len).min() for col in ['a','b','c']])

TypeError: object of type 'NoneType' has no len()

print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2.0, 0]
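
The 2.0 for column b appears because str.len() returns NaN for None, which turns that column's lengths into floats, and min() skips NaN by default. The question also asks for the shortest name itself, not only its length. One possible sketch, assuming the index labels are unique and that no column is entirely missing (the names lengths_b, candidates and shortest are illustrative): find the shortest entry per column with idxmin, then pick the shortest of those.

```python
import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': [None, 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})
cols = ['a', 'b', 'c']

# None becomes NaN, so the lengths of column b are floats
lengths_b = df['b'].str.len()
print(lengths_b.dtype)  # float64

# Shortest entry of each column; idxmin skips NaN by default.
# Assumes unique index labels, so .loc returns a single value.
candidates = [df[col].loc[df[col].str.len().idxmin()] for col in cols]

# Shortest entry overall, compared by length
shortest = min(candidates, key=len)

print(candidates)      # ['h', 'sw', '']
print(repr(shortest))  # ''
```

Note that the empty string in column c counts as the shortest name here; if empty strings should be ignored, they would have to be filtered out first.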
EDIT by comment (minimum length per row instead of per column):
#fails with None
print (df[['a','b','c']].applymap(len).min(axis=1))
0    1
1    0
2    2
dtype: int64

#works with None
print (df[['a','b','c']].apply(lambda x: x.str.len().min(), axis=1))
0    1
1    0
2    2
dtype: int64
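
A variant of the row-wise version, offered only as a sketch: str.len() can be applied once per column instead of once per row, and min(axis=1) then reduces across the columns. This also avoids applymap, which is deprecated in favour of DataFrame.map from pandas 2.1 on. The name row_min is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': [None, 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})
cols = ['a', 'b', 'c']

# apply without axis=1 passes each column as a Series, so the .str accessor
# runs three times in total; min(axis=1) then takes the row-wise minimum
# and skips the NaN produced by None
row_min = df[cols].apply(lambda s: s.str.len()).min(axis=1)

print(row_min)
```

Because column b contains None, the result comes back as floats (1.0, 0.0, 2.0) rather than integers. The intermediate frame of lengths is temporary, but on very large data it does cost memory for one number per cell.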
