How to select DataFrame columns based on partial matching?


Question

This afternoon I was struggling to find a way of selecting a few columns of my Pandas DataFrame by checking for the occurrence of a certain pattern in their names (labels?).

I had been looking for something like contains or isin for ndarrays / pd.Series, but had no luck.

This frustrated me quite a bit, as I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:

# Keep only the rows whose target_column contains neither pattern
hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
df_cln = df[hp]

However, no matter how I banged my head, I could not apply .str.contains() to the object returned by df.columns, which is an Index, nor to the one returned by df.columns.values, which is an ndarray. It works fine on what the "slicing" operation df[column_name] returns, i.e. a Series, though.
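To make the distinction between these three objects concrete, their types can be inspected directly. This is only a minimal sketch: the DataFrame and its column names are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical frame, used only for illustration.
df = pd.DataFrame({"start_exp1_a": [1, 2], "start_exp2_b": [3, 4], "other": [5, 6]})

cols = df.columns             # the column labels, held in a pandas Index
raw = df.columns.values       # the same labels as a plain array
one = df["other"]             # selecting a single column yields a Series
as_series = cols.to_series()  # the labels converted to a Series

print(type(cols).__name__)
print(type(raw).__name__)
print(type(one).__name__)

# A Series of strings exposes the .str accessor used for pattern matching.
print(hasattr(as_series, "str"))
```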

My first solution involved a for loop and the creation of a helper list:

ll = []
for a in df.columns:
    # Keep the labels that start with either prefix
    if a.startswith('start_exp1') or a.startswith('start_exp2'):
        ll.append(a)
df[ll]

(One could apply any of the str methods here, of course.)
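The loop above can also be written as a list comprehension. The following is a sketch under assumptions: the column names are hypothetical and only mirror the two prefixes used in the loop. It relies on the fact that Python's str.startswith accepts a tuple of prefixes.

```python
import pandas as pd

# Hypothetical column names, chosen only to mirror the prefixes in the loop.
df = pd.DataFrame(
    {"start_exp1_a": [1, 2], "start_exp2_b": [3, 4], "other": [5, 6]}
)

# str.startswith accepts a tuple of prefixes, so one call covers both cases.
prefixes = ("start_exp1", "start_exp2")
selected = [col for col in df.columns if col.startswith(prefixes)]

subset = df[selected]
print(subset.columns.tolist())
```

This keeps the logic of the for loop but avoids the explicit append step.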

Then I found the map function and got it to work with the following code:

import re
# Boolean mask: True for every column label that matches the regex
sel = df.columns.map(lambda x: bool(re.search('your_regex', x)))
df[df.columns[sel]]

Of course, I could have performed the same kind of regex check in the first solution, because re.search can be applied to the str values returned by the iteration.

I am very new to Python and have never really programmed anything, so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method, using map, could potentially be faster, besides looking more elegant to my untrained eye.
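Whether map is actually faster is something that can be measured rather than guessed. Here is one possible way to compare the two approaches with the standard timeit module. The frame, its 1000 column names, and the function names with_loop and with_map are all hypothetical, and the timings will vary with the machine and pandas version.

```python
import re
import timeit

import pandas as pd

# Hypothetical wide frame: 1000 columns, half of which start with "start_exp".
names = [f"start_exp{i}" if i % 2 == 0 else f"other_{i}" for i in range(1000)]
df = pd.DataFrame([[0] * len(names)], columns=names)

pattern = re.compile(r"^start_exp")


def with_loop():
    # Explicit loop that builds a helper list of matching labels.
    keep = []
    for name in df.columns:
        if pattern.search(name):
            keep.append(name)
    return df[keep]


def with_map():
    # Index.map builds a boolean mask, which then indexes the columns.
    mask = df.columns.map(lambda name: bool(pattern.search(name)))
    return df[df.columns[mask]]


# Both approaches should select the same columns.
assert with_loop().columns.equals(with_map().columns)

# Time each approach over 20 runs; no particular result is implied here.
print("loop:", timeit.timeit(with_loop, number=20))
print("map: ", timeit.timeit(with_map, number=20))
```

Both versions call the same Python-level regex once per label, so any difference between them is worth checking on real data before relying on it.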

I am curious to know what you think of it and what possible alternatives there are. Given my level of noobness, I would really appreciate it if you could correct any mistakes I may have made in the code and point me in the right direction.

Thanks, 米歇尔

EDIT: I just found the Index method Index.to_series(), which returns, ehm, a Series to which I can apply .str.contains('whatever'). However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function.

Recommended answer

Your solution using map is very good. If you really want to use str.contains, you can convert the Index object to a Series (which has the str.contains method):

In [1]: df
Out[1]: 
   x  y  z
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9

In [2]: df.columns.to_series().str.contains('x')
Out[2]: 
x     True
y    False
z    False
dtype: bool

In [3]: df[df.columns[df.columns.to_series().str.contains('x')]]
Out[3]: 
   x
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9

UPDATE: I just read your last paragraph. According to the documentation, str.contains treats its pattern as a regex by default, so you can pass one directly (str.contains('^myregex')).
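As a sketch of the regex behaviour described in the update, the example below passes an anchored pattern to str.contains. It also shows two alternatives that are not part of the original answer and that assume a reasonably recent pandas version: calling .str directly on the Index, and DataFrame.filter with its regex parameter. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical column names for illustration.
df = pd.DataFrame(
    {"start_exp1_a": [1, 2], "start_exp2_b": [3, 4], "other_start": [5, 6]}
)

# The pattern is treated as a regular expression unless regex=False is passed.
mask = df.columns.to_series().str.contains(r"^start_exp[12]")
by_series = df[df.columns[mask]]

# Assumption: recent pandas versions expose the same .str accessor on Index,
# so the intermediate Series can be skipped.
by_index = df.loc[:, df.columns.str.contains(r"^start_exp[12]")]

# DataFrame.filter(regex=...) keeps the columns whose labels match the pattern.
by_filter = df.filter(regex=r"^start_exp[12]")

print(by_series.columns.tolist())
print(by_index.columns.tolist())
print(by_filter.columns.tolist())
```

The anchor ^ is what excludes the column other_start here, which a plain substring match on 'start' would have kept.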
