How to select DataFrame columns based on partial matching?
Question
This afternoon I was struggling to find a way of selecting a few columns of my Pandas DataFrame by checking for the occurrence of a certain pattern in their names (labels?).
I had been looking for something like contains or isin for ndarray / pd.Series, but had no luck.
This frustrated me quite a bit, as I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:
hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
df_cln= df[hp]
However, no matter how I banged my head, I could not apply .str.contains() to the object returned by df.columns (which is an Index), nor to the one returned by df.columns.values (which is an ndarray). It works fine for what is returned by the "slicing" operation df[column_name], i.e. a Series, though.
My first solution involved a for loop and the creation of a helper list:
ll = []
for a in df.columns:
    if a.startswith('start_exp1') or a.startswith('start_exp2'):
        ll.append(a)
df[ll]
(one could apply any of the str methods, of course)
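For reference, the loop above can be condensed into a list comprehension. This is only a minimal sketch: the DataFrame and its column names are made up for illustration, and it relies on the fact that Python's str.startswith accepts a tuple of prefixes.

```python
import pandas as pd

# Hypothetical frame; the column names are invented for this example.
df = pd.DataFrame({
    "start_exp1_a": [1, 2],
    "start_exp2_b": [3, 4],
    "other": [5, 6],
})

# str.startswith accepts a tuple, so one call checks both prefixes.
prefixes = ("start_exp1", "start_exp2")
selected = [col for col in df.columns if col.startswith(prefixes)]

# Index the frame with the list of matching labels.
subset = df[selected]
print(subset.columns.tolist())
```

The comprehension keeps the original column order, so the result should match what the explicit loop builds.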
Then I found the map function and got it to work with the following code:
import re
sel = df.columns.map(lambda x: bool(re.search('your_regex', x)))
df[df.columns[sel]]
Of course, in the first solution I could have performed the same kind of regex check, because I can apply it to the str objects returned by the iteration.
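As a rough illustration of that point, the regex check can be written as a plain comprehension over the column labels. The frame, the column names and the pattern below are assumptions chosen for the example, not part of the original code.

```python
import re

import pandas as pd

# Hypothetical frame; names and pattern are illustrative only.
df = pd.DataFrame({"exp1_start": [1], "exp2_start": [2], "control": [3]})

# Compile once, then test each column label (a plain str) against it.
pattern = re.compile(r"^exp\d+_")
mask = [bool(pattern.search(col)) for col in df.columns]

# A boolean list can be used with .loc to pick columns.
subset = df.loc[:, mask]
print(subset.columns.tolist())
```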
I am very new to Python and have never really programmed anything, so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method, using map, could potentially be faster, besides looking more elegant to my untrained eye.
I am curious to know what you think of it, and what possible alternatives there would be. Given my level of noobness, I would really appreciate it if you could correct any mistakes I may have made in the code and point me in the right direction.
Thanks, Michele
EDIT: I just found the Index method Index.to_series(), which returns (ehm) a Series to which I could apply .str.contains('whatever').
However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function.
Recommended answer
Your solution using map is very good. If you really want to use str.contains, it is possible to convert Index objects to Series (which have the str.contains method):
In [1]: df
Out[1]:
   x  y  z
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9

In [2]: df.columns.to_series().str.contains('x')
Out[2]:
x     True
y    False
z    False
dtype: bool

In [3]: df[df.columns[df.columns.to_series().str.contains('x')]]
Out[3]:
   x
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
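A possible alternative, not mentioned in the answer itself, is DataFrame.filter, which matches column labels by substring (like=) or by regular expression (regex=). The sketch below rebuilds a small frame similar to the one above; the patterns are illustrative assumptions.

```python
import pandas as pd

# Small frame resembling the one in the session above.
df = pd.DataFrame({"x": range(3), "y": range(3), "z": range(3)})

# like= keeps columns whose label contains the given substring.
only_x = df.filter(like="x")

# regex= keeps columns whose label matches the pattern (re.search semantics).
x_or_z = df.filter(regex="^[xz]$")

print(only_x.columns.tolist())
print(x_or_z.columns.tolist())
```

For a DataFrame, filter works on the columns by default, so no axis argument is needed here.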
UPDATE: I just read your last paragraph. According to the documentation, str.contains allows you to pass a regex by default (str.contains('^myregex')).
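To make the regex point concrete, here is a small sketch that passes a pattern to str.contains. The frame, column names and pattern are hypothetical. The second variant assumes a pandas version in which Index exposes the .str accessor directly, which makes the to_series() step unnecessary.

```python
import pandas as pd

# Hypothetical frame; names and pattern are chosen for illustration.
df = pd.DataFrame({"start_exp1_a": [1], "start_exp2_b": [2], "end_exp1": [3]})

# Variant 1: go through a Series, as in the answer, with a regex pattern.
mask_series = df.columns.to_series().str.contains(r"^start_exp[12]")
via_series = df.loc[:, mask_series.values]

# Variant 2 (assumes Index.str is available): call str.contains on the Index.
mask_index = df.columns.str.contains(r"^start_exp[12]")
via_index = df.loc[:, mask_index]

print(via_series.columns.tolist())
print(via_index.columns.tolist())
```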