How to loop through a pandas df column, finding if a string contains any string from a separate pandas df column?
Problem description
I have two pandas DataFrames in Python. DF A contains a column of sentence-length strings.
|---------------------|------------------|
| sentenceCol | other column |
|---------------------|------------------|
|'this is from france'| 15 |
|---------------------|------------------|
DF B contains a column of country names.
|---------------------|------------------|
| country | other column |
|---------------------|------------------|
|'france' | 33 |
|---------------------|------------------|
|'spain' | 34 |
|---------------------|------------------|
How can I loop through DF A and assign the country that each string contains? Here's what I imagine DF A would look like after assignment:
|---------------------|------------------|-----------|
| sentenceCol | other column | country |
|---------------------|------------------|-----------|
|'this is from france'| 15 | 'france' |
|---------------------|------------------|-----------|
One additional complication is that a sentence can contain more than one country, so ideally this would assign every applicable country to that sentence.
|-------------------------------|------------------|-----------|
| sentenceCol | other column | country |
|-------------------------------|------------------|-----------|
|'this is from france and spain'| 16 | 'france' |
|-------------------------------|------------------|-----------|
|'this is from france and spain'| 16 | 'spain' |
|-------------------------------|------------------|-----------|
Recommended answer
There's no need for a loop here. Looping over a DataFrame is slow, and pandas or numpy have optimized methods for almost all of these problems.
In this case, for your first problem, you are looking for Series.str.extract:
dfa['country'] = dfa['sentenceCol'].str.extract(f"({'|'.join(dfb['country'])})")
sentenceCol other column country
0 this is from france 15 france
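As a minimal, self-contained sketch of the same idea, the snippet below builds the alternation pattern from dfb and applies Series.str.extract. Two details are additions not present in the answer: each country name is passed through re.escape (an assumption that some names could contain regex metacharacters), and expand=False is used so that a Series is returned. The second sentence row is made up purely to show what happens when nothing matches.

```python
import re

import pandas as pd

# Illustrative data; the 'nothing here' row is a hypothetical no-match case.
dfa = pd.DataFrame({'sentenceCol': ['this is from france', 'nothing here'],
                    'other column': [15, 20]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

# Escape each name so characters such as '.' or '(' are matched literally,
# then join the names into one capture group: (france|spain)
pattern = '(' + '|'.join(re.escape(c) for c in dfb['country']) + ')'

# expand=False returns a Series holding the first match per row;
# rows without any match receive NaN.
dfa['country'] = dfa['sentenceCol'].str.extract(pattern, expand=False)
print(dfa)
```

Note that str.extract only keeps the first match in each string, which is why the multi-country case needs a different method.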
For your second problem, you need Series.str.extractall together with Series.drop_duplicates and to_numpy:
dfa['country'] = (
dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
.drop_duplicates()
.to_numpy()
)
sentenceCol other column country
0 this is from france and spain 15 france
1 this is from france and spain 15 spain
Edit
Or, if your sentenceCol is not duplicated, we have to collect the extracted values into a single row. We use GroupBy.agg:
dfa['country'] = (
dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
.groupby(level=0)
.agg(', '.join)
.to_numpy()
)
sentenceCol other column country
0 this is from france and spain 15 france, spain
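One caveat worth sketching: extractall drops rows that have no match, so converting with to_numpy can produce an array shorter than dfa when some sentence contains no country. A possible variant, assuming that situation can occur in your data, is to assign the aggregated Series directly and let pandas align it on the index. The second sentence row is a made-up no-match case.

```python
import pandas as pd

# Illustrative data; the second row is a hypothetical sentence with no country.
dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain',
                                    'no country here'],
                    'other column': [16, 17]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

pattern = f"({'|'.join(dfb['country'])})"

# extractall returns one row per match, indexed by (original index, match number);
# column 0 holds the captured country.
matches = dfa['sentenceCol'].str.extractall(pattern)[0]

# Collapse the matches of each original row into one comma-separated string.
joined = matches.groupby(level=0).agg(', '.join)

# Assigning a Series aligns on the index, so unmatched rows become NaN
# instead of causing a length mismatch.
dfa['country'] = joined
print(dfa)
```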
Edit2
To duplicate the original rows, we join the DataFrame back to our extraction:
extraction = (
dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
.rename(columns={0: 'country'})
)
dfa = extraction.droplevel(1).join(dfa).reset_index(drop=True)
country sentenceCol other column
0 france this is from france and spain 15
1 spain this is from france and spain 15
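A different route to the same one-row-per-country shape, not the one used in the answer, is Series.str.findall followed by DataFrame.explode. This is only a sketch under the assumption that your pandas version provides explode (0.25 or later). Unlike the join above, it keeps sentences with no match, giving them NaN in the country column. The second sentence row is hypothetical.

```python
import pandas as pd

# Illustrative data; the second row is a hypothetical sentence with no country.
dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain',
                                    'no country here'],
                    'other column': [16, 17]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

pattern = '|'.join(dfb['country'])

result = (
    # findall yields a list of every match per row (an empty list if none).
    dfa.assign(country=dfa['sentenceCol'].str.findall(pattern))
       # explode turns each list element into its own row; empty lists become NaN.
       .explode('country')
       .reset_index(drop=True)
)
print(result)
```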
Dataframes used:
import pandas as pd

dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain'] * 2,
                    'other column': [15] * 2})
dfb = pd.DataFrame({'country': ['france', 'spain']})