How to loop through pandas df column, finding if string contains any string from a separate pandas df column?


Problem description

I have two pandas DataFrames in Python. DF A contains a column of sentence-length strings.

|---------------------|------------------|
|        sentenceCol  |    other column  |
|---------------------|------------------|
|'this is from france'|         15       |
|---------------------|------------------|

DF B contains a column that lists countries.

|---------------------|------------------|
|        country      |    other column  |
|---------------------|------------------|
|'france'             |         33       |
|---------------------|------------------|
|'spain'              |         34       |
|---------------------|------------------|

How can I loop through DF A and assign the country that each string contains? Here's what I imagine DF A would look like after assignment:

|---------------------|------------------|-----------|
|        sentenceCol  |    other column  | country   |
|---------------------|------------------|-----------|
|'this is from france'|         15       |  'france' |
|---------------------|------------------|-----------|

One additional complication is that a sentence can contain more than one country, so ideally every applicable country would be assigned to that sentence, one row per country:

|-------------------------------|------------------|-----------|
|        sentenceCol            |    other column  | country   |
|-------------------------------|------------------|-----------|
|'this is from france and spain'|         16       |  'france' |
|-------------------------------|------------------|-----------|
|'this is from france and spain'|         16       |  'spain'  |
|-------------------------------|------------------|-----------|

Recommended answer

There's no need for a loop here. Looping over a dataframe is slow, and pandas and NumPy provide optimized methods for almost all problems of this kind.

In this case, for your first problem, you are looking for Series.str.extract. The country names are joined with '|' into a single regex capture group, and the first match in each sentence is returned:

dfa['country'] = dfa['sentenceCol'].str.extract(f"({'|'.join(dfb['country'])})")

           sentenceCol  other column country
0  this is from france            15  france
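
A self-contained sketch of the same idea follows. The second sentence and the use of re.escape are my additions, not part of the original answer: escaping is only an assumption that some names in dfb could contain regex metacharacters, and expand=False is used so that a Series rather than a one-column DataFrame is assigned.

```python
import re
import pandas as pd

# Sample data; the second sentence is a hypothetical row with no country in it.
dfa = pd.DataFrame({'sentenceCol': ['this is from france', 'nothing to see here'],
                    'other column': [15, 20]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

# Escape each name so it is matched literally, then build one capture group
# that matches any of the names.
pattern = '(' + '|'.join(re.escape(c) for c in dfb['country']) + ')'

# expand=False returns a Series; sentences without a match get NaN.
dfa['country'] = dfa['sentenceCol'].str.extract(pattern, expand=False)
print(dfa)
```

Only the first match per sentence is kept here, which is why the multi-country case needs a different method.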


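For your second problem, you need Series.str.extractall together with drop_duplicates and to_numpy. This works with the sample data because each sentence appears once per country it contains, so the de-duplicated matches line up with the rows: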

dfa['country'] = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .drop_duplicates()
        .to_numpy()
)

                     sentenceCol  other column country
0  this is from france and spain            15  france
1  this is from france and spain            15   spain
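
To see why the later steps drop or group on an index level, it may help to look at what extractall returns on its own. This is a small sketch with made-up sentences: the result has one row per match, and its index is a MultiIndex of (original row label, match number).

```python
import pandas as pd

# Hypothetical sentences, used only to show the shape of the result.
sentences = pd.Series(['this is from france and spain', 'only spain here'])

matches = sentences.str.extractall('(france|spain)')

# One row per match; column 0 holds the text captured by the first group.
# Index level 0 is the original row label, level 1 ('match') counts matches.
print(matches)
print(matches.index.names)
```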


Edit

Or, if your sentenceCol is not duplicated, we have to collect the extracted values into a single row. We use GroupBy.agg:

dfa['country'] = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .groupby(level=0)
        .agg(', '.join)
        .to_numpy()
)

                     sentenceCol  other column        country
0  this is from france and spain            15  france, spain
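
One caveat with to_numpy is that it assigns by position, so it assumes every sentence has at least one match. A possible variation, sketched below with hypothetical extra rows, assigns the grouped Series directly so that pandas aligns it on the index and leaves NaN where nothing matched.

```python
import pandas as pd

# Sample data; the second and third sentences are hypothetical additions.
dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain',
                                    'no country here',
                                    'spain only'],
                    'other column': [15, 16, 17]})
dfb = pd.DataFrame({'country': ['france', 'spain']})
pattern = '(' + '|'.join(dfb['country']) + ')'

joined = (
    dfa['sentenceCol'].str.extractall(pattern)[0]  # Series of matches
        .groupby(level=0)                          # group by original row
        .agg(', '.join)                            # one string per row
)

# Assigning a Series aligns on the index, so row 1 (no match) becomes NaN.
dfa['country'] = joined
print(dfa)
```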


Edit2

To duplicate the original rows, one per matched country, we join the dataframe back to our extraction (here dfa starts with a single row per sentence):

extraction = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .rename(columns={0: 'country'})
)

dfa = extraction.droplevel(1).join(dfa).reset_index(drop=True)

  country                    sentenceCol  other column
0  france  this is from france and spain            15
1   spain  this is from france and spain            15
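
If sentences without any country should be kept as well, one option is to join in the other direction with a left join. This is only a sketch under that assumption, with hypothetical extra sentences; it also keeps the original column order.

```python
import pandas as pd

# Sample data; the second and third sentences are hypothetical additions.
dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain',
                                    'this is from france',
                                    'no country here'],
                    'other column': [16, 15, 14]})
dfb = pd.DataFrame({'country': ['france', 'spain']})
pattern = '(' + '|'.join(dfb['country']) + ')'

extraction = (
    dfa['sentenceCol'].str.extractall(pattern)
        .rename(columns={0: 'country'})
        .droplevel(1)          # keep only the original row label
)

# Left join on the index: rows repeat once per match, unmatched rows get NaN.
result = dfa.join(extraction, how='left').reset_index(drop=True)
print(result)
```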


Dataframes used:

import pandas as pd

dfa = pd.DataFrame({'sentenceCol':['this is from france and spain']*2,
                   'other column':[15]*2})

dfb = pd.DataFrame({'country':['france', 'spain']})
