How to extract exact matches with list from a dataframe column?


Question

I have a large dataframe with a text column, and I want to find which words from a list (around 1k words) appear in it.

I have managed to get the absence/presence of a word from the list in the dataframe, but it is also important to me to know which word matched. Sometimes a row has an exact match with more than one word from the list, and I would like to have them all.

I tried to use the code below, but it gives me partial matches: syllables instead of full words.

# Code to recreate the initial DF

import pandas as pd

df_data = [['orange', '0'],
           ['apple and lemon', '1'],
           ['lemon and orange', '1']]

# Each row holds two values, so only two column names are passed
df = pd.DataFrame(df_data, columns=['text', 'match'])

Initial DF:

 text                 match
 orange               0
 apple and lemon      1
 lemon and orange     1

This is the list of words I need to match:

 exactmatch = ['apple', 'lemon']

Expected result:

 text                    match  exact words
 orange                    0         0 
 apple and lemon           1        'apple','lemon'
 lemon and orange          1        'lemon'

This is what I tried:

# for some rows it gives me the words I want,
# and for some it gives me parts of a word

# regex attempt 1, gives me partial matches (syllables or single letters)

pattern1 = '|'.join(exactmatch)
df['contains'] = df['text'].str.extract("(" + pattern1 + ")", expand=False)

#regex attempt 2 - this gives me an error - unexpected EOL

df['contains'] = df['text'].str.extractall
("(" + "|".join(exactmatch) +")").unstack().apply(','.join, 1)

#TypeError: ('sequence item 1: expected str instance, float found', 
#'occurred at index 2')

# no regex attempt, does not give me matches even if the word is in there

lst = list(df['text'])
match = []
for w in lst:
    if w in exactmatch:
        match.append(w)
        break
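
The loop above tests whether the entire cell text is an element of the list, so a cell like 'apple and lemon' never matches. A minimal non-regex sketch that compares individual words instead is shown here; it assumes words are separated by whitespace with no attached punctuation, and the helper name `words_in_list` and the column name `found` are made up for illustration.

```python
import pandas as pd

exactmatch = ['apple', 'lemon']
df = pd.DataFrame({'text': ['orange', 'apple and lemon', 'lemon and orange']})

# A set makes each membership test cheap, which matters with ~1k words
wanted = set(exactmatch)

def words_in_list(text):
    # Hypothetical helper: split on whitespace and keep listed words, in order
    return [w for w in text.split() if w in wanted]

df['found'] = df['text'].apply(words_in_list)
print(df['found'].tolist())
# [[], ['apple', 'lemon'], ['lemon']]
```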

Recommended answer

Use str.findall

For example:

import pandas as pd

exactmatch = ['apple', 'lemon']
df_data = [['orange'], ['apple and lemon'], ['lemon and orange']]

df = pd.DataFrame(df_data, columns=['text'])
# findall returns a list of every match per row; join turns it into one string
df['exact word'] = df["text"].str.findall("|".join(exactmatch)).apply(", ".join)
print(df)

Output:

               text    exact word
0            orange              
1   apple and lemon  apple, lemon
2  lemon and orange         lemon
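
A plain alternation such as apple|lemon also matches inside longer words (for instance, 'pineapple' contains 'apple'). If only whole words should count, one possible variant wraps the alternation in \b word boundaries and escapes each word with re.escape. This is a sketch rather than part of the original answer, and the 'pineapple juice' row is an added, hypothetical test case.

```python
import re

import pandas as pd

exactmatch = ['apple', 'lemon']
# The last row is a hypothetical case that must NOT match 'apple'
df = pd.DataFrame({'text': ['orange', 'apple and lemon',
                            'lemon and orange', 'pineapple juice']})

# Escape each word so regex metacharacters are treated literally, then
# wrap the alternation in \b so a match must start and end on a word boundary
pattern = r"\b(?:" + "|".join(re.escape(w) for w in exactmatch) + r")\b"

df['exact word'] = df['text'].str.findall(pattern).apply(", ".join)
print(df['exact word'].tolist())
# ['', 'apple, lemon', 'lemon', '']
```

The group is non-capturing, (?:...), so findall returns the full matched words rather than group contents.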
