How to delete words from a dataframe column that are present in a dictionary in Pandas

Question

Extension to: Removing a list of words from a string

I have the following dataframe and I want to delete frequently occurring words from the df.name column:

df:

name
Bill Hayden
Rock Clinton
Bill Gates
Vishal James
James Cameroon
Micky James
Michael Clark
Tony Waugh  
Tom Clark
Tom Bill
Avinash Clinton
Shreyas Clinton
Ramesh Clinton
Adam Clark

I'm creating a new dataframe of words and their frequencies with the following code (the original dataframe shown above is named data here):

df = pd.DataFrame(data.name.str.split(expand=True).stack().value_counts())
df.reset_index(level=0, inplace=True)
df.columns = ['word', 'freq']
df = df[df['freq'] >= 3]

This results in:

df2:

word    freq
Clinton 4
Bill    3
James   3
Clark   3
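
The frequency table above can be reproduced with a minimal, self-contained sketch. It assumes the sample names from the question are held in a dataframe called `data`; the names `counts`, `frequent`, `freq_words` and `thresh` are illustrative and not part of the original post. It uses `explode` rather than `split(expand=True).stack()`, which should give the same counts.

```python
import pandas as pd

# Sample data copied from the question.
names = [
    "Bill Hayden", "Rock Clinton", "Bill Gates", "Vishal James",
    "James Cameroon", "Micky James", "Michael Clark", "Tony Waugh",
    "Tom Clark", "Tom Bill", "Avinash Clinton", "Shreyas Clinton",
    "Ramesh Clinton", "Adam Clark",
]
data = pd.DataFrame({"name": names})

# Split each name on whitespace, put one word per row, then count occurrences.
counts = data["name"].str.split().explode().value_counts()

# Keep only the words that occur at least `thresh` times.
thresh = 3
frequent = counts[counts >= thresh]

# A set of the frequent words is enough for membership tests later on.
freq_words = set(frequent.index)
print(frequent)
```

Since only the words themselves are needed for removal, a set (or the index of `frequent`) can stand in for the word-to-frequency dictionary.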

Then I convert it into a dictionary with the following snippet:

    d = dict(zip(df['word'], df['freq']))

Now, to remove from df.name the words that are in d (a dictionary of word: freq), I'm using the following snippet:

def check_thresh_word(merc, d):
    # Return False as soon as any word of merc is a key of d, True otherwise.
    for word in merc.split(' '):
        if word in d:
            return False
    return True

def rm_freq_occurences(merc,d):
    if check_thresh_word(merc,d) == False:
        nwords = merc.split(' ')
        rwords = [word for word in nwords if word not in d.keys()]
        m = ' '.join(rwords)
    else:
        m=merc
    return m

# The names live in `data`; `df` was reassigned to the word/frequency table above.
data['new_name'] = data['name'].apply(lambda x: rm_freq_occurences(x, d))

But my actual dataframe contains nearly 240k rows, and I have to use a threshold greater than 100 (the threshold is 3 in the sample above). The code above takes a long time to run because it searches the dictionary word by word for every row. Is there an efficient way to make it faster?

The following is the desired output:

name
Hayden
Rock
Gates
Vishal
Cameroon
Micky
Michael
Tony Waugh
Tom
Tom
Avinash
Shreyas
Ramesh
Adam

Thanks in advance!

Recommended answer

Use replace with a regex built by joining all values of the word column with |, then strip the leftover leading and trailing whitespace:

data.name = data.name.replace('|'.join(df['word']), '', regex=True).str.strip()
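
One caveat with a plain joined pattern: it matches the listed words anywhere, including inside longer words, and `.str.strip()` only trims the ends of each string. A hedged variant, assuming whole-word matching is what is wanted, adds `\b` word boundaries, escapes the words with `re.escape`, and collapses repeated spaces. The names "Billy Clinton" and "Mary Bill Jones" are made-up test cases, and `freq_words` and `cleaned` are illustrative names, none of them from the original post.

```python
import re
import pandas as pd

# Small illustrative frame; the last two names are hypothetical edge cases.
data = pd.DataFrame(
    {"name": ["Bill Hayden", "Tom Bill", "Tony Waugh", "Billy Clinton", "Mary Bill Jones"]}
)
freq_words = ["Clinton", "Bill", "James", "Clark"]

# \b restricts matches to whole words; re.escape guards against regex metacharacters.
pat = r"\b(?:" + "|".join(re.escape(w) for w in freq_words) + r")\b"

cleaned = (
    data["name"]
    .str.replace(pat, "", regex=True)       # remove the frequent words
    .str.replace(r"\s+", " ", regex=True)   # collapse runs of whitespace left behind
    .str.strip()                            # trim the ends
)
print(cleaned.tolist())
```

With this pattern "Billy" is left intact because "Bill" is not followed by a word boundary there.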

Another solution is to add \s* around each word so that zero or more surrounding whitespace characters are removed as well:

pat = '|'.join([r'\s*{}\s*'.format(x) for x in df['word']])
print (pat)
\s*Clinton\s*|\s*James\s*|\s*Bill\s*|\s*Clark\s*

data.name = data.name.replace(pat, '', regex=True)

print (data)
          name
0       Hayden
1         Rock
2        Gates
3       Vishal
4     Cameroon
5        Micky
6      Michael
7   Tony Waugh
8          Tom
9          Tom
10     Avinash
11     Shreyas
12      Ramesh
13        Adam
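
For comparison, here is a regex-free sketch that stays closer to the code in the question: it splits each name once and filters the tokens against a set. The helper `drop_frequent` and the variable `freq_words` are hypothetical names introduced for illustration. Whether this is faster than the regex approach on 240k rows is not something the original post measures, so it is worth timing both on the real data.

```python
import pandas as pd

# A few rows from the question's sample data.
data = pd.DataFrame({"name": ["Bill Hayden", "Rock Clinton", "Tony Waugh", "Tom Bill"]})

# A set gives constant-time membership tests on average.
freq_words = {"Clinton", "Bill", "James", "Clark"}

def drop_frequent(name, words=freq_words):
    # Keep only the tokens that are not in the frequent-word set.
    return " ".join(tok for tok in name.split() if tok not in words)

# A list comprehension avoids the per-row overhead of DataFrame.apply.
data["new_name"] = [drop_frequent(n) for n in data["name"]]
print(data)
```

Because the name is split on whitespace and rejoined, only whole words are removed and no stray spaces are left behind.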
