How to delete words from a dataframe column that are present in a dictionary in Pandas
Question
An extension of: Removing list of words from a string
I have the following dataframe, and I want to delete frequently occurring words from the df.name column:
df:
name
Bill Hayden
Rock Clinton
Bill Gates
Vishal James
James Cameroon
Micky James
Michael Clark
Tony Waugh
Tom Clark
Tom Bill
Avinash Clinton
Shreyas Clinton
Ramesh Clinton
Adam Clark
I'm creating a new dataframe of words and their frequencies with the following code:
df = pd.DataFrame(data.name.str.split(expand=True).stack().value_counts())
df.reset_index(level=0, inplace=True)
df.columns = ['word', 'freq']
df = df[df['freq'] >= 3]
This results in:
df2:
word freq
Clinton 4
Bill 3
James 3
Clark 3
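For readers who want to reproduce the frequency table, here is a minimal self-contained sketch. It assumes the sample names above live in a frame called data (as in the snippet), and it names the columns through rename_axis and reset_index rather than by assigning df.columns, which is only one of several ways to do this.

```python
import pandas as pd

# Sample data copied from the question; `data` holds the original names.
data = pd.DataFrame({"name": [
    "Bill Hayden", "Rock Clinton", "Bill Gates", "Vishal James",
    "James Cameroon", "Micky James", "Michael Clark", "Tony Waugh",
    "Tom Clark", "Tom Bill", "Avinash Clinton", "Shreyas Clinton",
    "Ramesh Clinton", "Adam Clark",
]})

# Split every name into words, flatten to one long Series, count each word.
counts = data["name"].str.split(expand=True).stack().value_counts()

# Turn the counts into a two-column frame: word, freq.
freq = counts.rename_axis("word").reset_index(name="freq")

# Keep only the words that occur at least `thresh` times.
thresh = 3
frequent = freq[freq["freq"] >= thresh]
print(frequent)
```

The order of words with equal counts is not guaranteed, so the rows for Bill, James and Clark may appear in a different order than shown above.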
Then I convert it into a dictionary with the following snippet:
d = dict(zip(df['word'], df['freq']))
Now, to remove from df.name the words that are in d (a dictionary mapping word to freq), I'm using the following snippet:
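def check_thresh_word(merc, d):
    # Return False if any word of the name is a frequent word, True otherwise.
    m = merc.split(' ')
    for i in range(len(m)):
        if m[i] in d.keys():
            return False
    # Only report True after every word has been checked.
    return True

def rm_freq_occurences(merc, d):
    # Drop the frequent words from the name; leave other names untouched.
    if check_thresh_word(merc, d) == False:
        nwords = merc.split(' ')
        rwords = [word for word in nwords if word not in d.keys()]
        m = ' '.join(rwords)
    else:
        m = merc
    return m

df['new_name'] = df['name'].apply(lambda x: rm_freq_occurences(x, d))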
In practice, though, my dataframe (df) contains nearly 240k rows, and I have to use a threshold greater than 100 (thresh=3 in the sample above). The code above takes a long time to run because of the costly search. Is there an efficient way to make it faster?
The following is the desired output:
name
Hayden
Rock
Gates
Vishal
Cameroon
Micky
Michael
Tony Waugh
Tom
Tom
Avinash
Shreyas
Ramesh
Adam
Thanks in advance!
Recommended answer
Use replace with a regex built by joining all values of column word with |, then strip to remove the trailing whitespace:
data.name = data.name.replace('|'.join(df['word']), '', regex=True).str.strip()
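One caveat with a plain joined pattern is that it also matches inside longer words, and any regex metacharacters in the words would be interpreted as pattern syntax. The sketch below is a hedged variant, not part of the original answer: it wraps the alternation in \b word boundaries and escapes each word with re.escape. The name "Billy Joel" and the variables words and cleaned are hypothetical, added only to show the difference.

```python
import re
import pandas as pd

# "Billy Joel" is a made-up row to show that partial matches are left alone.
data = pd.DataFrame({"name": ["Bill Hayden", "Billy Joel", "Tom Bill", "Tony Waugh"]})
words = ["Clinton", "Bill", "James", "Clark"]  # the frequent words from df['word']

# \b...\b restricts matches to whole words; re.escape guards special characters.
pat = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"

cleaned = (
    data["name"]
    .str.replace(pat, "", regex=True)       # remove the frequent words
    .str.replace(r"\s+", " ", regex=True)   # collapse leftover runs of spaces
    .str.strip()                            # trim leading/trailing spaces
)
print(cleaned.tolist())
```

Without the word boundaries, "Billy Joel" would be reduced to "y Joel", which is probably not what is wanted on real data.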
Another solution is to add \s* around each word so the pattern also matches zero or more surrounding whitespace characters:
pat = '|'.join([r'\s*{}\s*'.format(x) for x in df['word']])
print (pat)
\s*Clinton\s*|\s*James\s*|\s*Bill\s*|\s*Clark\s*
data.name = data.name.replace(pat, '', regex=True)
print (data)
name
0 Hayden
1 Rock
2 Gates
3 Vishal
4 Cameroon
5 Micky
6 Michael
7 Tony Waugh
8 Tom
9 Tom
10 Avinash
11 Shreyas
12 Ramesh
13 Adam
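As a regex-free alternative, the question's own approach can be reduced to a single pass per name with a set lookup. This is only a sketch under the same sample data: the helper drop_frequent and the set frequent are hypothetical names, and whether it beats the regex version on 240k rows is something to measure rather than assume.

```python
import pandas as pd

data = pd.DataFrame({"name": [
    "Bill Hayden", "Rock Clinton", "Bill Gates", "Vishal James",
    "James Cameroon", "Micky James", "Michael Clark", "Tony Waugh",
    "Tom Clark", "Tom Bill", "Avinash Clinton", "Shreyas Clinton",
    "Ramesh Clinton", "Adam Clark",
]})

# A set gives constant-time membership tests; built here by hand,
# but set(df['word']) or set(d) would yield the same thing.
frequent = {"Clinton", "Bill", "James", "Clark"}

def drop_frequent(name, blocked=frequent):
    # Keep only the words that are not in the blocked set.
    return " ".join(w for w in name.split() if w not in blocked)

data["new_name"] = data["name"].map(drop_frequent)
print(data["new_name"].tolist())
```

Because each name is split once and filtered once, the separate check_thresh_word pass from the question is no longer needed.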