How to group text data based on document similarity?
Problem description
Consider a dataframe like the one below:
import pandas as pd
df = pd.DataFrame({'Questions': ['What are you doing?', 'What are you doing tonight?', 'What are you doing now?',
                                 'What is your name?', 'What is your nick name?', 'What is your full name?',
                                 'Shall we meet?', 'How are you doing?']})
Questions
0 What are you doing?
1 What are you doing tonight?
2 What are you doing now?
3 What is your name?
4 What is your nick name?
5 What is your full name?
6 Shall we meet?
7 How are you doing?
How can the dataframe be grouped so that similar questions end up together? That is, how can groups like the ones below be obtained?
for _, i in df.groupby('similarity')['Questions']:
    print(i, '\n')
6 Shall we meet?
Name: Questions, dtype: object
3 What is your name?
4 What is your nick name?
5 What is your full name?
Name: Questions, dtype: object
0 What are you doing?
1 What are you doing tonight?
2 What are you doing now?
7 How are you doing?
Name: Questions, dtype: object
A similar question was asked here, but it was less clear, so it received no answers.
Recommended answer
Here is one fairly long approach: compute a normalized similarity score between every pair of elements in the series, then group the rows by the resulting list of similarity scores, converted to a string:
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

# One-time setup: the tokenizer, POS tagger, and WordNet data must be fetched
# with nltk.download(...); the exact resource names depend on the NLTK version.


def convert_tag(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    return tag_dict.get(tag[0])


def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.
    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.
    Args:
        doc: string to be converted
    Returns:
        list of synsets
    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'),
              Synset('friend.n.01')]
    """
    synsetlist = []
    tokens = nltk.word_tokenize(doc)
    for word, tag in nltk.pos_tag(tokens):
        synsets = wn.synsets(word, convert_tag(tag))
        if synsets:  # skip words WordNet does not know
            synsetlist.append(synsets[0])
    return synsetlist


def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2.
    For each synset in s1, find the synset in s2 with the largest similarity value.
    Sum all of the largest similarity values and normalize the sum by dividing
    it by the number of largest similarity values found.
    Args:
        s1, s2: list of synsets from doc_to_synsets
    Returns:
        normalized similarity score of s1 onto s2
    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """
    highscores = []
    for synset1 in s1:
        highest_yet = 0
        for synset2 in s2:
            simscore = synset1.path_similarity(synset2)
            # path_similarity can return None when no path connects the synsets
            if simscore is not None and simscore > highest_yet:
                highest_yet = simscore
        if highest_yet > 0:
            highscores.append(highest_yet)
    return sum(highscores) / len(highscores) if highscores else 0


def document_path_similarity(doc1, doc2):
    """Symmetric score: the average of the two directed similarity scores."""
    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)
    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2


def similarity(x, df):
    """Score question x against every question in df['Questions']."""
    sim_score = []
    for i in df['Questions']:
        sim_score.append(document_path_similarity(x, i))
    return sim_score
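As a side note, the scoring logic in similarity_score and document_path_similarity can be tried without WordNet. The sketch below is only an illustration: the names directed_score, symmetric_score, toy_sim, and the TOY table are hypothetical, and the similarity values are made up rather than taken from WordNet.

```python
# Hypothetical pairwise similarities (made-up values, not WordNet output).
TOY = {
    frozenset(("cat", "dog")): 0.2,
    frozenset(("like", "love")): 0.5,
}


def toy_sim(a, b):
    """Stand-in for path_similarity: 1.0 for equal words, None if unrelated."""
    if a == b:
        return 1.0
    return TOY.get(frozenset((a, b)))


def directed_score(s1, s2, sim):
    """Average, over the items of s1, of the best match found in s2.

    Items of s1 with no positive match are skipped, as in similarity_score.
    """
    best = []
    for a in s1:
        top = max((sim(a, b) or 0.0 for b in s2), default=0.0)
        if top > 0:
            best.append(top)
    return sum(best) / len(best) if best else 0.0


def symmetric_score(s1, s2, sim):
    """Average of the two directed scores, as in document_path_similarity."""
    return (directed_score(s1, s2, sim) + directed_score(s2, s1, sim)) / 2


doc_a = ["like", "cat"]
doc_b = ["love", "dog", "cat"]
print(directed_score(doc_a, doc_b, toy_sim))   # best matches: 0.5 and 1.0
print(directed_score(doc_b, doc_a, toy_sim))   # best matches: 0.5, 0.2 and 1.0
print(symmetric_score(doc_a, doc_b, toy_sim))
```

The two directed scores differ because each one averages over the items of its first argument, which is why the final score takes the mean of both directions.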
With the functions defined above, we can now do:
df['similarity'] = df['Questions'].apply(lambda x : similarity(x,df)).astype(str)
for _, i in df.groupby('similarity')['Questions']:
    print(i, '\n')
Output:
6 Shall we meet?
Name: Questions, dtype: object
3 What is your name?
4 What is your nick name?
5 What is your full name?
Name: Questions, dtype: object
0 What are you doing?
1 What are you doing tonight?
2 What are you doing now?
7 How are you doing?
Name: Questions, dtype: object
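The grouping works because astype(str) turns each row's list of scores into a single string key, so only questions whose score lists are exactly equal land in the same group. Here is a minimal, pandas-free sketch of that idea. The rows dictionary holds made-up scores, not output from the code above, and group_by_row is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical similarity rows: one list of scores per question (made-up values).
rows = {
    "What is your name?": [1.0, 0.5, 0.0],
    "What is your nick name?": [1.0, 0.5, 0.0],
    "Shall we meet?": [0.0, 0.0, 1.0],
}


def group_by_row(rows):
    """Group questions whose score lists are exactly equal.

    str(scores) plays the role of the 'similarity' column after astype(str).
    """
    groups = defaultdict(list)
    for question, scores in rows.items():
        groups[str(scores)].append(question)
    return list(groups.values())


groups = group_by_row(rows)
print(groups)
```

Because the key is an exact string match, a small difference in any single score is enough to split a group.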
This isn't the best approach to the problem, and it is really slow, since every pair of questions is tokenized, tagged, and compared again. Any new approach is highly appreciated.
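One possible direction, offered only as a sketch and not as the method used in the answer: compute the features of each question once, score each pair once, and join questions whose score reaches a threshold. The example swaps in a plain Jaccard overlap of lowercase words for the WordNet score so that it runs with the standard library alone. The names word_set, jaccard, and group_by_threshold, and the threshold of 0.55, are assumptions chosen for illustration.

```python
import re
from itertools import combinations


def word_set(text):
    """Lowercase word set of a question (stand-in for doc_to_synsets)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Share of words two sets have in common (stand-in for the WordNet score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_threshold(docs, threshold=0.55, featurize=word_set, sim=jaccard):
    """Return lists of indices whose documents are linked by sim >= threshold."""
    feats = [featurize(d) for d in docs]   # features computed once per document
    parent = list(range(len(docs)))        # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):  # each pair scored once
        if sim(feats[i], feats[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


questions = ['What are you doing?', 'What are you doing tonight?',
             'What are you doing now?', 'What is your name?',
             'What is your nick name?', 'What is your full name?',
             'Shall we meet?', 'How are you doing?']
for group in group_by_threshold(questions):
    print([questions[i] for i in group])
```

The featurize and sim arguments could in principle be replaced by doc_to_synsets and a symmetric wrapper around similarity_score, so that each question is tagged only once. Note that linking by threshold chains groups together: two questions can share a group through an intermediate one even if their own score is below the threshold.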