How to get all the unique words in the data frame?
Problem description
I have a DataFrame with a list of products and their respective reviews:
+-----------+------------------------------------------------+
| product   | review                                         |
+-----------+------------------------------------------------+
| product_a | It's good for a casual lunch                   |
+-----------+------------------------------------------------+
| product_b | Avery is one of the most knowledgable baristas |
+-----------+------------------------------------------------+
| product_c | The tour guide told us the secrets             |
+-----------+------------------------------------------------+
How can I get all the unique words in the DataFrame?
I wrote a function:
from collections import Counter

def count_words(text):
    try:
        text = text.lower()
        words = text.split()
        counts = Counter(words)
    except AttributeError:
        # Non-string values (e.g. NaN) have no .lower() method
        counts = {'': 0}
    return counts
I applied the function to the DataFrame, but that only gives me the word counts for each row:
reviews['words_count'] = reviews['review'].apply(count_words)
Recommended answer
dfx
review
0 United Kingdom
1 The United Kingdom
2 Dublin, Ireland
3 Mardan, Pakistan
To get all the unique words in the "review" column:
list(dfx['review'].str.split(' ', expand=True).stack().unique())
['United', 'Kingdom', 'The', 'Dublin,', 'Ireland', 'Mardan,', 'Pakistan']
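Note that splitting on a single space keeps punctuation and case, so 'Dublin,' and 'Mardan,' carry their commas. If that is not wanted, one possible variant (a sketch only, assuming words are runs of letters, digits, or underscores and that case should be ignored) is to lowercase the column and extract words with a regular expression. The sample frame below just rebuilds dfx from the output shown above.

```python
import pandas as pd

# Rebuild the sample frame shown above
dfx = pd.DataFrame({
    "review": ["United Kingdom", "The United Kingdom",
               "Dublin, Ireland", "Mardan, Pakistan"]
})

# Lowercase each review, extract word-like tokens (punctuation is dropped),
# then flatten the per-row lists into one Series with one word per row.
words = (
    dfx["review"]
    .str.lower()
    .str.findall(r"\w+")
    .explode()
    .dropna()  # rows that were missing or had no words become NaN
)

unique_words = sorted(words.unique())
print(unique_words)
```

Series.explode requires pandas 0.25 or later; on older versions the split/stack approach above can be combined with str.lower() and str.replace() instead.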
To get the word counts for the "review" column:
dfx['review'].str.split(' ', expand=True).stack().value_counts()
United 2
Kingdom 2
Mardan, 1
The 1
Ireland 1
Dublin, 1
Pakistan 1
dtype: int64
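For comparison, the Counter approach from the question can also produce totals for the whole column if a single Counter is updated across all rows rather than creating one per row. This is a minimal sketch, not the answer's method; the helper name count_all_words is hypothetical, and a plain list stands in for the column here (a pandas Series such as dfx['review'] can be passed in the same way, since it is iterable).

```python
from collections import Counter

def count_all_words(texts):
    """Count words across every string in `texts`; non-string entries are skipped."""
    totals = Counter()
    for text in texts:
        if isinstance(text, str):  # skips NaN/None instead of raising
            totals.update(text.lower().split())
    return totals

# Stand-in for the "review" column
reviews = ["United Kingdom", "The United Kingdom",
           "Dublin, Ireland", "Mardan, Pakistan"]

totals = count_all_words(reviews)
unique_words = sorted(totals)      # the Counter's keys are the unique words
print(totals.most_common(2))
print(unique_words)
```

Because this version splits on whitespace only, tokens such as 'dublin,' still include their punctuation, just as in the split/stack output above.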