How to find ngram frequency of a column in a pandas dataframe?

Views: 138 | Posted: 2020/5/18 0:36:56 | Tags: pandas, nlp, scikit-learn, nltk, text-mining

Question

Below is the input pandas dataframe I have.

I want to find the frequency of unigrams and bigrams. A sample of what I am expecting is shown below.

How can I do this using nltk or scikit-learn?

I wrote the code below, which takes a string as input. How can I extend it to a Series/DataFrame?

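import nltk
from nltk.collocations import BigramCollocationFinder

desc = 'john is a guy person you him guy person you him'
tokens = nltk.word_tokenize(desc)  # needs the nltk tokenizer data to be downloaded
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.ngram_fd.items()  # bigram -> count; viewitems() only exists in Python 2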

Recommended answer

If your data looks like this:

import pandas as pd
df = pd.DataFrame([
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid'], columns=['description'])

You can use the CountVectorizer from the sklearn package:

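from sklearn.feature_extraction.text import CountVectorizer
# ngram_range=(1, 2) counts both unigrams and bigrams
word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['description'])
# add up the per-document rows to get one total per ngram
frequencies = sum(sparse_matrix).toarray()[0]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])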

Which gives you:

                frequency
acting          3
average         1
average movie   1
avoid           2
bad             1
bad acting      1
good            3
good acting     2
good movie      1
movie           2
movie bad       1
movie good      1
must            1
must watch      1
pathetic        1
pathetic avoid  1
watch           1
watch good      1
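
As a cross-check, the same counts can be reproduced without scikit-learn. This is only a sketch, not what CountVectorizer does internally: it assumes the vectorizer's default behaviour (lowercasing, and tokens made of two or more word characters), and `ngram_counts` is a hypothetical helper name, not part of any library.

```python
import re
from collections import Counter

docs = [
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid',
]

# Assumption: mimic the default token pattern (words of 2+ characters).
token_re = re.compile(r'\b\w\w+\b')

def ngram_counts(texts, n_min=1, n_max=2):
    """Count every n-gram with n_min <= n <= n_max across all texts."""
    counts = Counter()
    for text in texts:
        tokens = token_re.findall(text.lower())
        for n in range(n_min, n_max + 1):
            # slide a window of length n over the token list
            for i in range(len(tokens) - n + 1):
                counts[' '.join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(docs)
print(counts['acting'], counts['good acting'], counts['avoid'])
```

Because the function only iterates over its input, a pandas column such as df['description'] can be passed in place of the list.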

Edit

fit will just "train" your vectorizer: it splits your corpus into words and builds a vocabulary from them. transform can then take a new document and create a frequency vector based on the vectorizer's vocabulary.

Here your training set is your output set, so you can do both at the same time (fit_transform). Because you have 5 documents, it creates 5 vectors, stored as the rows of a matrix. You want a single global vector, so you have to sum them.
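
The difference between fit, transform and fit_transform can be sketched as follows. The two-document corpus and the new document are made up for illustration, and the row sum uses the matrix's own sum(axis=0) method rather than the built-in sum from the answer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus, just for illustration.
corpus = ['good movie. Good acting', 'pathetic. Avoid']

vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer='word')

# fit only learns the vocabulary (unigrams and bigrams seen in the corpus).
vectorizer.fit(corpus)

# transform counts the known ngrams in a new document;
# ngrams outside the vocabulary (here "acting good") are ignored.
new_counts = vectorizer.transform(['good acting, good movie'])

# fit_transform does both steps on the same corpus: one row per document.
matrix = vectorizer.fit_transform(corpus)

# Collapse the per-document rows into one global frequency vector.
totals = np.asarray(matrix.sum(axis=0)).ravel()
print(matrix.shape, totals.sum())
```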

Edit 2

For big dataframes, you can speed up the frequency computation by using:

frequencies = sum(sparse_matrix).data
