How to create a corpus from a pandas data frame for use with NLTK

Views: 95 Published: 2020/5/18 1:24:00 Tags: python-3.x pandas nltk


Problem description

Here is my problem:

I have a CSV file containing an article data set with the columns ID, CATEGORY, TITLE, BODY.

In Python, I read the file into a pandas data frame like this:

import pandas as pd
df = pd.read_csv('my_file.csv')

Now I need to somehow transform this df into a corpus object, let's call it my_corpus. But how exactly can I do that? I assume I need to use:

from nltk.corpus.reader import CategorizedCorpusReader
my_corpus = some_nltk_function(df) # <- what is the function?

In the end I want to use NLTK methods to analyze the corpus. For example:

import nltk
my_corpus.fileids() # <- I expect values from column ID
my_corpus.categories() # <- I expect values from column CATEGORY
my_corpus.words(categories='cat_A') # <- I expect values from column TITLE and BODY
my_corpus.sents(categories=['cat_A', 'cat_B', 'cat_C']) # <- I expect values from column TITLE and BODY

Please advise.

Recommended answer

I guess you need to do two things.

First, you need to convert each row of your data frame df into a corpus file. The following function should do it for you:

import os

def CreateCorpusFromDataFrame(corpusfolder, df):
    # Write one text file per row, named <CATEGORY>_<ID>.txt
    os.makedirs(corpusfolder, exist_ok=True)
    for index, r in df.iterrows():
        fname = str(r['CATEGORY']) + '_' + str(r['ID']) + '.txt'
        # Mode 'w' overwrites on re-run; mode 'a' would append duplicate text
        with open(os.path.join(corpusfolder, fname), 'w', encoding='utf-8') as corpusfile:
            corpusfile.write(str(r['BODY']) + " " + str(r['TITLE']))

CreateCorpusFromDataFrame('yourcorpusfolder/', df)
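
To see what this step produces, here is a minimal sketch that applies the same idea to a tiny, made-up data frame inside a temporary folder. The sample rows, the helper name write_corpus_files, and the use of tempfile are assumptions for illustration; only the column names and the CATEGORY_ID.txt naming scheme come from the answer above.

```python
import os
import tempfile

import pandas as pd


def write_corpus_files(corpus_folder, df):
    """Write one <CATEGORY>_<ID>.txt file per row (hypothetical helper)."""
    os.makedirs(corpus_folder, exist_ok=True)
    for _, row in df.iterrows():
        fname = f"{row['CATEGORY']}_{row['ID']}.txt"
        path = os.path.join(corpus_folder, fname)
        # Mode 'w' replaces an existing file, so re-running is safe.
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(f"{row['BODY']} {row['TITLE']}")


# Made-up sample data with the four columns from the question.
sample_df = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "CATEGORY": ["cat_A", "cat_B", "cat_A"],
        "TITLE": ["First title", "Second title", "Third title"],
        "BODY": ["First body.", "Second body.", "Third body."],
    }
)

demo_folder = tempfile.mkdtemp()
write_corpus_files(demo_folder, sample_df)
created_files = sorted(os.listdir(demo_folder))
print(created_files)
```

Each row ends up in its own file, and the category is recoverable from the file name alone, which is what the reader in the next step relies on.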

Second, you need to read the files from yourcorpusfolder and then do whatever NLTK processing you require:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# The category is the part of each file name before the last underscore
my_corpus = CategorizedPlaintextCorpusReader('yourcorpusfolder/', r'.*', cat_pattern=r'(.*)_.*')
my_corpus.fileids() # <- file names such as 'cat_A_1.txt' (CATEGORY_ID.txt), not the bare IDs
my_corpus.categories() # <- values from column CATEGORY
my_corpus.words(categories='cat_A') # <- tokens from columns BODY and TITLE
my_corpus.sents(categories=['cat_A', 'cat_B']) # <- sentences from BODY and TITLE; needs NLTK's Punkt sentence tokenizer data
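
The link between the file names and the categories is the cat_pattern regular expression: NLTK's categorized readers match it against each file id and take the first capture group as the category. The sketch below reproduces that with plain re, so it can be checked without any corpus on disk. The helper names category_of and id_of and the sample file ids are hypothetical. Because (.*) is greedy, the split happens at the last underscore, so a category such as cat_A survives as long as the ID itself contains no underscore.

```python
import re

CAT_PATTERN = r'(.*)_.*'  # same pattern as passed to the reader above


def category_of(fileid):
    """Return the category encoded in a <CATEGORY>_<ID>.txt file id."""
    return re.match(CAT_PATTERN, fileid).group(1)


def id_of(fileid):
    """Return the ID part of the file id (hypothetical helper)."""
    stem = fileid[:-len('.txt')]
    return stem.rsplit('_', 1)[1]


for fileid in ['cat_A_1.txt', 'cat_B_2.txt']:
    print(fileid, category_of(fileid), id_of(fileid))
```

A helper like id_of is one way to get back to the values of the ID column, since fileids() returns file names and not the bare IDs.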

Some useful references:

https://groups.google.com/forum/#!topic/nltk-users/YFCKjHbpUkY
Need to set categorized corpus reader in NLTK and Python, corpus texts in one file, one text per line
