Create a DTM from large corpus


Question

I have a set of texts contained in a list, which I loaded from a csv file:

texts=['this is text1', 'this would be text2', 'here we have text3']

and I would like to create a document-term matrix using stemmed words. I have also stemmed the texts to get:

[['text1'], ['would', 'text2'], ['text3']]
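
For reference, here is one way to produce that nested list — a minimal sketch assuming NLTK's PorterStemmer and a hand-picked stopword set; neither is specified in the question, so both are assumptions:

from nltk.stem import PorterStemmer

stop_words = {'this', 'is', 'be', 'here', 'we', 'have'}  # hypothetical list chosen to match the example
stemmer = PorterStemmer()

stemmed = [[stemmer.stem(w) for w in t.split() if w not in stop_words]
           for t in texts]
print(stemmed)  # [['text1'], ['would', 'text2'], ['text3']]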

What I would like to do is to create a DTM that counts all the stemmed terms (then I would need to do some operations on the rows).

As far as the unstemmed texts are concerned, I am able to make the DTM for short texts by using the function fn_tdm_df reported here. What would be more practical for me, though, is to make a DTM of the stemmed words. Just to be clearer, this is the output I get from applying fn_tdm_df:

    be  have  here   is  text1  text2  text3  this   we  would
0  1.0   1.0   1.0  1.0    1.0    1.0    1.0     1  1.0    1.0
1  0.0   0.0   0.0  0.0    0.0    0.0    0.0     1  0.0    0.0

First, I do not know why I have only two rows instead of three. Second, my desired output would be something like:

  text1  would  text2  text3
0   1      0      0      0
1   0      1      1      0
2   0      0      0      1

I am sorry, but I am really desperate about this output. I also tried to export and re-import the stemmed texts in R, but they do not encode correctly. I would probably need to handle DataFrames, given the huge amount of data. What would you suggest?

----- UPDATE

Using CountVectorizer I am not fully satisfied, as I do not get a tractable matrix in which I can normalize and sum rows/columns easily.

Here is the code I am using, but it is blocking Python (the dataset is too large). How can I run it efficiently?

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(texts)  # X is a scipy sparse matrix
# X.A / X.toarray() densify the sparse matrix -- this is what exhausts memory
print(pd.DataFrame(X.A, columns=vect.get_feature_names()).to_string())
df = pd.DataFrame(X.toarray().transpose(), index=vect.get_feature_names())
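
One way to keep this tractable is to never densify X: the X.A / X.toarray() calls above are what blow up memory on a large corpus. Below is a sketch of doing the sums and normalization directly on the sparse matrix; sklearn's normalize and pandas' sparse accessor are suggestions of mine, not part of the original question:

import numpy as np
from sklearn.preprocessing import normalize

row_sums = np.asarray(X.sum(axis=1)).ravel()  # per-document token counts
col_sums = np.asarray(X.sum(axis=0)).ravel()  # per-term corpus counts

X_norm = normalize(X, norm='l1', axis=1)  # each row sums to 1, still sparse

# A sparse-backed DataFrame view, if one is needed (pandas >= 0.25):
df = pd.DataFrame.sparse.from_spmatrix(X, columns=vect.get_feature_names())

Summing and normalizing this way only touches the stored non-zeros, so it scales to a large corpus.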

Answer

Why don't you use sklearn? The CountVectorizer() method converts a collection of text documents to a matrix of token counts. What's more, it gives a sparse representation of the counts using scipy.

You can either give your raw entries to the method or preprocess them as you have done (stemming + stop-word removal).
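
If you go with the pre-stemmed token lists, CountVectorizer can consume them directly by overriding its analyzer. A minimal sketch — the identity analyzer is my addition, not something the answer spells out:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

stemmed = [['text1'], ['would', 'text2'], ['text3']]

# A callable analyzer bypasses CountVectorizer's own preprocessing and
# tokenization, so each document is taken as the token list it already is.
vect = CountVectorizer(analyzer=lambda doc: doc)
X = vect.fit_transform(stemmed)  # sparse matrix of shape (3, 4)

print(pd.DataFrame(X.toarray(), columns=vect.get_feature_names()))

The columns come out in alphabetical order (text1, text2, text3, would) rather than in the order shown in the question, but the counts match the desired output.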

Check this out: CountVectorizer()
