What is an efficient data structure for tokenized data in Python?

Question

I have a pandas dataframe with a column that contains some text. I want to modify the dataframe so that there is one column for every distinct word that occurs across all rows, holding a boolean that indicates whether that word occurs in that row's value of the text column.

I have some code to do this:

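import numpy as np
import pandas as pd

# Load the tab-separated file; the 'text' column holds the raw strings
a = pd.read_table('file.tsv', sep='\t', index_col=False)
# Count every distinct whitespace-separated word across all rows
b = pd.DataFrame(a['text'].str.split().tolist()).stack().value_counts()

# Add one zero-filled indicator column per distinct word
for i in b.index:
    a[i] = pd.Series(np.zeros(len(a.index)))

# Set the indicator to 1 where the word occurs in that row's text
for i in b.index:
    for j in a.index:
        if i in str.split(a['text'][j]):
            a.loc[j, i] = 1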

However, my dataset is very large (200,000 rows and about 70,000 unique words). Is there a more efficient way to do this that won't destroy my computer?

Recommended answer

I would suggest using sklearn, in particular CountVectorizer.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# binary=True records presence/absence (0/1) rather than occurrence counts
vect = CountVectorizer(binary=True)

df = pd.DataFrame({
    'text': ['cat on the cat', 'angel eyes has', 'blue red angel', 'one two blue',
             'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
    'labels': [1, 0, 1, 1, 0, 0, 1, 1],
})

X = vect.fit_transform(df['text'].values)
y = df['labels'].values
X

<8x16 sparse matrix of type '<type 'numpy.int64'>'
with 23 stored elements in Compressed Sparse Row format>

This returns an m x n sparse matrix, where m is the number of rows in df and n is the number of distinct words. The sparse format saves memory when most elements of the matrix are 0, which is the case here: each row contains only a handful of the words in the vocabulary. Leaving it sparse seems the way to go, and many of the sklearn algorithms accept sparse input.
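To make the memory argument concrete, here is a minimal pure-Python sketch of the sparse idea. It is not how scipy or sklearn store the matrix internally. It only keeps, per row, the words that actually occur and compares that with the size of the full boolean table. The sample texts and the helper name `occurs` are made up for illustration, and the split is on plain whitespace rather than CountVectorizer's tokenizer.

```python
# Hypothetical sample rows, split on whitespace only
texts = ['cat on the cat', 'angel eyes has', 'blue red angel', 'have a cat']

# One set of words per row: only the "True" cells are stored
row_words = [set(t.split()) for t in texts]
vocabulary = sorted(set().union(*row_words))

stored = sum(len(words) for words in row_words)  # cells actually kept
dense = len(texts) * len(vocabulary)             # cells in the full table


def occurs(word, row):
    """Return True if `word` appears in the text of row number `row`."""
    return word in row_words[row]


print(stored, dense)
print(occurs('cat', 0), occurs('cat', 1))
```

With 200,000 rows and about 70,000 distinct words, the full table would have roughly 14 billion cells, while the number of stored entries grows only with the number of words per row.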

You can create a dataframe from X if it is really necessary, but it will be big:

# On scikit-learn older than 1.0, use vect.get_feature_names() instead
word_counts = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
