How to tokenize a list of words using nltk?


Problem description

I have a text dataset. The dataset consists of many lines, and each line consists of two sentences separated by a tab, like this:

this is string 1, first sentence.    this is string 2, first sentence.
this is string 1, second sentence.    this is string 2, second sentence.

I then split the text with this code:

#file readdata.py
from globalvariable import *  # expected to define the global list kalimatayat
import os


class readdata:
    def dataAyat(self):
        global kalimatayat
        fo = open(os.path.join(r'E:\dataset', 'dataset.txt'), "r")
        for line in fo.readlines():
            # first split separates the two tab-delimited sentences
            datatxt = line.rstrip('\n').split('\t')
            # second split wraps each sentence in its own one-element list
            newdatatxt = [x.split('\t') for x in datatxt]
            kalimatayat.append(newdatatxt)
            print(newdatatxt)

readdata().dataAyat()

It works, and the output is:

[['this is string 1, first sentence.'],['this is string 2, first sentence.']]
[['this is string 1, second sentence.'],['this is string 2, second sentence.']]
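The nested single-element lists come from splitting each half a second time: after the first split neither string contains a tab, so the second `split('\t')` simply wraps each sentence in its own list. A minimal sketch of what happens to one line (using a sample line from the data above):

```python
# one raw line from dataset.txt
line = "this is string 1, first sentence.\tthis is string 2, first sentence.\n"

# first split: one string per sentence
datatxt = line.rstrip('\n').split('\t')

# second split: each half no longer contains a tab,
# so split('\t') just wraps it in a one-element list
newdatatxt = [x.split('\t') for x in datatxt]
# newdatatxt == [['this is string 1, first sentence.'],
#                ['this is string 2, first sentence.']]
```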


What I want to do is tokenize those lists using nltk's word tokenizer, and the output I expect looks like this:

[['this' , 'is' , 'string' , '1' , ',' , 'first' , 'sentence' , '.'],['this' , 'is' , 'string' , '2' , ',' , 'first' , 'sentence' , '.']]
[['this' , 'is' , 'string' , '1' , ',' , 'second' , 'sentence' , '.'],['this' , 'is' , 'string' , '2' , ',' , 'second' , 'sentence' , '.']]

Does anybody know how to tokenize them so the output looks like the above? I want to write a tokenize function in "tokenizer.py" and call it from "mainfile.py".

Answer

To tokenize the list of sentences, iterate over it and store the results in a list:

import nltk  # requires the 'punkt' tokenizer models: nltk.download('punkt')

data = [[['this is string 1, first sentence.'], ['this is string 2, first sentence.']],
        [['this is string 1, second sentence.'], ['this is string 2, second sentence.']]]
results = []
for pair in data:
    pair_results = []
    for s in pair:
        # each s is a one-element list, so tokenize its single string
        pair_results.append(nltk.word_tokenize(s[0]))
    results.append(pair_results)

The results will look like this:

[[['this' , 'is' , 'string' , '1' , ',' , 'first' , 'sentence' , '.'],  
  ['this' , 'is' , 'string' , '2' , ',' , 'first' , 'sentence' , '.']], 
[['this' , 'is' , 'string' , '1' , ',' , 'second' , 'sentence' , '.'],
  ['this' , 'is' , 'string' , '2' , ',' , 'second' , 'sentence' , '.']]]
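The question also asks for the tokenization to live in "tokenizer.py" and be called from "mainfile.py". A minimal sketch of that split, with the caveats that the function name `tokenize_pairs` is my own and that the tokenizer is passed in as a callable so the module itself does not hard-code the nltk dependency:

```python
# tokenizer.py -- hypothetical module layout; tokenize_pairs is an illustrative name
def tokenize_pairs(data, tokenizer):
    """Apply `tokenizer` to the single string inside each inner one-element list."""
    return [[tokenizer(s[0]) for s in pair] for pair in data]
```

In "mainfile.py" you would then call it with nltk's tokenizer, e.g. `from tokenizer import tokenize_pairs` followed by `results = tokenize_pairs(data, nltk.word_tokenize)`.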
