How to output NLTK pos_tag as a string instead of a list?
Problem description
I need to run nltk.pos_tag on a large dataset, and I need its output to look like what the Stanford tagger produces.
For example, when I run the following code:
import nltk
text=nltk.word_tokenize("We are going out.Just you and me.")
print(nltk.pos_tag(text))
The output is: [('We', 'PRP'), ('are', 'VBP'), ('going', 'VBG'), ('out.Just', 'IN'), ('you', 'PRP'), ('and', 'CC'), ('me', 'PRP'), ('.', '.')]
Whereas I need it to look like this:
We/PRP are/VBP going/VBG out.Just/NN you/PRP and/CC me/PRP ./.
I would prefer not to use string functions and to get this output directly, because the amount of text is very large and string processing would add a lot to the running time.
Recommended answer
In short:
' '.join([word + '/' + pos for word, pos in tagged_sent])
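As a minimal sketch, the one-liner can be wrapped in a small helper and applied to the tagged list from the question. The name tagged_to_str and its sep parameter are hypothetical, and the tagged sentence is hard-coded here (copied from the question's output) so the snippet runs without the tagger model:

```python
def tagged_to_str(tagged_sent, sep="/"):
    """Join (word, tag) pairs into a 'word/TAG word/TAG ...' string."""
    return " ".join(word + sep + pos for word, pos in tagged_sent)


# Hard-coded copy of the pos_tag output shown in the question
# (an assumption made so the example does not need NLTK installed).
tagged_sent = [
    ("We", "PRP"), ("are", "VBP"), ("going", "VBG"), ("out.Just", "IN"),
    ("you", "PRP"), ("and", "CC"), ("me", "PRP"), (".", "."),
]

result = tagged_to_str(tagged_sent)
print(result)
```

In real use, tagged_sent would simply be the return value of nltk.pos_tag(...).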
In long:
I think you're overthinking the cost of using string functions to concatenate the strings; it's really not that expensive.
import time
from nltk.corpus import brown

tagged_corpus = brown.tagged_sents()
start = time.time()
# Write one "word/TAG word/TAG ..." line per sentence.
with open('output.txt', 'w') as fout:
    for i, sent in enumerate(tagged_corpus):
        print(' '.join([word + '/' + pos for word, pos in sent]), end='\n', file=fout)
end = time.time() - start  # elapsed time in seconds
print(i, end)
It took 2.955 seconds on my laptop to write out all 57339 sentences from the Brown corpus.
[out]:
$ head -n1 output.txt
The/AT Fulton/NP-TL County/NN-TL Grand/JJ-TL Jury/NN-TL said/VBD Friday/NR an/AT investigation/NN of/IN Atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.
But concatenating the word and POS tag into a string can cause trouble later on when you need to read your tagged output back, e.g. when a token itself contains the separator:
>>> from nltk import pos_tag
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> tagged_sent_str = ' '.join([word + '/' + pos for word, pos in tagged_sent])
>>> tagged_sent_str
'cat/NN //CD dog/NN'
>>> [tuple(wordpos.split('/')) for wordpos in tagged_sent_str.split()]
[('cat', 'NN'), ('', '', 'CD'), ('dog', 'NN')]
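If the string format has to be kept, one way to soften this particular problem is to split each token on the last separator only. The following is a rough sketch under that assumption, not a full fix; the helper name parse_tagged_str is hypothetical:

```python
def parse_tagged_str(tagged_str, sep="/"):
    """Turn 'word/TAG word/TAG ...' back into (word, tag) tuples.

    Each token is split on the LAST occurrence of sep, so a word that
    itself contains sep (such as '/') keeps its characters.
    """
    return [tuple(token.rsplit(sep, 1)) for token in tagged_str.split()]


tagged_sent_str = "cat/NN //CD dog/NN"
parsed = parse_tagged_str(tagged_sent_str)
print(parsed)
```

This still breaks if a tag contains the separator or a word contains whitespace, so it does not make the string format fully round-trip safe. NLTK also provides nltk.tag.str2tuple, which likewise splits on the last separator.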
If you want to save the tagged output and read it back later, it's better to use pickle to save it, e.g.
>>> import pickle
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> with open('tagged_sent.pkl', 'wb') as fout:
...     pickle.dump(tagged_sent, fout)
...
>>> tagged_sent = None
>>> tagged_sent
>>> with open('tagged_sent.pkl', 'rb') as fin:
...     tagged_sent = pickle.load(fin)
...
>>> tagged_sent
[('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]