Extracting a word and its prior 10 word context to a dataframe in Python
Problem description
I'm fairly new to Python (2.7), so forgive me if this is a ridiculously straightforward question. I wish (i) to extract all the words ending in -ing from a text that has been tokenized with the NLTK library and (ii) to extract the 10 words preceding each word thus extracted. I then wish (iii) to save these to file as a dataframe of two columns that might look something like:
Word          PreviousContext
starting      stood a moment, as if in a troubled reverie; then
seeming       of it retraced our steps. But Elijah passed on, without
purchasing    a sharp look-out upon the hands: Bildad did all the
I know how to do (i), but am not sure how to go about doing (ii)-(iii). Any help would be greatly appreciated and acknowledged. So far I have:
>>> import bs4
>>> import nltk
>>> from nltk import word_tokenize
>>> url = "http://www.gutenberg.org/files/766/766-h/766-h.htm"
>>> import urllib
>>> response = urllib.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> tokens = word_tokenize(raw)
>>> for w in tokens:
...     if w.endswith("ing"):
...         print(w)
...
padding
padding
encoding
having
heading
wearying
dismissing
going
nothing
reading etc etc etc..
Recommended answer
After the line:
>>> tokens = word_tokenize(raw)
use the code below to generate the words with their context:
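>>> context = {}
>>> for i, w in enumerate(tokens):
...     if w.endswith("ing"):
...         # Caution: tokens[i:i+10] is the word itself plus the 9 tokens that follow it.
...         # For the 10 tokens *before* the word, slice tokens[max(0, i - 10):i] instead.
...         # Slicing past the end of a list never raises, so no try/except is needed.
...         context[w] = tokens[i:i+10]
...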
>>> fp = open('dataframes', 'w')  # save results in this file
>>> fp.write('Word' + '\t\t' + 'PreviousContext\n')
>>> for word in context:
...     fp.write(word + '\t\t' + ' '.join(context[word]) + '\n')
...
>>> fp.close()
>>> fp = open('dataframes', 'r')
>>> for line in fp.readlines()[:10]:  # first 10 lines of generated file
...     print line
...
Word PreviousContext
raining raining , and I saw more fog and mud in
bidding bidding him good night , if he were yet sitting
growling growling old Scotch Croesus with great flaps of ears ?
bright-looking bright-looking bride , I believe ( as I could not
hanging hanging up in the shop&mdash ; went down to look
scheming scheming and devising opportunities of being alone with her .
muffling muffling her hands in it , in an unsettled and
bestowing bestowing them on Mrs. Gummidge. She was with him all
adorning adorning , the perfect simplicity of his manner , brought
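Two things to note:
- NLTK treats punctuation marks as separate tokens, so each punctuation mark counts as a word in the context.
- I used a dictionary to store the words with their context, so the order of the words is not preserved. Every distinct word is present, but a word that occurs more than once keeps only the context of its last occurrence.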
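If every occurrence should be kept in text order, together with the tokens that come before it (which is what the question asks for), a list of pairs may be a better fit than a dictionary. The following is only a minimal sketch, assuming Python 3 and the standard library; the helper names `ing_contexts` and `write_tsv`, the sample token list, and the output file name are all made up for illustration and are not part of the original answer.

```python
import csv
import os
import tempfile


def ing_contexts(tokens, window=10, suffix="ing"):
    """Return (word, previous_context) pairs, one per occurrence, in text order."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.endswith(suffix):
            # max() keeps the slice start at 0 near the beginning of the text;
            # a negative start would otherwise count from the end of the list.
            previous = tokens[max(0, i - window):i]
            rows.append((tok, " ".join(previous)))
    return rows


def write_tsv(rows, path):
    """Write the pairs as a two-column, tab-separated file with a header row."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["Word", "PreviousContext"])
        writer.writerows(rows)


# Tiny hand-made token list standing in for word_tokenize(raw).
sample_tokens = "He stood a moment , then starting up he kept walking and walking".split()
sample_rows = ing_contexts(sample_tokens, window=3)

# Hypothetical output location, chosen only for this example.
out_path = os.path.join(tempfile.gettempdir(), "ing_contexts.tsv")
write_tsv(sample_rows, out_path)
```

With the real data, `ing_contexts(tokens)` would use the default window of 10. If pandas is installed, `pandas.DataFrame(rows, columns=["Word", "PreviousContext"])` should build the same table as a DataFrame, and `DataFrame.to_csv(path, sep="\t", index=False)` can write it to file.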