Extracting a word and its prior 10 word context to a dataframe in Python


Question

I'm fairly new to Python (2.7), so forgive me if this is a ridiculously straightforward question. I wish (i) to extract all the words ending in -ing from a text that has been tokenized with the NLTK library and (ii) to extract the 10 words preceding each word thus extracted. I then wish (iii) to save these to file as a dataframe of two columns that might look something like:

Word        PreviousContext 
starting    stood a moment, as if in a troubled reverie; then
seeming     of it retraced our steps. But Elijah passed on, without
purchasing  a sharp look-out upon the hands: Bildad did all the

I know how to do (i), but am not sure how to go about doing (ii)-(iii). Any help would be greatly appreciated and acknowledged. So far I have:

>>> import bs4 
>>> import nltk
>>> from nltk import word_tokenize
>>> url = "http://www.gutenberg.org/files/766/766-h/766-h.htm"
>>> import urllib
>>> response = urllib.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> tokens = word_tokenize(raw)
>>> for w in tokens:
...     if w.endswith("ing"):
...             print(w)
... 
padding
padding
encoding
having
heading
wearying
dismissing
going
nothing
reading etc etc etc.. 

Recommended answer

After the line:

>>> tokens = word_tokenize(raw)

use the code below to collect each word with its context:

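>>> context = {}
>>> for i, w in enumerate(tokens):
...     if w.endswith("ing"):
...         # tokens[i:i+10] is the matched word plus the nine tokens after it;
...         # for the ten tokens before it, use tokens[max(0, i-10):i] instead.
...         # Slicing never raises near the ends of the list, so no try/except is needed.
...         context[w] = tokens[i:i+10]
... 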
>>> fp=open('dataframes','w')   # save results in this file
>>> fp.write('Word'+'\t\t'+'PreviousContext\n')
>>> for word in context:
...    fp.write(word+'\t\t'+' '.join(context[word])+'\n')
... 
>>> fp.close()
>>> fp=open('dataframes','r')  
>>> for line in fp.readlines()[:10]: # first 10 lines of generated file
...    print line
... 
Word                PreviousContext
raining             raining , and I saw more fog and mud in
bidding             bidding him good night , if he were yet sitting
growling            growling old Scotch Croesus with great flaps of ears ?
bright-looking      bright-looking bride , I believe ( as I could not
hanging             hanging up in the shop&mdash ; went down to look
scheming            scheming and devising opportunities of being alone with her .
muffling            muffling her hands in it , in an unsettled and
bestowing           bestowing them on Mrs. Gummidge. She was with him all
adorning            adorning , the perfect simplicity of his manner , brought
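
The question asks for the ten words before each match, whereas the listing above shows each word followed by the tokens after it. A minimal sketch of the preceding-context variant follows, assuming Python 3 and a list that has already been tokenized. The names `preceding_contexts`, `write_tsv` and `sample_tokens` are made up for illustration and are not part of the original answer.

```python
import csv
import io


def preceding_contexts(tokens, suffix="ing", window=10):
    """Return (word, context) pairs for every token ending in `suffix`.

    The context is the `window` tokens immediately before the word,
    or fewer when the word sits near the start of the text.
    """
    rows = []
    for i, tok in enumerate(tokens):
        if tok.endswith(suffix):
            # max(0, ...) stops the slice start from going negative,
            # which would otherwise count from the end of the list.
            context = tokens[max(0, i - window):i]
            rows.append((tok, " ".join(context)))
    return rows


def write_tsv(rows, handle):
    """Write the pairs as a two-column, tab-separated table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Word", "PreviousContext"])
    writer.writerows(rows)


# Hypothetical, hand-tokenized sample standing in for word_tokenize(raw).
sample_tokens = (
    "Nothing stirred ; he stood a moment , as if in a "
    "troubled reverie ; then starting"
).split()

rows = preceding_contexts(sample_tokens)
buffer = io.StringIO()  # swap for open("contexts.tsv", "w", newline="") to save to disk
write_tsv(rows, buffer)
print(buffer.getvalue())
```

A list of pairs is used here rather than a dictionary so that repeated words and their order of appearance are kept. A tab-separated file like this can then be loaded into a dataframe, for example with `pandas.read_csv(path, sep="\t")`.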

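Two things to note:

  1. NLTK treats punctuation marks as separate tokens, so each punctuation mark counts as a word in the context.
  2. I used a dictionary to store the words with their context, so the order of the words is not preserved. Every distinct word is present with a context, but a word that occurs more than once keeps only the context of its last occurrence.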
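
Both notes can be worked around if needed. The sketch below, assuming Python 3, drops punctuation-only tokens before counting the window and keeps a list of contexts per word so that no occurrence is overwritten. The helpers `is_word` and `contexts_by_word` and the sample sentence are illustrative assumptions, not part of the original answer.

```python
from collections import defaultdict


def is_word(token):
    """True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


def contexts_by_word(tokens, suffix="ing", window=10):
    """Map each matching word to the contexts of all its occurrences."""
    # Drop punctuation-only tokens so they do not use up window slots.
    words = [t for t in tokens if is_word(t)]
    found = defaultdict(list)
    for i, w in enumerate(words):
        if w.endswith(suffix):
            # One entry per occurrence, in order of appearance.
            found[w].append(" ".join(words[max(0, i - window):i]))
    return found


# Hypothetical, hand-tokenized sample standing in for word_tokenize(raw).
sample = "it was raining , and then it was raining again .".split()
result = contexts_by_word(sample)
for word, contexts in result.items():
    for ctx in contexts:
        print(word + "\t" + ctx)
```

Filtering on `isalnum` is a deliberately crude choice: it keeps hyphenated tokens such as `bright-looking`, but it also keeps markup fragments such as `&mdash`, so HTML would still need to be stripped before tokenizing.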
