How to output NLTK chunks to file?


Problem description



I have this Python script where I am using the nltk library to parse, tokenize, tag and chunk some, let's say, random text from the web.

I need to format and write the output of chunked1, chunked2 and chunked3 to a file. Their type is class 'nltk.tree.Tree'.

More specifically, I need to write only the lines that match the regular expressions chunkGram1, chunkGram2 and chunkGram3.

How can I do that?

#! /usr/bin/python2.7

import nltk
import re
import codecs

xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]


def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged

        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""
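        # <JJ\w?> matches the tag JJ plus at most one extra word character
        # (JJ/JJR/JJS), and <NNP\w?> likewise covers NNP/NNPS; each rule chunks
        # zero or more such modifiers followed by a noun tag (NN or NNS)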

        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)

        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)

        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)

        #print chunked1
        #print chunked2
        #print chunked3

        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:

            # for i,line in enumerate(chunked1):
                # if "JJ" in line:
                    # outfile.write(line)
                # elif "NNP" in line:
                    # outfile.write(line)



processLanguage()

For the time being, when I try to run it I get this error:

`Traceback (most recent call last):
  File "sentdex.py", line 47, in <module>
    processLanguage()
  File "sentdex.py", line 40, in processLanguage
    outfile.write(line)
  File "C:\Python27\lib\codecs.py", line 688, in write
    return self.writer.write(data)
  File "C:\Python27\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
TypeError: coercing to Unicode: need string or buffer, tuple found`
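
For reference, the TypeError happens because iterating over an nltk.tree.Tree yields Tree and (word, tag) tuple nodes, not strings, and the codecs writer expects a string. A minimal sketch of writing only the matched chunks to a file, assuming NLTK 3 (where a subtree's label is read with Tree.label()), could look like this:

import codecs

def write_chunks(tree, path):
    # keep only the subtrees the grammar matched, i.e. those labelled "Chunk"
    with codecs.open(path, 'w', encoding='utf8') as outfile:
        for subtree in tree.subtrees():
            if subtree.label() == 'Chunk':
                # leaves() yields (word, tag) pairs; join the words back into text
                outfile.write(' '.join(word for word, tag in subtree.leaves()) + '\n')

Something like write_chunks(chunked1, 'output.txt') would then dump every matched chunk on its own line.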

Edit: after @Alvas's answer I managed to do what I wanted. However, now I would like to know how I could strip all non-ASCII characters from a text corpus. Example:

#store cleaned file into variable
with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()
infile.close

    def remove_non_ascii(line):
        return ''.join([i if ord(i) < 128 else ' ' for i in line])

    for i, line in enumerate(xstring):
        line = remove_non_ascii(line)

#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged
processLanguage()

The above is taken from another answer here on S/O. However, it doesn't seem to work. What might be wrong? The error I am getting is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
...: ordinal not in range(128)

Solution

Your code has several problems, though the main culprit is that your for loop does not modify the contents of xstring.

I will address all the issues in your code here:

You cannot write paths like this with single \, as \t will be interpreted as a tab character, and \f as a form feed character. You must double the backslashes. I know it was an example here, but such confusion often arises:

with open('path\\to\\file.txt', 'r') as infile:
    xstring = infile.readlines()
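
Alternatively, a raw string switches off backslash escaping altogether:

with open(r'path\to\file.txt', 'r') as infile:  # r'' prefix: backslashes stay literal
    xstring = infile.readlines()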

The following infile.close line is wrong. It does not call the close method; it does not actually do anything. Furthermore, your file was already closed by the with statement. If you see this line in any answer anywhere, please just downvote the answer outright, with a comment saying that file.close is wrong and should be file.close().
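
To spell out the difference:

f = open('somefile.txt')
f.close    # merely looks up the bound method and throws it away; the file stays open
f.close()  # actually calls the method and closes the file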

The following should work, but you need to be aware that replacing every non-ASCII character with ' ' will break words such as naïve and café:

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])
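
For example, fed a unicode string (so that Python 2's byte/character distinction does not get in the way):

>>> remove_non_ascii(u'naïve café')
u'na ve caf '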

But here is the reason why your code fails with a Unicode exception: you are not modifying the elements of xstring at all. That is, you are computing the line with non-ASCII characters removed, yes, but that is a new value, which is never stored back into the list:

for i, line in enumerate(xstring):
    line = remove_non_ascii(line)

Instead it should be:

for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)

or my preferred, very Pythonic version:

xstring = [remove_non_ascii(line) for line in xstring]


These Unicode errors occur mainly because you are using Python 2.7 to handle pure Unicode text, something at which recent Python 3 versions are way ahead. Thus I'd recommend that, if you are at the very beginning of your task, you upgrade to Python 3.4+ soon.
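
For comparison, a rough Python 3 sketch of the same cleanup (same hypothetical file path): open() takes an encoding directly and every string is already Unicode, so the decoding is handled once, at the boundary:

with open('path\\to\\file.txt', 'r', encoding='utf8') as infile:
    xstring = [remove_non_ascii(line) for line in infile]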
