How to read a corpus of parsed sentences using NLTK in Python?

Question

I am working with the BLLIP 1987-89 WSJ Corpus Release 1 (https://catalog.ldc.upenn.edu/LDC2000T43).

I am trying to use NLTK's SyntaxCorpusReader class to read in the parsed sentences. I'm trying to get it to work with a simple example of just 1 file. Here is my code...

from nltk.corpus.reader import SyntaxCorpusReader

path = '/corpus/wsj'
filename = 'wsj1'
reader = SyntaxCorpusReader(path, filename)

I am able to see the raw text from the file. It returns a string of the parsed sentences.

reader.raw()
u"(S1 (S (PP-LOC (IN In)\n\t(NP (NP (DT a) (NN move))\n\t (SBAR (WHNP#0 (WDT that))\n\t  (S (NP-SBJ (-NONE- *T*-0))\n\t   (VP (MD would)\n\t    (VP (VB represent)\n\t     (NP (NP (DT a) (JJ major) (NN break))\n\t      (PP (IN with) (NP (NN tradition))))\n\t     (PP-LOC (IN in)\n\t      (NP#1004 (DT the) (JJ legal) (NN profession)))))))))\n     (, ,)\n     (NP-SBJ#1005 (NP (NN law) (NNS firms))\n      (PP-LOC (IN in) (NP#1006 (DT this) (NN city))))\n     (VP (MD may)\n      (VP (VB become)\n       (NP (NP (DT the) (JJ first))\n\t(PP-LOC (IN in) (NP (DT the) (NN nation)))\n\t(SBAR (WHNP#1 (-NONE- 0))\n\t (S (NP-SBJ (-NONE- *T*-1))\n\t  (VP (TO to)\n\t   (VP (VB reward)\n\t    (NP#1009 (NNS non-lawyers))\n\t    (PP-MNR-CLR (IN with)\n\t     (NP#1010 (NP (DT the) (VBN cherished) (NN title))\n\t      (PP (IN of) (NP (NN partner))))))))))))\n     (. .)))\n...'

But when I try to get the parsed sentences, I receive an error.

reader.parsed_sents()
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/nltk/compat.py", line 487, in wrapper
    return method(self).encode('ascii', 'backslashreplace')
  File "/usr/lib/python2.7/dist-packages/nltk/util.py", line 664, in __repr__
    for elt in self:
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 430, in _read_parsed_sent_block
    return list(filter(None, [self._parse(t) for t in self._read_block(stream)]))
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 378, in _read_block
    raise NotImplementedError()
NotImplementedError

I'm not sure what the issue is. My goal was to read in the parsed sentences and use NLTK's tree class to extract the text of the sentences, and perhaps navigate the tree structure.

Recommended answer

Hah, had me going for a while there. That NotImplementedError is not a bug; it's NLTK's way of telling you that you're using an incomplete class. SyntaxCorpusReader is an "abstract class", intended as a base for readers of corpora with a specific, complex syntax, and it leaves the block-reading step (the _read_block call at the bottom of your traceback) for subclasses to implement. In your case, you just need to use BracketParseCorpusReader instead:

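from nltk.corpus.reader import BracketParseCorpusReader

# Same root directory and file name as in the question.
reader = BracketParseCorpusReader('/corpus/wsj', 'wsj1')
print(reader.parsed_sents()[0])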
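
For the stated goal of pulling out the sentence text and walking the tree, something along these lines may work. Treat it as a minimal sketch that has not been run against the BLLIP data. SAMPLE is a shortened, hypothetical stand-in for one parse from the file, the temporary directory stands in for '/corpus/wsj', and the helper names load_trees, sentence_words and noun_phrases are made up for illustration. It also assumes you want to skip the -NONE- empty elements when rebuilding the text.

```python
import os
import tempfile

from nltk.corpus.reader import BracketParseCorpusReader

# Hypothetical, shortened stand-in for one bracketed parse in the corpus file.
# Continuation lines are indented, as in the raw() output shown in the question.
SAMPLE = """(S1 (S (NP-SBJ#1005 (NN law) (NNS firms))
     (VP (MD may)
      (VP (VB become)
       (NP (NP (DT the) (JJ first))
        (SBAR (WHNP#1 (-NONE- 0))
         (S (NP-SBJ (-NONE- *T*-1))
          (VP (TO to)
           (VP (VB reward)
            (NP#1009 (NNS non-lawyers)))))))))
     (. .)))
"""


def load_trees(text):
    """Write `text` to a temporary file named 'wsj1' and read it back as trees."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "wsj1"), "w", encoding="utf-8") as handle:
            handle.write(text)
        reader = BracketParseCorpusReader(root, "wsj1")
        # parsed_sents() is a lazy view, so materialize it before the
        # temporary directory is removed.
        return list(reader.parsed_sents())


def sentence_words(tree):
    """Return the words of a tree, skipping leaves tagged -NONE-."""
    return [word for word, tag in tree.pos() if tag != "-NONE-"]


def noun_phrases(tree):
    """Yield (label, leaves) for every subtree whose label starts with 'NP'."""
    for subtree in tree.subtrees(filter=lambda t: t.label().startswith("NP")):
        yield subtree.label(), subtree.leaves()


if __name__ == "__main__":
    tree = load_trees(SAMPLE)[0]
    print(" ".join(sentence_words(tree)))
    for label, leaves in noun_phrases(tree):
        print(label, leaves)
```

With the real corpus you would skip load_trees and call sentence_words(tree) on each tree from reader.parsed_sents(). Each item is an nltk Tree, so leaves(), pos(), label() and subtrees() are available for navigating it.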
