Can NLTK's XMLCorpusReader be used on a multi-file corpus?


Question

I'm trying to use NLTK to do some work on the New York Times Annotated Corpus, which contains one XML file per article (in the News Industry Text Format, NITF).

I can parse individual documents with no problem, like so:

from nltk.corpus.reader import XMLCorpusReader
reader = XMLCorpusReader('nltk_data/corpora/nytimes/1987/01/01', r'0000000.xml')

I need to work on the whole corpus, though. I tried doing this:

reader = XMLCorpusReader('corpora/nytimes', r'.*')

but this doesn't create a usable reader object. For instance,

len(reader.words())

returns:

raise TypeError('Expected a single file identifier string')
TypeError: Expected a single file identifier string

How do I read this corpus into NLTK?

I'm new to NLTK, so any help is greatly appreciated.

Recommended answer

I'm no NLTK expert, so there may be an easier way to do this, but naively I would suggest that you use Python's glob module. It supports Unix-style pathname pattern expansion.

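from glob import glob
# '**' with recursive=True descends into the nested year/month/day directories
texts = glob('nltk_data/corpora/nytimes/**/*.xml', recursive=True)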

That gives you a list of the file names matching the specified expression. Then, depending on how many of them you want or need to have open at once, you could do:

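import os
from nltk.corpus.reader import XMLCorpusReader
root = 'nltk_data/corpora/nytimes/'
for item_path in texts:
    # File ids are resolved relative to the corpus root, so strip the root prefix
    reader = XMLCorpusReader(root, [os.path.relpath(item_path, root)])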
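As an alternative to creating one reader per file, a single reader can cover the whole tree as long as each call to words() names one file. The following is only a sketch: the helper name words_per_file and the default pattern are made up for illustration, and it assumes every article file ends in .xml.

```python
from nltk.corpus.reader import XMLCorpusReader


def words_per_file(root, pattern=r'.*\.xml'):
    """Yield (fileid, word_count) for every XML file found under root.

    Hypothetical helper, not part of NLTK.
    """
    # One reader for the whole directory tree; the pattern is a regular
    # expression matched against paths relative to root.
    reader = XMLCorpusReader(root, pattern)
    for fileid in reader.fileids():
        # words() accepts exactly one file, so pass each fileid explicitly
        # instead of calling it with no argument on a multi-file reader.
        yield fileid, len(reader.words(fileid))
```

Something like sum(n for _, n in words_per_file('nltk_data/corpora/nytimes')) would then give a total word count for the corpus, at the cost of parsing every file once.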

As suggested by @waffle paradox, you can also whittle this list of texts down to suit your specific needs.
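One way to whittle the list down is to filter on the directory structure. The sketch below assumes the year/month/day layout seen in the question's path; select_by_year is a hypothetical helper, and the second file name in the sample list is invented for the example.

```python
import os


def select_by_year(paths, root, year):
    """Keep only the paths whose first directory below root equals year."""
    selected = []
    for path in paths:
        # Path relative to the corpus root, e.g. '1987/01/01/0000000.xml'
        rel = os.path.relpath(path, root)
        first_dir = rel.split(os.sep)[0]
        if first_dir == str(year):
            selected.append(path)
    return selected


# Sample list standing in for the result of glob(); the 1988 entry is made up.
texts = [
    'nltk_data/corpora/nytimes/1987/01/01/0000000.xml',
    'nltk_data/corpora/nytimes/1988/01/01/0000001.xml',
]
print(select_by_year(texts, 'nltk_data/corpora/nytimes', 1987))
```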
