Can NLTK's XMLCorpusReader be used on a multi-file corpus?


Question

I'm trying to use NLTK to do some work on the New York Times Annotated Corpus, which contains an XML file for each article (in the News Industry Text Format, NITF).

I can parse individual documents with no problem like so:

from nltk.corpus.reader import XMLCorpusReader
reader = XMLCorpusReader('nltk_data/corpora/nytimes/1987/01/01', r'0000000.xml')

I need to work on the whole corpus though. I tried doing this:

reader = XMLCorpusReader('corpora/nytimes', r'.*')

but this doesn't create a usable reader object. For instance,

len(reader.words())

fails with:

raise TypeError('Expected a single file identifier string')
TypeError: Expected a single file identifier string

How do I read this corpus into NLTK?

I'm new to NLTK so any help is greatly appreciated.

Recommended answer

I'm no NLTK expert, so there may be an easier way to do this, but naively I would suggest that you use Python's glob module. It supports Unix-style pathname pattern expansion.

from glob import glob
# '**' with recursive=True descends into the year/month/day subdirectories
texts = glob('nltk_data/corpora/nytimes/**/*.xml', recursive=True)

That gives you a list of the file names matching the pattern. Then, depending on how many of them you want or need to have open at once, you could do:

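import os
from nltk.corpus.reader import XMLCorpusReader
root = 'nltk_data/corpora/nytimes/'
for item_path in texts:
    # fileids are resolved relative to root, so strip the root prefix first
    reader = XMLCorpusReader(root, [os.path.relpath(item_path, root)])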
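
The loop above rebinds reader on every pass, so any per-file work has to happen inside it. As an alternative, here is a minimal sketch that builds one reader for the whole corpus and passes each file identifier explicitly. It assumes the TypeError in the question comes from calling words() with no file identifier on a reader that holds more than one file. The helper name count_words_per_file and the default pattern r'.*\.xml' are illustrative choices, not part of NLTK.

```python
from nltk.corpus.reader import XMLCorpusReader


def count_words_per_file(root, pattern=r'.*\.xml'):
    """Return {fileid: token count} for every XML file found under root.

    count_words_per_file is a hypothetical helper, not an NLTK function.
    """
    # A string passed as fileids is treated as a regular expression that is
    # matched against file paths relative to root, subdirectories included.
    reader = XMLCorpusReader(root, pattern)
    counts = {}
    for fileid in reader.fileids():
        # Passing exactly one fileid per call sidesteps the
        # "Expected a single file identifier string" error.
        counts[fileid] = len(reader.words(fileid))
    return counts
```

Calling count_words_per_file('nltk_data/corpora/nytimes') and summing the values would then give a corpus-wide token count, at the cost of parsing every file once.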

As suggested by @waffle paradox, you can also whittle this list of texts down to suit your specific needs.
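
One way to whittle the list down is to filter on the date directories in each path. This is only a sketch: it assumes the <year>/<month>/<day>/<file>.xml layout seen in the question's example path and forward-slash separators, and filter_by_date is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def filter_by_date(paths, year, month=None):
    """Keep paths that sit under the given year (and, optionally, month).

    Assumes each path ends in <year>/<month>/<day>/<file>.xml.
    filter_by_date is a hypothetical helper used for illustration.
    """
    kept = []
    for path in paths:
        parts = PurePosixPath(path).parts
        if len(parts) < 4:
            continue  # too shallow to carry a year/month/day prefix
        file_year, file_month = parts[-4], parts[-3]
        if file_year == year and (month is None or file_month == month):
            kept.append(path)
    return kept
```

For example, filter_by_date(texts, '1987', '01') would keep only the articles from January 1987.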
