NLP: Building (small) corpora, or "Where to get lots of not-too-specialized English-language text files?"

Question

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of Usenet movie reviews, which hadn't occurred to me and is very good. For this particular program, technical Usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, chat transcripts, or anything that has been useful to others would be very helpful. Also, a partial or downloadable research corpus that isn't too marked up, some heuristic for finding an appropriate subset of Wikipedia articles, or any other idea would be much appreciated.

(BTW, I am being a good citizen w/r/t downloading, using a deliberately slow script that is not demanding on servers hosting such material, in case you perceive a moral hazard in pointing me to something enormous.)

UPDATE: User S0rin points out that Wikipedia asks not to be crawled and provides this export tool instead. Project Gutenberg has a policy, specified here; the bottom line is to avoid crawling, but if you must: "Configure your robot to wait at least 2 seconds between requests."
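
A minimal sketch of such a polite fetcher, assuming a hypothetical list of plain-text URLs; the 2-second floor simply mirrors the guideline quoted above:

```python
# Minimal sketch of a polite downloader: fetch a handful of plain-text URLs
# sequentially and sleep between requests so the host is never hammered.
# The URL list and output directory are placeholders.
import time
import urllib.request
from pathlib import Path

URLS = [
    "https://example.org/texts/sample1.txt",  # hypothetical sources
    "https://example.org/texts/sample2.txt",
]
OUT_DIR = Path("corpus_raw")
MIN_DELAY_SECONDS = 2.0  # "wait at least 2 seconds between requests"

def fetch_politely(urls, out_dir, delay=MIN_DELAY_SECONDS):
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(urls):
        if i:  # no need to wait before the first request
            time.sleep(delay)
        with urllib.request.urlopen(url, timeout=30) as resp:
            (out_dir / f"doc_{i:05d}.txt").write_bytes(resp.read())

if __name__ == "__main__":
    fetch_politely(URLS, OUT_DIR)
```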

UPDATE 2: The Wikipedia dumps are the way to go, thanks to the answerers who pointed them out. I ended up using the English version from here: http://download.wikimedia.org/enwiki/20090306/ , and a Spanish dump about half the size. They take some work to clean up, but are well worth it, and they contain a lot of useful data in the links.
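
For what it's worth, a rough sketch of a first cleanup pass, assuming a bz2-compressed pages-articles dump; the file name and regexes below are illustrative only, and a real pass needs considerably more work (templates, tables, references, infoboxes):

```python
# Rough sketch: stream raw article wikitext out of a pages-articles XML dump
# and apply a first-pass markup cleanup. File name and regexes are
# illustrative; real dumps need much more thorough cleaning.
import bz2
import re
import xml.etree.ElementTree as ET

def iter_article_text(dump_path):
    """Yield the raw wikitext of each page in a bz2-compressed dump."""
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            # The export namespace varies by dump version, so match on the
            # local tag name rather than a hard-coded namespace URI.
            if elem.tag.rsplit("}", 1)[-1] == "text" and elem.text:
                yield elem.text
            elem.clear()  # drop parsed content we no longer need

def crude_clean(wikitext):
    """Very rough wiki-markup stripping -- good enough for a first pass."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # simple {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] links
    text = re.sub(r"'{2,}", "", text)                              # bold/italic markup
    text = re.sub(r"<[^>]+>", "", text)                            # leftover HTML tags
    return text

if __name__ == "__main__":
    for raw in iter_article_text("enwiki-20090306-pages-articles.xml.bz2"):
        print(crude_clean(raw)[:200])
        break  # just show the first article
```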

Answer

  • Use the Wikipedia dumps
    • needs lots of cleanup
  • See if anything in nltk-data helps (a short sketch follows this list)
    • the corpora are usually quite small
  • The WaCky people have some free corpora
    • tagged
    • you can spider your own corpus using their toolkit
  • Europarl is free
    • spoken language, translated
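
If nltk-data is worth a look, here is a quick, hedged sketch of peeking at one of its bundled corpora (the Brown corpus is just an example; the others work the same way):

```python
# Quick sketch: download one bundled nltk-data corpus and see how much text
# it actually provides. Brown is only an example corpus.
import nltk

nltk.download("brown", quiet=True)   # fetches into the local nltk_data directory
from nltk.corpus import brown

print(len(brown.words()))            # roughly 1.2M tokens -- "usually quite small"
print(" ".join(brown.words()[:40]))  # a peek at the running text
```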

You can always get your own, but be warned: HTML pages often need heavy cleanup, so restrict yourself to RSS feeds.
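
A minimal sketch of that route, assuming ordinary RSS 2.0 feeds (the feed URL is a placeholder): take each entry's title and description and strip whatever HTML is embedded in them before the text goes into the corpus.

```python
# Minimal sketch: turn RSS 2.0 entries into plain-text snippets. The feed URL
# is a placeholder; Atom feeds would need slightly different element names.
import re
import urllib.request
import xml.etree.ElementTree as ET
from html import unescape

def rss_items_as_text(feed_url):
    with urllib.request.urlopen(feed_url, timeout=30) as resp:
        root = ET.fromstring(resp.read())
    for item in root.iter("item"):                       # RSS 2.0 entries
        title = item.findtext("title") or ""
        desc = item.findtext("description") or ""
        text = re.sub(r"<[^>]+>", " ", f"{title}. {desc}")  # strip embedded HTML
        yield re.sub(r"\s+", " ", unescape(text)).strip()

if __name__ == "__main__":
    for snippet in rss_items_as_text("https://example.com/feed.xml"):
        print(snippet)
```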

If you do this commercially, the LDC might be a viable alternative.
