NLP: Building (small) corpora, or "Where to get lots of not-too-specialized English-language text files?"

Problem description

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Gutenberg Project books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of usenet movie reviews, which hadn't occurred to me, and is very good. For this particular program technical usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, or chat transcripts, or anything that may have been useful to others, would be very helpful. Also, a partial or downloadable research corpus that isn't too marked-up, or some heuristic for finding an appropriate subset of wikipedia articles, or any other idea, is very appreciated.

(BTW, I am being a good citizen w/r/t downloading, using a deliberately slow script that is not demanding on servers hosting such material, in case you perceive a moral hazard in pointing me to something enormous.)

UPDATE: User S0rin points out that wikipedia requests no crawling and provides this export tool instead. Project Gutenberg has a policy specified here, bottom line, try not to crawl, but if you need to: "Configure your robot to wait at least 2 seconds between requests."
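For concreteness, here is a minimal sketch of the kind of throttled download loop described above, with a hard-coded delay of at least 2 seconds between requests, matching the Project Gutenberg policy just quoted. The URL list, output directory, and file naming are placeholders, not details from the original post:

```python
# Minimal "good citizen" downloader: fetch a handful of plain-text URLs
# sequentially and sleep at least 2 seconds between requests, per the
# Project Gutenberg crawling policy quoted above.
import time
import urllib.request

def fetch_politely(urls, out_dir=".", delay=2.0):
    for i, url in enumerate(urls):
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
        with open(f"{out_dir}/doc_{i:04d}.txt", "wb") as f:
            f.write(data)
        time.sleep(delay)  # never hit the server more than once per `delay` seconds

if __name__ == "__main__":
    # Placeholder URL -- substitute the texts you actually want to mirror.
    fetch_politely(["http://example.org/some-text-file.txt"])
```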

UPDATE 2 The wikipedia dumps are the way to go, thanks to the answerers who pointed them out. I ended up using the English version from here: http://download.wikimedia.org/enwiki/20090306/ , and a Spanish dump about half the size. They are some work to clean up, but well worth it, and they contain a lot of useful data in the links.
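As a rough illustration of that cleanup step (not the exact process used here), the sketch below streams article text out of an uncompressed pages-articles XML dump and strips some of the most common wiki markup with regexes. The file name is only assumed from the dump date mentioned above, and real dumps need considerably more work than this:

```python
# Rough sketch: stream article text out of a MediaWiki XML dump and do a
# first-pass markup cleanup. The regexes below are deliberately crude and
# only scratch the surface of what a real cleanup requires.
import re
import xml.etree.ElementTree as ET

def iter_article_text(dump_path):
    for _, elem in ET.iterparse(dump_path):
        # MediaWiki export XML is namespaced, so match on the local tag name.
        if elem.tag.endswith("}text") and elem.text:
            text = elem.text
            text = re.sub(r"\{\{[^{}]*\}\}", "", text)                     # {{templates}}
            text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[links|labels]]
            text = re.sub(r"<[^>]+>", "", text)                            # leftover HTML tags
            yield text
        elem.clear()  # discard parsed elements so memory stays bounded

if __name__ == "__main__":
    # Assumed file name based on the dump date above; adjust to your download.
    for article in iter_article_text("enwiki-20090306-pages-articles.xml"):
        print(article[:300])
        break
```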

Recommended answer

  • Use the Wikipedia dumps
    • needs lots of cleanup
      • the corpora are usually quite small
      • tagged
      • you can spider your own corpus using their toolkit
      • spoken language, translated

You can always get your own, but be warned: HTML pages often need heavy cleanup, so restrict yourself to RSS feeds.
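For instance, a small sketch of harvesting text from RSS feeds rather than raw HTML might look like the following; it assumes the third-party feedparser library and placeholder feed URLs:

```python
# Sketch: pull (mostly) clean text from RSS feeds instead of scraping HTML.
# Assumes the third-party feedparser library (pip install feedparser);
# the feed URLs are placeholders.
import re
import feedparser

FEEDS = ["http://example.com/blog/rss.xml"]

def collect_feed_text(feed_urls):
    docs = []
    for url in feed_urls:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            body = entry.get("summary", "")        # the RSS <description> field
            body = re.sub(r"<[^>]+>", " ", body)   # strip any residual HTML tags
            docs.append(entry.get("title", "") + "\n" + body)
    return docs

if __name__ == "__main__":
    for doc in collect_feed_text(FEEDS):
        print(doc[:200])
```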

If you do this commercially, the LDC might be a viable alternative.
