NLP: Building (small) corpora, or "Where to get lots of not-too-specialized English-language text files?"

Question

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of Usenet movie reviews, which hadn't occurred to me and is very good. For this particular program, technical Usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, chat transcripts, or anything that has been useful to others would be very helpful. Also, a partial or downloadable research corpus that isn't too marked up, some heuristic for finding an appropriate subset of Wikipedia articles, or any other idea would be much appreciated.

(BTW, I am being a good citizen w/r/t downloading, using a deliberately slow script that is not demanding on servers hosting such material, in case you perceive a moral hazard in pointing me to something enormous.)

UPDATE: User S0rin points out that Wikipedia asks not to be crawled and provides this export tool instead. Project Gutenberg has a policy, specified here; the bottom line is to avoid crawling, but if you must: "Configure your robot to wait at least 2 seconds between requests."
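
A minimal sketch of such a polite fetcher, assuming a hypothetical list of plain-text URLs; the 2-second floor simply mirrors the guideline quoted above:

```python
# Minimal sketch of a polite downloader: fetch a handful of plain-text URLs
# sequentially and sleep between requests so the host is never hammered.
# The URL list and output directory are placeholders.
import time
import urllib.request
from pathlib import Path

URLS = [
    "https://example.org/texts/sample1.txt",  # hypothetical sources
    "https://example.org/texts/sample2.txt",
]
OUT_DIR = Path("corpus_raw")
MIN_DELAY_SECONDS = 2.0  # "wait at least 2 seconds between requests"

def fetch_politely(urls, out_dir, delay=MIN_DELAY_SECONDS):
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(urls):
        if i:  # no need to wait before the first request
            time.sleep(delay)
        with urllib.request.urlopen(url, timeout=30) as resp:
            (out_dir / f"doc_{i:05d}.txt").write_bytes(resp.read())

if __name__ == "__main__":
    fetch_politely(URLS, OUT_DIR)
```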

UPDATE 2: The Wikipedia dumps are the way to go, thanks to the answerers who pointed them out. I ended up using the English version from here: http://download.wikimedia.org/enwiki/20090306/ , and a Spanish dump about half the size. They take some work to clean up, but are well worth it, and they contain a lot of useful data in the links.
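
For what it's worth, a rough sketch of a first cleanup pass, assuming a bz2-compressed pages-articles dump; the file name and regexes below are illustrative only, and a real pass needs considerably more work (templates, tables, references, infoboxes):

```python
# Rough sketch: stream raw article wikitext out of a pages-articles XML dump
# and apply a first-pass markup cleanup. File name and regexes are
# illustrative; real dumps need much more thorough cleaning.
import bz2
import re
import xml.etree.ElementTree as ET

def iter_article_text(dump_path):
    """Yield the raw wikitext of each page in a bz2-compressed dump."""
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            # The export namespace varies by dump version, so match on the
            # local tag name rather than a hard-coded namespace URI.
            if elem.tag.rsplit("}", 1)[-1] == "text" and elem.text:
                yield elem.text
            elem.clear()  # drop parsed content we no longer need

def crude_clean(wikitext):
    """Very rough wiki-markup stripping -- good enough for a first pass."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # simple {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] links
    text = re.sub(r"'{2,}", "", text)                              # bold/italic markup
    text = re.sub(r"<[^>]+>", "", text)                            # leftover HTML tags
    return text

if __name__ == "__main__":
    for raw in iter_article_text("enwiki-20090306-pages-articles.xml.bz2"):
        print(crude_clean(raw)[:200])
        break  # just show the first article
```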

Answer

  • Use the Wikipedia dumps
    • needs lots of cleanup
  • See if anything in nltk-data helps (a short sketch follows this list)
    • the corpora are usually quite small
  • The WaCky people have some free corpora
    • tagged
    • you can spider your own corpus using their toolkit
  • Europarl is free
    • spoken language, translated
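
If nltk-data is worth a look, here is a quick, hedged sketch of peeking at one of its bundled corpora (the Brown corpus is just an example; the others work the same way):

```python
# Quick sketch: download one bundled nltk-data corpus and see how much text
# it actually provides. Brown is only an example corpus.
import nltk

nltk.download("brown", quiet=True)   # fetches into the local nltk_data directory
from nltk.corpus import brown

print(len(brown.words()))            # roughly 1.2M tokens -- "usually quite small"
print(" ".join(brown.words()[:40]))  # a peek at the running text
```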

You can always get your own, but be warned: HTML pages often need heavy cleanup, so restrict yourself to RSS feeds.
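
A minimal sketch of that route, assuming ordinary RSS 2.0 feeds (the feed URL is a placeholder): take each entry's title and description and strip whatever HTML is embedded in them before the text goes into the corpus.

```python
# Minimal sketch: turn RSS 2.0 entries into plain-text snippets. The feed URL
# is a placeholder; Atom feeds would need slightly different element names.
import re
import urllib.request
import xml.etree.ElementTree as ET
from html import unescape

def rss_items_as_text(feed_url):
    with urllib.request.urlopen(feed_url, timeout=30) as resp:
        root = ET.fromstring(resp.read())
    for item in root.iter("item"):                       # RSS 2.0 entries
        title = item.findtext("title") or ""
        desc = item.findtext("description") or ""
        text = re.sub(r"<[^>]+>", " ", f"{title}. {desc}")  # strip embedded HTML
        yield re.sub(r"\s+", " ", unescape(text)).strip()

if __name__ == "__main__":
    for snippet in rss_items_as_text("https://example.com/feed.xml"):
        print(snippet)
```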

If you do this commercially, the LDC might be a viable alternative.
