Wikipedia text download

Question

I want to download the full text of Wikipedia for my college project. Do I have to write my own spider to download it, or is there a public dataset of Wikipedia available online?

To give you an overview of my project: I want to find the interesting words in a few articles I care about. To find them, I plan to apply tf-idf to score each word and pick the highest-scoring ones. To compute the idf part, though, I need to know how often each word occurs across the whole of Wikipedia.

How can I do this?
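For reference, a minimal sketch of the tf-idf scoring described above. It assumes per-word document frequencies (how many Wikipedia articles contain each word) are already available; the function name `tf_idf`, the add-one smoothing, and the toy numbers are illustrative assumptions, not part of the original post.

```python
import math
from collections import Counter

def tf_idf(article_tokens, doc_freq, n_docs):
    """Score each word in one article by tf * idf (hypothetical helper)."""
    counts = Counter(article_tokens)
    total = len(article_tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        # Add 1 to the document frequency so unseen words do not divide by zero.
        idf = math.log(n_docs / (1 + doc_freq.get(word, 0)))
        scores[word] = tf * idf
    return scores

# Toy numbers, made up for illustration only.
doc_freq = {"the": 990, "of": 980, "wikipedia": 40, "dump": 5}
tokens = "the dump of the wikipedia dump".split()
scores = tf_idf(tokens, doc_freq, n_docs=1000)
top_words = sorted(scores, key=scores.get, reverse=True)
```

Words that appear in nearly every article ("the", "of") score close to zero, while words that are frequent in the article but rare across the corpus rise to the top.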

Recommended answer

From Wikipedia: http://en.wikipedia.org/wiki/Wikipedia_database

Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.

It seems you are in luck, too. From the dump section:

As of 12 March 2010, the latest complete dump of the English-language Wikipedia can be found at http://download.wikimedia.org/enwiki/20100130/ This is the first complete dump of the English-language Wikipedia to have been created since 2008. Please note that more recent dumps (such as the 20100312 dump) are incomplete.

So the data is only 9 days old :)
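Once a dump is downloaded, the corpus-wide counts needed for idf could be gathered by streaming the XML rather than loading it whole. The following is only a sketch under assumptions: it expects a MediaWiki XML export with one `<text>` element per page (as in the articles-only dumps), treats raw wiki markup as plain text, and uses a hypothetical function name `document_frequencies` and a tiny in-memory sample in place of the real file.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

def document_frequencies(xml_file):
    """Return (doc_freq, n_docs) for a MediaWiki XML export stream."""
    doc_freq = Counter()
    n_docs = 0
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        # Tags carry an export-schema namespace; compare the local name only.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "text":
            # Count each word once per article, ignoring case.
            words = set(re.findall(r"[a-z]+", (elem.text or "").lower()))
            doc_freq.update(words)
            n_docs += 1
        elif tag == "page":
            # Drop the finished page's contents to reduce memory use.
            elem.clear()
    return doc_freq, n_docs

# Tiny made-up sample standing in for a real dump file.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
<page><title>A</title><revision><text>Cats chase mice. Cats sleep.</text></revision></page>
<page><title>B</title><revision><text>Mice eat cheese.</text></revision></page>
</mediawiki>"""
doc_freq, n_docs = document_frequencies(io.BytesIO(sample))
```

For a real dump, a file object opened with `bz2.open(path, "rb")` could be passed instead of the sample, where `path` is a placeholder for wherever the downloaded file was saved. Stripping wiki markup before tokenizing is left out here and would likely improve the counts.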
