Finding and downloading images within the Wikipedia Dump


Problem Description


I'm trying to find a comprehensive list of all images on Wikipedia, which I can then filter down to the public-domain ones. I've downloaded the SQL dumps from here:

http://dumps.wikimedia.org/enwiki/latest/

And studied the DB schema:

http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MediaWiki_1.20_%2844edaa2%29_database_schema.svg/2193px-MediaWiki_1.20_%2844edaa2%29_database_schema.svg.png

I think I understand it, but when I pick a sample image from a Wikipedia page I can't find it anywhere in the dumps. For example:

http://en.wikipedia.org/wiki/File:Carrizo_2a.JPG

I've grepped the 'image', 'imagelinks', and 'page' dumps looking for 'Carrizo_2a.JPG', and it's not found.
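For reference, that kind of search can be streamed straight from the compressed dumps; here is a minimal Python sketch, with the dump file names assumed from the standard "latest" naming scheme:

```python
import gzip

def dump_contains(dump_path, needle):
    """Stream a .sql.gz dump and report whether `needle` appears in it."""
    target = needle.encode("utf-8")
    with gzip.open(dump_path, "rb") as f:
        for line in f:
            if target in line:
                return True
    return False

# Dump file names are assumptions based on the "latest" naming scheme.
for dump in ("enwiki-latest-image.sql.gz",
             "enwiki-latest-imagelinks.sql.gz",
             "enwiki-latest-page.sql.gz"):
    print(dump, dump_contains(dump, "Carrizo_2a.JPG"))
```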

Are these dumps not complete? Am I misunderstanding the structure? Is there a better way to do this?

Also, to jump ahead one step: after I have filtered my list down and want to download a bulk set of images (thousands), I saw some mentions that I need to do this from a mirror of the site to prevent overloading Wikipedia/Wikimedia. If anyone has guidance on this too, that would be helpful.

Solution

MediaWiki stores file data in two or three places, depending on how you count:

  • The actual metadata for current file versions is stored in the image table. This is probably what you primarily want; you'll find the latest en.wikipedia dump of it here. (A parsing sketch follows this list.)

  • Data for old superseded file revisions is moved to the oldimage table, which has basically the same structure as the image table. This table is also dumped; the latest one is here.

  • Finally, each file also (normally) corresponds to a pretty much ordinary wiki page in namespace 6 (File:). You'll find the text of these in the XML dumps, same as for any other pages.
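For example, here is a rough sketch of pulling the img_name values out of that image table dump (the dump file name and the line-by-line INSERT parsing are assumptions; a real parser should handle SQL escaping more robustly):

```python
import gzip
import re

# img_name is the first column of each row tuple: ('File_name.jpg', ...
ROW_START = re.compile(rb"\('((?:[^'\\]|\\.)*)',")

def iter_image_names(dump_path):
    """Yield file names from an image-table SQL dump (rough sketch)."""
    with gzip.open(dump_path, "rb") as f:
        for line in f:
            if not line.startswith(b"INSERT INTO `image`"):
                continue
            for match in ROW_START.finditer(line):
                yield match.group(1).decode("utf-8", errors="replace")

if __name__ == "__main__":
    # Peek at the first few names; the dump file name is an assumption.
    for i, name in enumerate(iter_image_names("enwiki-latest-image.sql.gz")):
        print(name)
        if i >= 20:
            break
```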

Oh, and the reason you're not finding those files you linked to in the English Wikipedia dumps is that they're from the shared repository at Wikimedia Commons. You'll find them in the Commons data dumps instead.

As for downloading the actual files, here's the (apparently) official documentation. As far as I can tell, all they mean by "Bulk download is currently (as of September 2012) available from mirrors but not offered directly from Wikimedia servers." is that if you want all the images in a tarball, you'll have to use a mirror. If you're only pulling a relatively small subset of the millions of images on Wikipedia and/or Commons, it should be fine to use the Wikimedia servers directly.

Just remember to exercise basic courtesy: send a user-agent string identifying yourself and don't hit the servers too hard. In particular, I'd recommend running the downloads sequentially, so that you only start downloading the next file after you've finished the previous one. Not only is that easier to implement than parallel downloading anyway, but it ensures that you don't hog more than your share of the bandwidth and allows the download speed to more or less automatically adapt to server load.
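A minimal sketch of such a sequential, self-identifying downloader (the user-agent string, pause length, and example URL are placeholders):

```python
import time
import urllib.error
import urllib.request
from pathlib import Path

# Placeholder user-agent string; replace with something that identifies you.
USER_AGENT = "my-image-research-bot/0.1 (contact: you@example.com)"

def download_sequentially(urls, dest_dir="downloads", pause_seconds=1.0):
    """Fetch each URL in turn, one at a time, with a polite pause in between."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    for url in urls:
        out_path = Path(dest_dir) / url.rsplit("/", 1)[-1]
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
                out.write(response.read())
        except urllib.error.URLError as err:
            print("failed:", url, err)
        time.sleep(pause_seconds)  # don't hit the servers too hard

# Example usage with the URL pattern from the answer (placeholder file name):
# download_sequentially(["http://upload.wikimedia.org/wikipedia/en/a/ab/File_name.jpg"])
```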

P.S. Whether you download the files from a mirror or directly from the Wikimedia servers, you're going to need to figure out which directory they're in. Typical Wikipedia file URLs look like this:

http://upload.wikimedia.org/wikipedia/en/a/ab/File_name.jpg

where the "wikipedia/en" part identifies the Wikimedia project and language (for historical reasons, Commons is listed as "wikipedia/commons") and the "a/ab" part is given by the first two hex digits of the MD5 hash of the filename in UTF-8 (as they're encoded in the database dumps).
