Finding and downloading images within the Wikipedia Dump


Problem Description


I'm trying to find a comprehensive list of all images on Wikipedia, which I can then filter down to the public-domain ones. I've downloaded the SQL dumps from here:

http://dumps.wikimedia.org/enwiki/latest/

And studied the DB schema:

http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MediaWiki_1.20_%2844edaa2%29_database_schema.svg/2193px-MediaWiki_1.20_%2844edaa2%29_database_schema.svg.png

I think I understand it, but when I pick a sample image from a Wikipedia page I can't find it anywhere in the dumps. For example:

http://en.wikipedia.org/wiki/File:Carrizo_2a.JPG

I've grepped the 'image', 'imagelinks', and 'page' dumps looking for 'Carrizo_2a.JPG', and it's not found.
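For reference, that kind of search can be streamed straight from the compressed dumps; here is a minimal Python sketch, with the dump file names assumed from the standard "latest" naming scheme:

```python
import gzip

def dump_contains(dump_path, needle):
    """Stream a .sql.gz dump and report whether `needle` appears in it."""
    target = needle.encode("utf-8")
    with gzip.open(dump_path, "rb") as f:
        for line in f:
            if target in line:
                return True
    return False

# Dump file names are assumptions based on the "latest" naming scheme.
for dump in ("enwiki-latest-image.sql.gz",
             "enwiki-latest-imagelinks.sql.gz",
             "enwiki-latest-page.sql.gz"):
    print(dump, dump_contains(dump, "Carrizo_2a.JPG"))
```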

Are these dumps not complete? Am I misunderstanding the structure? Is there a better way to do this?

Also, to jump ahead one step: after I have filtered my list down and want to download a bulk set of images (thousands), I saw some mentions that I need to do this from a mirror of the site to prevent overloading Wikipedia/Wikimedia. If anyone has guidance on this too, that would be helpful.

Solution

MediaWiki stores file data in two or three places, depending on how you count:

  • The actual metadata for current file versions is stored in the image table. This is probably what you primarily want; you'll find the latest en.wikipedia dump of it here. (A parsing sketch follows this list.)

  • Data for old superseded file revisions is moved to the oldimage table, which has basically the same structure as the image table. This table is also dumped; the latest one is here.

  • Finally, each file also (normally) corresponds to a pretty much ordinary wiki page in namespace 6 (File:). You'll find the text of these in the XML dumps, same as for any other pages.
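For example, here is a rough sketch of pulling the img_name values out of that image table dump (the dump file name and the line-by-line INSERT parsing are assumptions; a real parser should handle SQL escaping more robustly):

```python
import gzip
import re

# img_name is the first column of each row tuple: ('File_name.jpg', ...
ROW_START = re.compile(rb"\('((?:[^'\\]|\\.)*)',")

def iter_image_names(dump_path):
    """Yield file names from an image-table SQL dump (rough sketch)."""
    with gzip.open(dump_path, "rb") as f:
        for line in f:
            if not line.startswith(b"INSERT INTO `image`"):
                continue
            for match in ROW_START.finditer(line):
                yield match.group(1).decode("utf-8", errors="replace")

if __name__ == "__main__":
    # Peek at the first few names; the dump file name is an assumption.
    for i, name in enumerate(iter_image_names("enwiki-latest-image.sql.gz")):
        print(name)
        if i >= 20:
            break
```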

Oh, and the reason you're not finding those files you linked to in the English Wikipedia dumps is that they're from the shared repository at Wikimedia Commons. You'll find them in the Commons data dumps instead.

As for downloading the actual files, here's the (apparently) official documentation. As far as I can tell, all they mean by "Bulk download is currently (as of September 2012) available from mirrors but not offered directly from Wikimedia servers." is that if you want all the images in a tarball, you'll have to use a mirror. If you're only pulling a relatively small subset of the millions of images on Wikipedia and/or Commons, it should be fine to use the Wikimedia servers directly.

Just remember to exercise basic courtesy: send a user-agent string identifying yourself and don't hit the servers too hard. In particular, I'd recommend running the downloads sequentially, so that you only start downloading the next file after you've finished the previous one. Not only is that easier to implement than parallel downloading anyway, but it ensures that you don't hog more than your share of the bandwidth and allows the download speed to more or less automatically adapt to server load.
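A minimal sketch of such a sequential, self-identifying downloader (the user-agent string, pause length, and example URL are placeholders):

```python
import time
import urllib.error
import urllib.request
from pathlib import Path

# Placeholder user-agent string; replace with something that identifies you.
USER_AGENT = "my-image-research-bot/0.1 (contact: you@example.com)"

def download_sequentially(urls, dest_dir="downloads", pause_seconds=1.0):
    """Fetch each URL in turn, one at a time, with a polite pause in between."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    for url in urls:
        out_path = Path(dest_dir) / url.rsplit("/", 1)[-1]
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
                out.write(response.read())
        except urllib.error.URLError as err:
            print("failed:", url, err)
        time.sleep(pause_seconds)  # don't hit the servers too hard

# Example usage with the URL pattern from the answer (placeholder file name):
# download_sequentially(["http://upload.wikimedia.org/wikipedia/en/a/ab/File_name.jpg"])
```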

P.S. Whether you download the files from a mirror or directly from the Wikimedia servers, you're going to need to figure out which directory they're in. Typical Wikipedia file URLs look like this:

http://upload.wikimedia.org/wikipedia/en/a/ab/File_name.jpg

where the "wikipedia/en" part identifies the Wikimedia project and language (for historical reasons, Commons is listed as "wikipedia/commons") and the "a/ab" part is given by the first two hex digits of the MD5 hash of the filename in UTF-8 (as they're encoded in the database dumps).
