Batch downloading text and images from URL with Python / urllib / beautifulsoup?
Question
I have been browsing through several posts here but I just cannot get my head around batch-downloading images and text from a given URL with Python.
import urllib, urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
import os, sys

def getAllImages(url):
    query = urllib2.Request(url)
    user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)"
    query.add_header("User-Agent", user_agent)

    page = BeautifulSoup(urllib2.urlopen(query))
    for div in page.findAll("div", {"class": "thumbnail"}):
        print "found thumbnail"
        for img in div.findAll("img"):
            print "found image"
            src = img["src"]
            if src:
                src = absolutize(src, pageurl)
                f = open(src,'wb')
                f.write(urllib.urlopen(src).read())
                f.close()
        for h5 in div.findAll("h5"):
            print "found Headline"
            value = (h5.contents[0])
            print >> headlines.txt, value

def main():
    getAllImages("http://www.nytimes.com/")
Above is some updated code. What happens is nothing: the code never finds a div with a thumbnail, so none of the print statements produce any output. So I am probably missing some pointers on how to get to the right divs containing the images and headlines?
Thanks a lot!
Recommended answer
The OS you are using doesn't know how to write to the file path you are passing it in src. Make sure that the name you use to save the file to disk is one the OS can actually use:
import os

src = "abc.com/alpha/beta/charlie.jpg"
with open(src, "wb") as f:
    pass
# IOError - cannot open file abc.com/alpha/beta/charlie.jpg

src = "alpha/beta/charlie.jpg"
os.makedirs(os.path.dirname(src))  # create alpha/beta/ first
with open(src, "wb") as f:
    pass  # Golden - write file here
And everything will start working.
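As a rough sketch of that idea, the snippet below maps an image URL to a path under a download directory and creates the missing folders before writing. It is written for Python 3 (where urlparse lives in urllib.parse), and the helper names local_path_for and save_bytes are made up for illustration; they are not part of the original code.

```python
import os
from urllib.parse import urlparse


def local_path_for(url, root_dir):
    """Map a URL to a file path under root_dir (hypothetical helper)."""
    parsed = urlparse(url)
    # Use the host and the URL path; the scheme and query string are dropped.
    candidates = [parsed.netloc] + parsed.path.split("/")
    # Skip empty segments and anything that could climb out of root_dir.
    parts = [p for p in candidates if p not in ("", ".", "..")]
    return os.path.join(root_dir, *parts)


def save_bytes(data, path):
    """Create missing parent directories, then write data to path."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
```

With something like this in place, the bytes read from the image URL can be passed to save_bytes together with local_path_for(src, "downloads"), so the URL itself is never used as a file name.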
A couple of additional thoughts:
- Make sure to normalize the save file path (e.g. os.path.join(some_root_dir, *relative_file_path*)) - otherwise you'll be writing images all over your hard drive depending on their src.
- Unless you are running tests of some kind, it's good to advertise that you are a bot in your user_agent string and honor robots.txt files (or alternately, provide some kind of contact information so people can ask you to stop if they need to).
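The robots.txt point can be sketched with the standard library's urllib.robotparser module (Python 3). The bot name, the contact address, and the sample rules below are placeholders invented for this example, and allowed_by_robots is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot name and contact address; replace with your own.
USER_AGENT = "example-image-bot/0.1 (+mailto:you@example.com)"


def allowed_by_robots(robots_lines, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt lines let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # takes an iterable of robots.txt lines
    return parser.can_fetch(user_agent, url)


# Made-up rules standing in for a real site's robots.txt.
sample_robots = [
    "User-agent: *",
    "Disallow: /private/",
]
```

In a real run you would more likely call set_url() with the site's robots.txt address and then read() to fetch it, and send the same USER_AGENT value in the request's User-Agent header.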