Upload images from a web page
Question
I want to implement a feature similar to http://www.tineye.com/parse?url=yahoo.com: allow the user to upload images from any web page.
My main problem is that it takes too much time for web pages with a large number of images.
I'm doing this in Django (using curl or urllib) according to the following scheme:
Grab the HTML of the page (takes about 1 sec for big pages):
import urllib.request
response = urllib.request.urlopen(requested_url)
html_string = response.read()
Parse it with an HTML parser (BeautifulSoup), looking for img tags and writing the src of every image to a list (this also takes about 1 sec for big pages).
Check the sizes of all images in my list and, if they are big enough, return them in the JSON response (this takes very long, about 15 sec when there are about 80 images on a web page). Here's the code of the function:
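import urllib.request
from PIL import ImageFile

def get_image_size(uri):
    """Return (width, height) read from the first 1 KB of the file, or None."""
    response = urllib.request.urlopen(uri)
    try:
        data = response.read(1024)
        if not data:
            return None
        parser = ImageFile.Parser()
        parser.feed(data)
        if parser.image:
            return parser.image.size
        # Not an image, or the header did not fit in the first 1 KB
        return None
    finally:
        # Close the connection on every path, including the early returns
        response.close()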
As you can see, I'm not loading the full image to get its size, only the first 1 KB of it. But it still takes too much time when there are a lot of images (I call this function once for each image found).
So how can I make it work faster?
Is there perhaps a way to avoid making a request for every single image?
Any help would be highly appreciated.
Thanks!
Recommended answer
I can think of a few optimizations:
- parse the HTML as you read it from the stream
- use a SAX parser (which works well with the point above)
- use HEAD requests to get the size of the images (the Content-Length header reports the file size in bytes, not the pixel dimensions)
- put the images in a queue, then use a few threads to connect and get the file sizes
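The first two points can be sketched with the standard library alone. This is a minimal sketch, assuming Python 3 and using `html.parser.HTMLParser` as a stand-in for a SAX parser (it is event-driven and accepts input incrementally, but it is not a real SAX implementation). The names `ImgSrcCollector` and `collect_img_srcs` are hypothetical and not part of the original answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Event-driven parser that records the src of every <img> tag it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <img ... />.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.sources.append(urljoin(self.base_url, src))


def collect_img_srcs(chunks, base_url):
    """Feed text chunks to the parser one at a time, as they arrive."""
    parser = ImgSrcCollector(base_url)
    for chunk in chunks:
        parser.feed(chunk)  # incomplete tags are buffered until the next chunk
    parser.close()
    return parser.sources
```

Note that `feed` expects text, so bytes read from a socket would first need to be decoded (for example with an incremental decoder from `codecs`); that step is left out here.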
Example of a HEAD request:
$ telnet m.onet.pl 80
Trying 213.180.150.45...
Connected to m.onet.pl.
Escape character is '^]'.
HEAD /_m/33fb7563935e11c0cba62f504d91675f,59,29,134-68-525-303-0.jpg HTTP/1.1
host: m.onet.pl

HTTP/1.0 200 OK
Server: nginx/0.8.53
Date: Sat, 09 Apr 2011 18:32:44 GMT
Content-Type: image/jpeg
Content-Length: 37545
Last-Modified: Sat, 09 Apr 2011 18:29:22 GMT
Expires: Sat, 16 Apr 2011 18:32:44 GMT
Cache-Control: max-age=604800
Accept-Ranges: bytes
Age: 6575
X-Cache: HIT from emka1.m10r2.onet
Via: 1.1 emka1.m10r2.onet:80 (squid)
Connection: close
Connection closed by foreign host.
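The same HEAD request can be issued from Python, and the requests can run in parallel. This is a minimal sketch, assuming Python 3; `head_content_length` and `sizes_concurrently` are hypothetical names. A `ThreadPoolExecutor`, which manages its work queue internally, stands in for a hand-written queue with worker threads. The value returned is the file size in bytes from `Content-Length`, which a server may omit, and it says nothing about the pixel dimensions.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def head_content_length(url, timeout=5):
    """Issue a HEAD request and return Content-Length in bytes, or None."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            length = response.headers.get("Content-Length")
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
    return int(length) if length and length.isdigit() else None


def sizes_concurrently(urls, fetch=head_content_length, max_workers=16):
    """Call fetch(url) for every URL on a thread pool; return {url: result}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))
```

The `fetch` parameter is injectable so that a different check, such as a function that reads the first kilobyte and parses the dimensions, could be run on the same pool.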