Upload images from a web page


Problem description

I want to implement a feature similar to http://www.tineye.com/parse?url=yahoo.com: allow users to upload images from any web page.

The main problem for me is that it takes too much time for web pages with a large number of images.

I'm doing this in Django (using curl or urllib) according to the following scheme:


  1. Grab the HTML of the page (takes about 1 sec for big pages):

file = urllib.urlopen(requested_url)
html_string = file.read()


  2. Parse it with an HTML parser (BeautifulSoup), look for img tags, and write the src of every image to a list (also takes about 1 sec for big pages).

  3. Check the sizes of all images in the list and, if they are big enough, return them in a JSON response (this takes very long, about 15 sec, when there are about 80 images on a web page). Here's the code of the function:



    
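    import urllib
    from PIL import ImageFile  # assumption: ImageFile comes from PIL/Pillow

    def get_image_size(uri):
        # Python 2 API (urllib.urlopen), as in the rest of the question.
        file = urllib.urlopen(uri)
        try:
            p = ImageFile.Parser()
            # Read only the first 1 KB instead of the whole image.
            data = file.read(1024)
            if not data:
                return None
            p.feed(data)
            if p.image:
                return p.image.size
            # not an image
            return None
        finally:
            # Close the connection on every return path.
            file.close()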
    

    As you can see, I'm not loading the full image to get its size, only 1 KB of it. But it still takes too much time when there are a lot of images (I'm calling this function once for each image found).

    So how can I make it work faster?

    Is there perhaps a way to avoid making a request for every single image?

    Any help would be much appreciated.

    Thanks!

    Recommended answer

    I can think of a few optimizations:

    1. parse as you are reading the file from the stream
    2. use a SAX parser (which works well with the point above)
    3. use HEAD to get the size of the images
    4. use a queue to hold the images, then use a few threads to connect and get the file sizes
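
Points 1 and 2 can be sketched roughly as follows. This is only an illustration, assuming Python 3 and the standard library's event-driven `html.parser.HTMLParser` in place of a true SAX parser; the names `ImgSrcCollector` and `collect_img_sources` are made up for this example and do not come from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Event-driven parser that records img src values as tags arrive."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; no document tree is built.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.sources.append(urljoin(self.base_url, src))


def collect_img_sources(chunks, base_url):
    """Feed the parser piece by piece (hypothetical helper)."""
    parser = ImgSrcCollector(base_url)
    for chunk in chunks:
        # Incomplete tags at the end of a chunk are buffered until the next feed.
        parser.feed(chunk)
    parser.close()
    return parser.sources
```

In practice the chunks would come from repeated `read()` calls on the HTTP response, decoded to text, so image URLs become available before the whole page has been downloaded.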

    Example of a HEAD request:

    $ telnet m.onet.pl 80
    Trying 213.180.150.45...
    Connected to m.onet.pl.
    Escape character is '^]'.
    HEAD /_m/33fb7563935e11c0cba62f504d91675f,59,29,134-68-525-303-0.jpg HTTP/1.1
    host: m.onet.pl
    
    HTTP/1.0 200 OK
    Server: nginx/0.8.53
    Date: Sat, 09 Apr 2011 18:32:44 GMT
    Content-Type: image/jpeg
    Content-Length: 37545
    Last-Modified: Sat, 09 Apr 2011 18:29:22 GMT
    Expires: Sat, 16 Apr 2011 18:32:44 GMT
    Cache-Control: max-age=604800
    Accept-Ranges: bytes
    Age: 6575
    X-Cache: HIT from emka1.m10r2.onet
    Via: 1.1 emka1.m10r2.onet:80 (squid)
    Connection: close
    
    Connection closed by foreign host.
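
The same HEAD request can be issued from Python, and several of them can run at once. The sketch below assumes Python 3 and uses `concurrent.futures.ThreadPoolExecutor` as a stand-in for a hand-written queue with worker threads; `head_content_length` and `large_images` are hypothetical names. Note that `Content-Length` is the file size in bytes, not the pixel dimensions the question's function returns, and a server may omit it, so this works best as a cheap pre-filter.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def head_content_length(url, timeout=5):
    """Return the Content-Length reported for url, or None if unavailable."""
    try:
        request = Request(url, method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            value = response.headers.get("Content-Length")
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs. Treat both as "unknown".
        return None
    return int(value) if value and value.isdigit() else None


def large_images(urls, min_bytes, fetch=head_content_length, workers=8):
    """Query all urls concurrently and keep those of at least min_bytes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order while the requests overlap in time.
        lengths = list(pool.map(fetch, urls))
    return [(u, n) for u, n in zip(urls, lengths)
            if n is not None and n >= min_bytes]
```

The `fetch` parameter is there so that a different size check, for example one that reads the first kilobyte and parses the image header, can be run through the same thread pool.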
    
