How do I extract images from html files in a directory?


Question

This is a followup to this question: How do I parse every html file in a directory for images? Essentially, I have a directory of html files, each of which contains images that I would like to save separately in the same directory.

After making the suggested changes to the program, I am still getting an error:

Image: theme/pfeil_grau.gif

Traceback (most recent call last):
  File "C:\Users\gokalraina\Desktop\modfile.py", line 25, in <module>
    im = Image.open(image)
  File "C:\Python27\lib\site-packages\PIL\Image.py", line 1956, in open
    prefix = fp.read(16)
TypeError: 'NoneType' object is not callable

This is the revised code (thanks to nightcracker) that I am using.

import os, os.path
import Image
from BeautifulSoup import BeautifulSoup as bs

path = 'C:\Users\gokalraina\Desktop\derm images'

for root, dirs, files in os.walk(path):
    for f in files:
        soup = bs(open(os.path.join(root, f)).read())
        for image in soup.findAll("img"):
            print "Image: %(src)s" % image
            im = Image.open(image)
            im.save(path+image["src"], "JPEG")

Recommended answer

The code is passing a BeautifulSoup.Tag object to Image.open, but Image.open is expecting a path or a file object. You can get the relative path to the image with image["src"], so the code would be:

im = Image.open(image["src"])

However, that path is the same path written in the HTML file, which is probably a relative path starting from the HTML file's directory. If so, joining root to image["src"] will get the absolute path for each image:

im = Image.open(os.path.join(root, image["src"]))
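
The path resolution described above can be sketched on its own, separately from opening the images. This is a minimal sketch under several assumptions, not the asker's program: it is written for Python 3, it swaps in the standard library's html.parser for BeautifulSoup so that it runs without third-party packages, it only parses files ending in .html or .htm, and it reads them as UTF-8. The names ImgSrcCollector and find_image_paths are hypothetical helpers introduced for illustration.

```python
import os
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only <img> tags with a src
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_image_paths(path):
    """Yield (html_file, image_path) for every <img src=...> found under path."""
    for root, dirs, files in os.walk(path):
        for f in files:
            # Assumption: only .html/.htm files are parsed, so the image
            # files sitting in the same tree are not fed to the HTML parser.
            if not f.lower().endswith((".html", ".htm")):
                continue
            html_file = os.path.join(root, f)
            collector = ImgSrcCollector()
            # Assumption: the files are UTF-8; undecodable bytes are replaced.
            with open(html_file, encoding="utf-8", errors="replace") as fh:
                collector.feed(fh.read())
            for src in collector.sources:
                # src is a plain string, relative to the HTML file's
                # directory, which is root for the file being processed.
                yield html_file, os.path.normpath(os.path.join(root, src))
```

Each image_path yielded here is a string, which is what Image.open accepts, unlike the tag object itself. The sketch assumes every src is a relative file path; a src that is an absolute path or a URL such as http://... would need separate handling before being joined to root.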
