How do I extract images from HTML files in a directory?
Question
This is a follow-up to this question: How do I parse every html file in a directory for images? Essentially, I have a directory of HTML files, each of which contains images that I would like to save separately in the same directory.
After making the suggested changes to the program, I am still getting an error:
Image: theme/pfeil_grau.gif
Traceback (most recent call last):
  File "C:\Users\gokalraina\Desktop\modfile.py", line 25, in <module>
    im = Image.open(image)
  File "C:\Python27\lib\site-packages\PIL\Image.py", line 1956, in open
    prefix = fp.read(16)
TypeError: 'NoneType' object is not callable
This is the revised code (thanks to nightcracker) that I am using.
import os, os.path
import Image
from BeautifulSoup import BeautifulSoup as bs

path = 'C:\Users\gokalraina\Desktop\derm images'
for root, dirs, files in os.walk(path):
    for f in files:
        soup = bs(open(os.path.join(root, f)).read())
        for image in soup.findAll("img"):
            print "Image: %(src)s" % image
            im = Image.open(image)
            im.save(path+image["src"], "JPEG")
Recommended answer
The code is passing a BeautifulSoup.Tag object to Image.open, but Image.open is expecting a path or a file object. You can get the relative path to the image with image["src"], so the code would be:
im = Image.open(image["src"])
However, that path is the same path written in the HTML file, which is probably a relative path starting from the HTML file's directory. If so, joining root to image["src"] will get the absolute path for each image:
im = Image.open(os.path.join(root, image["src"]))
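The two steps above (read each tag's src, then join it to root) can be sketched as a small helper. This is only an illustration under stated assumptions: it targets Python 3 and swaps in the standard library's html.parser for BeautifulSoup so that it runs without extra packages. The names ImgSrcCollector and image_paths are hypothetical and do not appear in the original code.

```python
import os
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tags without src are skipped.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_paths(root, html_text):
    """Return on-disk paths for the images referenced in one HTML file.

    root is the directory that contains the HTML file, as yielded by os.walk.
    Assumes every src is a path relative to that directory.
    """
    collector = ImgSrcCollector()
    collector.feed(html_text)
    return [os.path.normpath(os.path.join(root, src))
            for src in collector.sources]
```

Each returned path is a plain string, which is what Image.open expects, so the inner loop of the original program could iterate over image_paths(root, html_text) instead of over the tag objects. A src that is an absolute URL would need separate handling, which this sketch does not attempt.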