Scraping images using Beautiful Soup


Question


I'm trying to scrape the image from an article using Beautiful Soup. It seems to work, but I can't open the image: I get a file-format error every time I try to open it from my desktop. Any insights?

import time
import urllib2  # Python 2 only; use urllib.request in Python 3
from bs4 import BeautifulSoup

timestamp = time.asctime()

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Create a new file to write content to
txt = open('%s.jpg' % timestamp, "wb")

# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for link in links:
    link = link["src"].split("src=")[-1]
    download_img = urllib2.urlopen(link)
    txt.write('\n' + "Image(s): " + download_img.read() + '\n' + '\n')

txt.close()

Answer


You are appending a newline and text to the start of the data for every image, essentially corrupting it.


Also, you are writing every image into the same file, again corrupting them.
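A quick illustration (not from the original answer) of the first problem: image viewers identify a JPEG by its leading magic bytes (`FF D8`), so any text written before the raw payload makes the file unreadable.

```python
# A valid JPEG must begin with the magic bytes FF D8.
jpeg_bytes = b'\xff\xd8\xff\xe0' + b'\x00' * 16  # minimal stand-in for real image data

# What the question's code effectively produces: extra text before the payload.
corrupted = b'\n' + b"Image(s): " + jpeg_bytes

print(jpeg_bytes[:2] == b'\xff\xd8')   # header intact
print(corrupted[:2] == b'\xff\xd8')    # header clobbered by the prefix
```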


Put the logic for writing the files inside the loop, don't add any extra data to the images, and it should work fine.

# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for link in links:
    # One file per image; note time.asctime() has one-second resolution,
    # so a very fast loop could reuse the same filename.
    timestamp = time.asctime()
    txt = open('%s.jpg' % timestamp, "wb")
    link = link["src"].split("src=")[-1]
    download_img = urllib2.urlopen(link)
    txt.write(download_img.read())
    txt.close()
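The per-image write logic can also be sketched without the network call. The helper below, `save_images`, is hypothetical (not part of the original answer): it takes already-downloaded byte payloads and writes each one to its own `.jpg` file, with an index in the name to avoid collisions within the same second.

```python
import os
import tempfile
import time

def save_images(blobs):
    """Write each raw image payload in `blobs` (a list of bytes) to its
    own .jpg file and return the paths written. Hypothetical helper."""
    paths = []
    for i, data in enumerate(blobs):
        # Replace ':' (invalid in Windows filenames) and append an index
        # so files written within the same second don't overwrite each other.
        name = '%s-%d.jpg' % (time.asctime().replace(':', '-'), i)
        path = os.path.join(tempfile.gettempdir(), name)
        with open(path, 'wb') as fh:  # binary mode, image bytes only
            fh.write(data)
        paths.append(path)
    return paths

paths = save_images([b'\xff\xd8first', b'\xff\xd8second'])
print(len(paths))  # one file per image
```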
