How to scrape text and images together?

Views: 58 Published: 2021/6/26 20:00:14 Tags: python-2.7 web-scraping beautifulsoup

Problem description

I'm working on a web page scraper with BeautifulSoup 4. I want to get the text and images of an article, but I've run into a problem. The HTML looks something like this:

<div>
 some texts1
 <br />
 <img src="imgpic.jpg" />
 <br />
 some texts2
</div>

I get the whole text with this:

post_soup.get_text()

Then I save all the images in the div with urllib2 as usual. Finally I write everything to an HTML page, but all the text ends up at the top and all the images at the bottom. I want the new HTML page to keep the same order as the page I scraped: first some texts1, then the image, then some texts2.

Any suggestions, please?

Solution

This is not the best or most correct way, but it should work for the snippet above:

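from bs4 import BeautifulSoup

html = """<div>
 some texts1
 <br />
 <img src="imgpic.jpg" />
 <br />
 some texts2
</div>"""

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(html, "html.parser")
# stripped_strings yields each non-empty text node with whitespace trimmed.
# list() avoids the join/split on "+", which breaks if the text contains "+".
text = list(soup.stripped_strings)

# Single-argument print(...) runs on both Python 2.7 and Python 3.
print(text[0])
print(soup.find("img")["src"])
print(text[1])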

Output:

some texts1
imgpic.jpg
some texts2
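
Note that the snippet above assumes exactly one image sitting between two text blocks. For a div with any number of text runs and images, one possible approach is to walk the div's descendants in document order and emit each piece as you meet it. The following is only a sketch under some assumptions: it is written for Python 3 (on Python 2.7, `html.escape` is not available and `cgi.escape` is the rough equivalent), the helper names `ordered_parts` and `rebuild_html` are made up for illustration, and wrapping each text run in a `<p>` is just one choice of output markup. It leaves the `src` values untouched, so downloading the images still has to be done separately.

```python
from html import escape

from bs4 import BeautifulSoup, Comment, NavigableString, Tag


def ordered_parts(container):
    """Yield ("text", string) and ("img", src) tuples in document order.

    Hypothetical helper: walks every descendant of `container` once.
    """
    for node in container.descendants:
        if isinstance(node, Comment):
            # HTML comments are NavigableString subclasses; skip them.
            continue
        if isinstance(node, Tag):
            if node.name == "img" and node.get("src"):
                yield ("img", node["src"])
        elif isinstance(node, NavigableString):
            stripped = node.strip()
            if stripped:
                yield ("text", stripped)


def rebuild_html(parts):
    """Turn the ordered parts back into simple HTML, one element per line."""
    lines = []
    for kind, value in parts:
        if kind == "img":
            lines.append('<img src="%s" />' % escape(value, quote=True))
        else:
            lines.append("<p>%s</p>" % escape(value))
    return "\n".join(lines)


html_doc = """<div>
 some texts1
 <br />
 <img src="imgpic.jpg" />
 <br />
 some texts2
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
parts = list(ordered_parts(soup.div))
page = rebuild_html(parts)
print(page)
```

Because the text and the images come out of a single pass over the tree, their relative order in the new page matches the scraped page, which is what the question asks for.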
