How to scrape text and images together?
Problem description
I'm working on a web page scraper with BeautifulSoup4. I want to get both the text and the images of an article, but I've run into a problem.
The HTML looks something like this:
<div>
some texts1
<br />
<img src="imgpic.jpg" />
<br />
some texts2
</div>
I get all of the text with:
post_soup.get_text()
and I save all the images in the div with urllib2, as usual.
Finally I write everything to an HTML page, but that puts all the text at the top and all the images at the end. I want the new HTML page to keep the same order as the page I scraped: first some texts1, then the image, then some texts2.
Any suggestions?
Solution
This is not the best or most correct way, but it should work:
from bs4 import BeautifulSoup
html = "<div>\
some texts1\
<br />\
<img src=\"imgpic.jpg\" />\
<br />\
some texts2\
</div>"
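# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")
# stripped_strings yields each non-empty text node with surrounding whitespace removed;
# collecting it into a list avoids the join/split round trip, which breaks on text containing "+"
text = list(soup.stripped_strings)
print(text[0])
print(soup.find("img")["src"])
print(text[1])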
Output:
some texts1
imgpic.jpg
some texts2
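The answer above hard-codes the positions of the two text pieces and the single image. To keep the original order for any number of text nodes and images, one option is to walk the div's descendants in document order and record each piece as it appears. The following is only a sketch under some assumptions: it is written for Python 3, the helper names `ordered_parts` and `render` are made up for this example, and wrapping each text piece in `<p>` is an arbitrary choice for the output page.

```python
from html import escape

from bs4 import BeautifulSoup, Comment, NavigableString, Tag


def ordered_parts(container):
    """Return ("text", str) and ("img", src) tuples in document order."""
    parts = []
    for node in container.descendants:
        if isinstance(node, Tag):
            # Record only images that actually carry a src attribute
            if node.name == "img" and node.get("src"):
                parts.append(("img", node["src"]))
        elif isinstance(node, NavigableString) and not isinstance(node, Comment):
            stripped = node.strip()
            if stripped:  # skip whitespace-only nodes between tags
                parts.append(("text", stripped))
    return parts


def render(parts):
    """Rebuild a simple HTML fragment that preserves the scraped order."""
    lines = []
    for kind, value in parts:
        if kind == "img":
            lines.append('<img src="%s" />' % escape(value))
        else:
            # Text nodes come back unescaped, so escape them before writing HTML
            lines.append("<p>%s</p>" % escape(value))
    return "\n".join(lines)


html = """<div>
some texts1
<br />
<img src="imgpic.jpg" />
<br />
some texts2
</div>"""

soup = BeautifulSoup(html, "html.parser")
parts = ordered_parts(soup.find("div"))
print(render(parts))
```

If the images are downloaded and saved under different local names, the `src` values in `parts` can be replaced with the local paths before calling `render`.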