Print URL from two different BeautifulSoup outputs

Question

I am scraping a few URLs in batch using BeautifulSoup.

Here is my script (only the relevant parts):

import urllib2
from bs4 import BeautifulSoup
quote_page = 'https://example.com/foo/bar'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
url_box = soup.find('div', attrs={'class': 'player'})
print url_box

This prints two different kinds of output depending on the HTML of the page (about half of the pages give the first output and the rest give the second).

Here is the first kind of output:

<div class="player">
<video class="video-js vjs-fluid video-player" height="100%" id="some-player" poster="https://example.com/path/to/jpg/random.jpg" width="100%"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>

Here is the other:

<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>

I want to extract the image URL, which is the poster attribute in the first output and the src attribute in the second.

Any ideas how I can do that, so that the same script extracts the URL from either kind of output?

P.S. The first output also has an mp4 link, which I do not need.

Recommended answer

You can use the get() method of the targeted tag to read the value of one of its attributes.

You should be able to do something like this:

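if url_box.find('video'):
    # First layout: the image URL is the video's poster attribute.
    url = url_box.find('video').get('poster')
    # The mp4 link, in case it is ever needed (the question does not need it).
    mp4 = url_box.find('span').get('data-url')
if url_box.find('img'):
    # Second layout: the image URL is the img's src attribute.
    url = url_box.find('img').get('src')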
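
For reference, the same idea can be wrapped in a small helper that works on both layouts and returns None when neither is present. This is only a sketch, assuming Python 3 and the bs4 package; the function name extract_image_url and the variables video_page and image_page are hypothetical, and the HTML strings are copied from the two outputs in the question.

```python
from bs4 import BeautifulSoup


def extract_image_url(html):
    """Return the image URL from the div.player block, or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    url_box = soup.find('div', attrs={'class': 'player'})
    if url_box is None:
        # The page has no player block at all.
        return None

    # First layout: a <video> tag whose poster attribute holds the image URL.
    video = url_box.find('video')
    if video is not None and video.get('poster'):
        return video.get('poster')

    # Second layout: a plain <img> tag; get() returns None if src is missing.
    img = url_box.find('img')
    if img is not None:
        return img.get('src')

    return None


# Hypothetical sample inputs, taken from the two outputs shown in the question.
video_page = '''
<div class="player">
<video class="video-js vjs-fluid video-player" height="100%" id="some-player" poster="https://example.com/path/to/jpg/random.jpg" width="100%"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>
'''
image_page = '''
<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>
'''

print(extract_image_url(video_page))
print(extract_image_url(image_page))
```

Checking for None before calling get() avoids an AttributeError on pages where the player block or the expected tag is missing, which can matter when scraping many URLs in one batch.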
