Scrape dynamic contents using bs4

Question

I'm scraping some information from a mobile comparison website, but its content appears to be dynamic. I tried using Selenium to scrape the dynamic content, but that does not give me the expected output either.

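from bs4 import BeautifulSoup as bs
from selenium import webdriver

# Raw string, so single backslashes are enough
path = r'C:\Users\Goku\Downloads\Compressed\chromedriver'

# Selenium 3 style call; Selenium 4 takes service=Service(path) instead
driver = webdriver.Chrome(path)

driver.get('https://versus.com/en')

# Grab the HTML as the browser currently renders it
res = driver.execute_script("return document.documentElement.outerHTML")

soup = bs(res, 'lxml')
# The class string must stay on one line inside the quotes
box = soup.find('div', {'class': 'CarouList__carouList___2WspW CarouList__isLandingPage___rPe4J'})

print(box)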

For example, I want to scrape all the images and names inside the div.

Recommended answer

You can find the data in the HTML source, inside a <script> tag. Find that text, trim the string down to valid JSON, then read it in with json.loads(). From there you can look around the structure and pull out what you want. The image URLs are found there:

import requests
from bs4 import BeautifulSoup as soup
import json

my_url = 'https://versus.com/en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

# opening up connection, grabbing the page
response = requests.get(my_url, headers=headers)

# html parsing
page_soup = soup(response.text, "html.parser")

# find the <script> tag that assigns the page data to window.__data
jsonData = None
scripts = page_soup.find_all('script')
for script in scripts:
    if 'window.__data=' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('window.__data=')[-1]
        # drop surrounding whitespace and a trailing semicolon, if any
        jsonStr = jsonStr.strip().rstrip(';')

        jsonData = json.loads(jsonStr)
        break

if jsonData is None:
    raise ValueError('window.__data= was not found in any <script> tag')

# image paths in the data are relative to this host
root_url = 'https://versus.dadi.network'

phones = jsonData['landing']['trendings']['phone']['list']
for each in phones:
    popImage = root_url + each['popImage']
    rivalImage = root_url + each['rivalImage']

    print('%s\n%s' % (popImage, rivalImage))

Output:

https://versus.dadi.network/samsung-galaxy-a9-2018/front/front-1539337417084.variety.jpg
https://versus.dadi.network/samsung-galaxy-a50/front/front-1551183669492.variety.jpg
https://versus.dadi.network/samsung-galaxy-s10-plus/front/front-1550699605210.variety.jpg
https://versus.dadi.network/apple-iphone-xs-max/front/front-1536781345067.variety.jpg
https://versus.dadi.network/samsung-galaxy-a50/front/front-1551183669492.variety.jpg
https://versus.dadi.network/huawei-p30-lite/front/front-1555000229505.variety.jpg
https://versus.dadi.network/xiaomi-redmi-note-7/front/front-1550507767671.variety.jpg
https://versus.dadi.network/xiaomi-mi-8-lite/front/front-1537824165879.variety.jpg
https://versus.dadi.network/samsung-galaxy-s8/front/front-1490950798404.variety.jpg
https://versus.dadi.network/samsung-galaxy-a50/front/front-1551183669492.variety.jpg
https://versus.dadi.network/huawei-p20-lite/front/front-1521538430205.variety.jpg
https://versus.dadi.network/huawei-p-smart-2019/front/front-1547733931933.variety.jpg
https://versus.dadi.network/samsung-galaxy-a50/front/front-1551183669492.variety.jpg
https://versus.dadi.network/samsung-galaxy-a30/front/front-1551187893794.variety.jpg
https://versus.dadi.network/samsung-galaxy-m20/front/front-1550059143173.variety.jpg
https://versus.dadi.network/samsung-galaxy-a30/front/front-1551187893794.variety.jpg
https://versus.dadi.network/oneplus-6t/front/front-1540985964061.variety.jpg
https://versus.dadi.network/google-pixel-3/front/front-1539114763774.variety.jpg
https://versus.dadi.network/samsung-galaxy-a40/front/front-1555086727000.variety.jpg
https://versus.dadi.network/huawei-p20-lite/front/front-1521538430205.variety.jpg
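
The string handling is the fragile step in this approach: json.loads() raises an error if anything, such as a trailing semicolon, follows the JSON value. As a minimal sketch, assuming the page still uses the window.__data= marker, the standard library's json.JSONDecoder.raw_decode() can parse one JSON value and ignore whatever comes after it. The helper names parse_assigned_json and key_paths and the sample script text are hypothetical, made up for illustration. key_paths only lists the paths in the parsed data, which may help when looking for other fields such as the names.

```python
import json

MARKER = "window.__data="  # marker used in the answer's script search


def parse_assigned_json(script_text, marker=MARKER):
    """Return the JSON value assigned right after `marker`, or None if absent."""
    if marker not in script_text:
        return None
    # Keep only what follows the marker; raw_decode does not skip
    # leading whitespace, so strip it first.
    payload = script_text.split(marker, 1)[1].lstrip()
    # raw_decode parses a single JSON value and returns (value, end_index),
    # so a trailing ';' or further JavaScript does not cause an error.
    value, _end = json.JSONDecoder().raw_decode(payload)
    return value


def key_paths(obj, prefix=""):
    """Yield a dotted path for every leaf value in nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from key_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from key_paths(value, f"{prefix}[{index}]")
    else:
        yield prefix


# Hypothetical script text, shaped like the structure the answer reads from.
SAMPLE_SCRIPT = (
    'window.__data={"landing": {"trendings": {"phone": {"list": ['
    '{"popImage": "/a/front.jpg", "rivalImage": "/b/front.jpg"}'
    ']}}}};'
)

data = parse_assigned_json(SAMPLE_SCRIPT)
for path in key_paths(data):
    print(path)
```

In the loop over the script tags, parse_assigned_json(script.text) could stand in for the split and json.loads() lines; it returns None for scripts that do not contain the marker.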
