BeautifulSoup does not read 'full' HTML obtained by requests


Problem Description

I am trying to scrape URLs from a website rendered as HTML, using the BeautifulSoup and requests libraries. I am running both on Python 3.5. It seems I am successfully getting the HTML from requests, because when I display r.content, the full HTML of the website I am trying to scrape is shown. However, when I pass this to BeautifulSoup, BeautifulSoup drops the bulk of the HTML, including the URL I am trying to scrape.

from bs4 import BeautifulSoup
import requests

# requests needs the URL scheme (http:// or https://); a bare 'www.example.com'
# raises requests.exceptions.MissingSchema
page = requests.get('https://www.example.com')
soup = BeautifulSoup(page.content, 'html.parser')

print(soup.find_all('div'))

I have already tried other parsers like html5lib and lxml, without any success.

However, the output does not show all the 'div' elements that are actually in the website's HTML code.

I want to scrape the URL from 'h1.post-title'; a minimal sketch of the extraction I have in mind follows.
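This sketch assumes, hypothetically, that the post title link is an <a> element inside the h1.post-title heading; the real markup may differ:

from bs4 import BeautifulSoup

# Hypothetical markup: the post title link sits inside <h1 class="post-title">
html = '<h1 class="post-title"><a href="https://example.com/post">My Post</a></h1>'
soup = BeautifulSoup(html, 'html.parser')

# select_one takes a CSS selector, so 'h1.post-title a' picks the anchor inside the heading
link = soup.select_one('h1.post-title a')
print(link.get('href'))  # prints https://example.com/post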

Recommended Answer

This is because the page you're scraping is dynamic: its content is generated with JavaScript, and it takes some time to render fully (it is not present in the initial, static HTML).

You should use something like Selenium or Puppeteer to load the page, wait for it to fully render, and then scrape the content you need to extract.
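As a rough sketch of the Selenium route (assuming Chrome, a placeholder URL, and the same hypothetical h1.post-title > a markup as above), you could wait for the element to appear and then hand the rendered HTML back to BeautifulSoup:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument('--headless=new')  # run without opening a browser window (recent Chrome)
driver = webdriver.Chrome(options=options)

try:
    driver.get('https://www.example.com')  # placeholder URL

    # Wait (up to 10 seconds) until the JavaScript-rendered title element is in the DOM
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'h1.post-title'))
    )

    # Hand the fully rendered HTML to BeautifulSoup, then extract as before
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for link in soup.select('h1.post-title a'):
        print(link.get('href'))
finally:
    driver.quit()

Waiting with an explicit expected condition is generally more reliable than a fixed time.sleep(), since it returns as soon as the element shows up and fails with a clear timeout if it never does.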
