Why is the HTML returned by requests different from the real page HTML?


Question

I'm trying to scrape a web page to get some data to work with. One of the pages I want to scrape is https://www.etoro.com/people/sparkliang/portfolio. The problem appears when I fetch the page using:

import requests

h=requests.get('https://www.etoro.com/people/sparkliang/portfolio')
h.content

This gives me HTML that is completely different from what the browser shows: it contains many extra meta tags, and the text and elements I am searching for are missing.

For example, suppose I want to scrape:

<p ng-if=":: item.IsStock" class="i-portfolio-table-hat-fullname ng-binding ng-scope">Shopify Inc.</p>

I use code like this:

    from bs4 import BeautifulSoup

    import requests

    html_text = requests.get('https://www.etoro.com/people/sparkliang/portfolio').text
    print(html_text)

    soup = BeautifulSoup(html_text,'lxml')

    job = soup.find('p', class_='i-portfolio-table-hat-fullname ng-binding ng-scope').text
    

I expected this to return Shopify Inc., but it doesn't, because the HTML I get from the page with the requests library is completely different from the HTML I see in the browser.

I want to know how to get the original HTML of the web page. If you use Ctrl+F to search for a keyword such as Shopify Inc, it doesn't even appear in the HTML I get from the Python requests library.

Recommended answer

This happens because the page uses JavaScript to create its DOM elements dynamically. requests only downloads the initial HTML document and never runs that JavaScript, so the elements you see in the browser are not in the response. Instead, you should use selenium with a webdriver, which drives a real browser, and wait for the elements to be created before scraping.
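One way to confirm this diagnosis before reaching for a browser is to check whether the text you want appears in the raw response at all. The sketch below is only an illustration: the helper name `appears_in_static_html` and both sample HTML strings are made up for this example and are not the real eToro markup. With the real page you would pass `requests.get(url).text` as the first argument.

```python
def appears_in_static_html(html: str, needle: str) -> bool:
    """Return True if `needle` occurs anywhere in the raw HTML string (case-insensitive)."""
    return needle.lower() in html.lower()


# Hypothetical stand-in for what requests receives: an almost empty shell
# that only loads a script.
static_shell = (
    "<html><head><meta charset='utf-8'></head>"
    "<body><div id='app'></div><script src='app.js'></script></body></html>"
)

# Hypothetical stand-in for the DOM after the browser has run the JavaScript.
rendered_dom = (
    "<html><body>"
    '<p class="i-portfolio-table-hat-fullname ng-binding ng-scope">Shopify Inc.</p>'
    "</body></html>"
)

print(appears_in_static_html(static_shell, "Shopify Inc."))  # False
print(appears_in_static_html(rendered_dom, "Shopify Inc."))  # True
```

If the check fails on the raw response but the text is visible in the browser, the content is most likely added by JavaScript after the page loads.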

You can try downloading the ChromeDriver executable. If you put it in the same folder as your script, you can run:

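from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--window-size=1920,1080")  # Chrome expects width,height
chrome_options.add_argument("--headless")
# Path to the ChromeDriver executable. CHANGE THIS IF NOT SAME FOLDER
chrome_driver = os.path.join(os.getcwd(), "chromedriver.exe")
# Selenium 4 takes the driver path through a Service object
# (the executable_path argument was removed in later 4.x releases)
driver = webdriver.Chrome(service=Service(chrome_driver), options=chrome_options)

url = 'https://www.etoro.com/people/sparkliang/portfolio'
try:
    driver.get(url)
    # Wait up to 20 seconds for JavaScript to create the elements
    jobs = WebDriverWait(driver, 20).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p.i-portfolio-table-hat-fullname'))
    )
    html_text = driver.page_source  # rendered HTML, read only after the wait
    for job in jobs:
        print(job.text)
finally:
    driver.quit()  # always close the browser, even if the wait times out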

Here we use selenium with WebDriverWait and EC (expected_conditions) to ensure that the elements exist before we try to scrape the information we're looking for. The script prints:

Facebook
Apple
Walt Disney
Alibaba
JD.com
Mastercard
...
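To make the waiting step less of a black box, here is a simplified pure-Python sketch of the idea behind it: call a condition repeatedly until it returns something truthy or a timeout expires. This is not Selenium's implementation. The names `wait_until` and `fake_find_elements` are hypothetical, and the fake page simply "finishes loading" on the third check so the example runs without a browser.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 20.0, poll: float = 0.5) -> T:
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # pause before checking again


# Hypothetical stand-in for a page whose elements only exist from the third check on.
attempts = {"count": 0}


def fake_find_elements() -> List[str]:
    attempts["count"] += 1
    return ["Shopify Inc."] if attempts["count"] >= 3 else []


names = wait_until(fake_find_elements, timeout=2.0, poll=0.01)
print(names)  # ['Shopify Inc.']
```

The real WebDriverWait follows the same pattern, but it raises selenium's own TimeoutException when the time runs out and, by default, treats a "no such element" error during a check as "not ready yet" rather than as a failure.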
