Can't scrape product title from a webpage

Problem description

I'm trying to scrape the title of the product available on this webpage using the requests module, but the script always throws AttributeError even though the product title is in the page source (Ctrl + U).

I've tried with (throws AttributeError):

import requests
from bs4 import BeautifulSoup

link = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}

res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text,"lxml")
try:
    product_title = soup.select_one("h1 > span").get_text(strip=True)
except AttributeError: product_title = ""
print(product_title)

Expected output:

Gigabyte GeForce RTX 3070 Aorus Master 8GB OC GPU

How can I scrape the product title from that webpage?

PS I've tried with this library cloudscraper as well, but no luck.

EDIT:

Running the following piece of code raises requests.exceptions.HTTPError: 403 Client Error: Forbidden for url (from raise HTTPError(http_error_msg, response=self)):

import cfscrape

url = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}

token, agent = cfscrape.get_tokens(url, headers=headers)
print(token, agent)

I know I could use the value of cf_clearance within the cookies to access the page content, if I could get the value of token from the attempt above.

Solution

This is only a placeholder for research that might be useful to others looking at this Cloudflare bypass issue.

Use Case


Scraping information from a website that is using either Cloudflare CAPTCHA or Javascript challenge for enhanced protection.

Python Requests


With a standard Python requests.get call, the Cloudflare service returns a 403 Forbidden status code.

import requests

URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
      '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

response = requests.get(URL, headers=headers)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')
# output
Status Code: 403
Status Code Reason: Forbidden

If we look at the response.headers we can see that a Cloudflare server is proxying our request to the target URL.

...continued from the code above
for key, value in response.headers.items():
    print(f'KEY NAME: {key}')
    print(f'KEY VALUE: {value}')
    print('-----------------------')
    # output 
    KEY NAME: Date
    KEY VALUE: Sun, 13 Jun 2021 16:39:03 GMT
    -----------------------
    KEY NAME: Content-Type
    KEY VALUE: text/html; charset=UTF-8
    -----------------------
    KEY NAME: Transfer-Encoding
    KEY VALUE: chunked
    -----------------------
    KEY NAME: Connection
    KEY VALUE: close
    -----------------------
    KEY NAME: Permissions-Policy
    KEY VALUE: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
    -----------------------
    KEY NAME: Cache-Control
    KEY VALUE: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    -----------------------
    KEY NAME: Expires
    KEY VALUE: Thu, 01 Jan 1970 00:00:01 GMT
    -----------------------
    KEY NAME: X-Frame-Options
    KEY VALUE: SAMEORIGIN
    -----------------------
    KEY NAME: cf-request-id
    KEY VALUE: 0aa7d6c7c4000007ff7201b000000001
    -----------------------
    KEY NAME: Expect-CT
    KEY VALUE: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    -----------------------
    KEY NAME: Set-Cookie
    KEY VALUE: __cf_bm=72427e2af66c7177feeb88a847fae9c26b66c681-1623602343-1800-AZAmqDfaHZU8IXOH/i3BBVf8pGcws0Gc1Tln5yKUepe3utWlCpagxvALDW6wiHd2pli9Zl45Mg8gC/QSoUFhoes=; path=/; expires=Sun, 13-Jun-21 17:09:03 GMT; domain=.cclonline.com; HttpOnly; Secure; SameSite=None
    -----------------------
    KEY NAME: Vary
    KEY VALUE: Accept-Encoding
    -----------------------
    KEY NAME: Server
    KEY VALUE: cloudflare
    -----------------------
    KEY NAME: CF-RAY
    KEY VALUE: 65ecc0b9383b07ff-ATL
    -----------------------
    KEY NAME: Content-Encoding
    KEY VALUE: gzip
    -----------------------

If we look at the response.text associated with the Python Requests we can see other evidence related to the Cloudflare protection.

...continued from the code above
print(response.text)
# output

truncated...

<title>Please Wait... | Cloudflare</title>
<meta name="captcha-bypass" id="captcha-bypass" />

truncated...

<form class="challenge-form managed-form" id="challenge-form" action="/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/?__cf_chl_managed_tk__=7d4597196bb14948881846ca16631b64c55f06d3-1623602854-0-AcX2yHJM2sCalL03Opq9RiFjASeYE0Xs0KG4XeG1lezzhzEyu-bL8xsdHuEjNIIKaJkWEmha4DhViRlqWEP_HREOdA8YAY7nnNkBAHbNMs6p_AWgYNLPnSNM13PO2I96hdABtoaaKjOzV4AyJQJ8f08XEW2flN97rPxIMeiR0tI1a3PiON2dN9E_YCyneAuCUfaYWUNGL0Bqd_rkYp3Ljb2zk_kGWizckr1fvhodSEjEB-ByYVK8ODNox2oZ4XPcmCYJ6UNDmbNc406BjMeTf3e72Z7vgdnt3V714VrGN4w_Y4VQ2X1V0OVKUKEH9B5Rxa_4fEZiMAAdxZ6idg69JYMKftuuLemr53n5WAwTwyX2G7N9jmjtarxEQcCqoj9oY7oSFwQTb3ZVb9i5EeavKaE1_67wxpyPybNidBDxhLazDEMefPZGDsV9mSziuIQ90nS5vn-7sUvC8BJATNWPbh6OduchXy-QcMeYhurtukUCm3oDQMP7r4g4qvDCWI3_-ku7u-B4G2XI2kwM_tLVEZiH5uHPjWpHE6eFWohiCTxd4p7vHg7z5ug9feRalYqu3GfInd82GZ-j-7nCqLDmPh2Sjlu6sJGfopqM3XlBrd1kgRZU3Z4uw6JIIqfH0M6K3_weTtem0-Z1zhDUBbVDvgJVeHNNh_bTxHGWbFB0f80tALBMbt67RftO5u1XBUZ-TRftteXBwJ8gmYzOZTo4lQOGQ_771urYXsTuW_sp8PwxvQpEyCnY8zD8dmVz0-waZhOet8MQMwduN2nfGUOrCMwUYO9McsBqzfsT5PJZVkDm-rYBBwqw0PIwvm1-N8ymAjrpSN6ps4FerqK1uQOo77FLiOq8JCOVqdETIZ9NO07A" method="POST" enctype="application/x-www-form-urlencoded">

truncated...

 <input type="hidden" name="r" value="d5db3eb87c9b42ec7f076916611c296abfd2c842-1623602854-0-AXz7+uyFGbpY1aOLgfZMm0oIiiepEo5I5QmdTnvMmL9fDUc4OMEa2CNYXsbHVjOzdYO+PqegjpNL8R3D9LhDc+Xo0y0ira1zO7foozPj0qdcUpNNr2ZOHqgUyKws6dVgeBNUdF+v9+eNFxSHxOhc4DWDLIw9guBqJg1GaBjG3QCQdZmyFbPxXUQtXTFmtVVuqch9qBFLa/u9deMBCxCWi5fyKoOINtyBtyT4p79ITb9T+6T7fl2epMXNHO6xBW2dPnDP1FmjUQ04CG3ydOaDS5qoSFMPr4InVbMcI2NbQYJYPfWjmncMaga6K+NMNvv8wtiyXpEeWsUgFFeQoDJEuvLI+wkI8mT+vXAnXd8LWy9TpEDVK6uxtLF2C75aU7qJxI9RKANGluWYUXeqE1tXgppgZraIGfRWNPVsQZzqd6SK+Zsg8x8UH7oRRD9blMMPMaekcFQ3zT8QQ5BzEc8wEQ68OhmKbFuAeV/YhhWshpm808gcVHIFH17I+0MEidfV/ny5wBSRZJyQUfOSU9iAv/minNWF6ZA21E/+Zebda2lVF6gyEHgrjecxuOxzY2I2qMm0RCEHO4oSk/X8EtMYirGCQ3FD8PzSvZYx+34QZutXFLVvqT3CR/UcsXybG6wllvIGvZ6j/gdoAwfcS27MyO4mXDMk6TfDqdi+NqlItwgWNdp461RQmPdChRp9kKEy3sTsIAGW9Ky1k/xYYcTvLDpCGFICBEm2JhDyp/FEF9UBYia7XJ4aUEncSUeViqaQ8bXpPk6kEPH5RYEcfaX3he0W5aZHHIGcjgOFZsuu45MWREvbHjO+RcPMib4L+lU1cKQoYx+w5b9e4AJiRnGog3a6E3i/L75bSnk7L3qA+DofeeccI/RPitqDb/lX31fkhwHfdRWoLt+OILsUfHNni/olGABEUDruwDVpR32xlieS7vekdmQL3oOu5BkAOXoObbb+2nzo6Dvgw7M7rb4muC7US4yCTK0BeGSfu2XvFta228IoGIGa8BjUcb09K6nRdWUwrCXLYS+vIJTegKMeyxlMKNXw7vIaPh9vht4zblhN0bqkN/m/opyXEtzLfhsLuEkHdQ0GhTUk2nYgHeKX0j6eW0uQhAD/9TLf6UgILCk0+nQvXfEffQCCe/hEfBfkAgiPhr1E3uyPB4vp6Fpy2nnkkzmGv/3P5wg6afKDmU2Ic32u3U47hOlghnc7NlbzFb5R8Tx6vWrkXMDYHdOaaudLtPp5N9y1ceXXaMNAFMVmoqaiHWuV4KN+2rLolSOGUEFNEoRN6Jw9mlq/zniK23gQ2lSy+wIHPRGvRCxhRr5DeskvLgyviAk7IhLH3zMpqxd7i05BIPV3sB8orBzVE4Rqmam3evpTVEMMFRDt/Ol6XUJi66QrLgJyusuv5xL4pKPWZrw/hn3a5j0zrrChUbvM3S94BeWiJS48hA35S9mXLfaKMAZTYZTMqhbW77qwUuquwW2lPEAgSPY7WvvnNRUPXsS1KCPpiuE0TuDFaZQi9UTqlzkQIq84wqVRjQZ0Y0m3PQeI2BbJZ8woKIKiABWbSOuV/kyy5H4L+RVL7Jmc2ndl3HaQ4XlnwDmTuK/gMbRvZe1taVHOyYsXmfEY4XkiaDUneGjBEGnWyiv49DtiG2TLmmIpP1UITmO677eDSoNLHpxp1guMjwL5m3XHKOFNtpLzuiVH4UJdgTjtnmbGHmKGtyy0k3GPZrwyVkZRyS+FZZ5WhTs05rhS+1sg3oDCyTbWeYX9T4VVswRjxq1HsyH8NdZTN4f9BTn9VU0+9JnVAkgLM4JCkV6wqwQf+QMK/MaYWvBwSjYgFUxdEdT7Rls85/M+4GxcaGsiNmsA5Q==">
  <input type="hidden" name="cf_captcha_kind" value="h">
  <input type="hidden" name="vc" value="4845a44c225a1fa6a61708e11b613971">

truncated...

 <script type="text/javascript">
    //<![CDATA[
    (function(){
        var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
        var trkjs = isIE ? new Image() : document.createElement('img');
        trkjs.setAttribute("src", "/cdn-cgi/images/trace/managed/js/transparent.gif?ray=65eccd326d61f331");
        trkjs.id = "trk_managed_js";
        trkjs.setAttribute("alt", "");
        document.body.appendChild(trkjs);
        var cpo=document.createElement('script');
        cpo.type='text/javascript';
        cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1?ray=65eccd326d61f331";
        document.getElementsByTagName('head')[0].appendChild(cpo);
    }());
    //]]>
    </script>
  

The information above shows that the request sent with Python Requests to the target URL was intercepted by a Cloudflare server, which is challenging the request. This challenge has to be passed before the initial request is allowed to continue.
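
A minimal sketch of how the signals above could be checked in code before any parsing is attempted. It assumes only the markers seen in this particular response (a 403 status, a Server header of cloudflare, a CF-RAY header and a challenge-form element in the body). The function name looks_like_cloudflare_challenge is hypothetical, and other Cloudflare configurations may present different markers.

```python
def looks_like_cloudflare_challenge(status_code, headers, body):
    """Heuristic check based only on the markers seen in the response above."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    served_by_cloudflare = lowered.get("server", "").lower().startswith("cloudflare")
    has_ray_id = "cf-ray" in lowered
    has_challenge_form = 'id="challenge-form"' in body
    return (
        status_code == 403
        and served_by_cloudflare
        and has_ray_id
        and has_challenge_form
    )


# Sample values copied from the headers and body shown above.
sample_headers = {"Server": "cloudflare", "CF-RAY": "65ecc0b9383b07ff-ATL"}
sample_body = '<form class="challenge-form managed-form" id="challenge-form" method="POST">'
print(looks_like_cloudflare_challenge(403, sample_headers, sample_body))  # True
```

With Python Requests this could be called as looks_like_cloudflare_challenge(response.status_code, response.headers, response.text), which would explain an empty select_one result without having to read the page source.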

cfscrape Package


The OP stated that they attempted to use the cfscrape Python Package to obtain token information from the Cloudflare server.

A standard cfscrape request produces the same response as Python Requests.

import cfscrape

URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
      '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

scraper = cfscrape.create_scraper(delay=10)
response = scraper.get(URL, headers=headers)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')
# output
Status Code: 403
Status Code Reason: Forbidden

The cfscrape package also supports the functions get_tokens and get_cookie_string, but both of these produce the 403 Forbidden error code.

From cfscrape source code:

def is_cloudflare_captcha_challenge(resp):
        return (
            resp.status_code == 403
            and resp.headers.get("Server", "").startswith("cloudflare")
            and b"/cdn-cgi/l/chk_captcha" in resp.content
        )


# the function above is called from this

def request(self, method, url, *args, **kwargs):
        resp = super(CloudflareScraper, self).request(method, url, *args, **kwargs)

        # Check if Cloudflare captcha challenge is presented
        if self.is_cloudflare_captcha_challenge(resp):
            self.handle_captcha_challenge(resp, url)

        # Check if Cloudflare anti-bot "I'm Under Attack Mode" is enabled
        if self.is_cloudflare_iuam_challenge(resp):
            resp = self.solve_cf_challenge(resp, **kwargs)

        return resp

The handle_captcha_challenge function is what handles the Cloudflare captcha challenge, and this section of the code appears to be where the request fails. It's unclear which part of that section fails, so additional research and testing are required.
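
One way to narrow this down, sketched under assumptions: re-run the captcha check quoted above against a saved response and see whether its marker is present. The helper below is a standalone copy of that condition (the name matches_captcha_check is hypothetical). The first sample is only the script path from the truncated response body shown earlier, so the result says nothing certain about the full page.

```python
CAPTCHA_MARKER = b"/cdn-cgi/l/chk_captcha"  # marker used in the quoted cfscrape check


def matches_captcha_check(status_code, server_header, content):
    """Standalone copy of the quoted is_cloudflare_captcha_challenge condition."""
    return (
        status_code == 403
        and server_header.startswith("cloudflare")
        and CAPTCHA_MARKER in content
    )


# Fragment taken from the truncated response body shown earlier (not the full page).
fragment = b'cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1?ray=65eccd326d61f331";'
# Hypothetical body that does contain the marker, for comparison.
with_marker = b'<form action="/cdn-cgi/l/chk_captcha">'

print(matches_captcha_check(403, "cloudflare", fragment))     # False
print(matches_captcha_check(403, "cloudflare", with_marker))  # True
```

If the check is also False for the complete body, the quoted request method would skip handle_captcha_challenge and return the 403 response unchanged. That would match the behaviour observed, but it remains a hypothesis until tested against the full page.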

PLEASE NOTE: According to the package's developer the module is no longer supported.

cloudscraper Package


The OP also stated that they attempted to use the cloudscraper Python Package to obtain token information from the Cloudflare server. It is worth noting that cloudscraper was forked from cfscrape, so the syntax is similar.

cloudscraper gets the same 403 Forbidden error code as cfscrape.

import cloudscraper

URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
      '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

scraper = cloudscraper.create_scraper()
response = scraper.get(URL)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')
# output
Status Code: 403
Status Code Reason: Forbidden

The cloudscraper package also supports the functions get_tokens and get_cookie_string, but both of these produce the 403 Forbidden error code.

selenium Package


The OP also stated that they attempted to use the selenium Python package.

SPECIAL NOTE: During my testing I used selenium with webdrivers for Google Chrome, Mozilla Firefox and Microsoft Edge.

Within the last 12 months these Options could be used in selenium to bypass Cloudflare protection. Unfortunately, these Options do not work today:

chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
# additional disable-blink-features are available in Chromium source code on Github
chrome_options.add_argument("--disable-blink-features=AutomationControlled")

Below is a selenium code example using the Chrome webdriver with the switches above.

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options)
URL = "https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934"

driver.get(URL)

The code above opens a browser session, which is confronted with a Cloudflare Javascript challenge. During testing with the switches mentioned above, this challenge never completed. The Cloudflare Ray ID, which is unique per request, rotated many times before I manually terminated the session.

seleniumwire is required to obtain the status code

Below is a headless mode Chrome webdriver session, which also shows the 403 Forbidden error code for the target URL. The session also shows that hcaptcha.com anti-bot technology is now in the mix.

from seleniumwire import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--headless")
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options)
URL = "https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934"
driver.get(URL)

for request in driver.requests:
    print(f'Status Code: {request.response}')
    print(f'Host Name: {request.host}')
    # output 
    Status Code: 403 
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200 
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200 
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200 
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200 
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200 
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200 
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 200 
    Host Name: www.cclonline.com
    -----------------------
    Status Code: 302 
    Host Name: hcaptcha.com
    -----------------------
    Status Code: 200 
    Host Name: newassets.hcaptcha.com
    -----------------------
driver.quit()
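
A short sketch of how the captured requests could be summarised per host, which makes third-party hosts such as hcaptcha.com easier to spot. It assumes the seleniumwire request objects expose host and response.status_code, and that response is None when no response was captured. The function name status_codes_by_host is hypothetical, and the sample uses stand-in objects rather than a live driver.

```python
from collections import defaultdict
from types import SimpleNamespace


def status_codes_by_host(captured_requests):
    """Group response status codes by host, skipping requests with no response."""
    grouped = defaultdict(list)
    for request in captured_requests:
        if request.response is None:  # no response was captured for this request
            continue
        grouped[request.host].append(request.response.status_code)
    return dict(grouped)


# Stand-in objects mimicking the host/response attributes used above.
sample = [
    SimpleNamespace(host="www.cclonline.com", response=SimpleNamespace(status_code=403)),
    SimpleNamespace(host="www.cclonline.com", response=SimpleNamespace(status_code=200)),
    SimpleNamespace(host="hcaptcha.com", response=SimpleNamespace(status_code=302)),
    SimpleNamespace(host="newassets.hcaptcha.com", response=None),
]
print(status_codes_by_host(sample))
```

In a real session this would be called as status_codes_by_host(driver.requests) before driver.quit().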

A standard Chrome webdriver session using the UI shows an iFrame with an "I am human" checkbox.

If I click the button manually or with a selenium session, I'm prompted with a picture captcha, which increases the complexity of bypassing the Cloudflare protection.

cf_clearance cookie


When a Cloudflare CAPTCHA or Javascript challenge is solved, a cf_clearance cookie is set in the client browser. The cf_clearance cookie has a default lifetime of 30 minutes, but this is configurable by the Cloudflare customer.

If you open the OP's target URL manually in a Google Chrome browser, you can see the cf_clearance cookie using Developer Tools.

It seems that the cf_clearance cookie lifetime is set to 60 minutes, based on the UTC time this session started and the expiration date set for the cookie.
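
The same kind of lifetime calculation can be sketched in Python. Since the cf_clearance values are not reproduced here, the example uses the Date header and the __cf_bm cookie expiry from the response headers shown earlier as stand-in timestamps. The helper name parse_gmt is hypothetical.

```python
from datetime import datetime, timezone


def parse_gmt(value, fmt):
    """Parse a GMT timestamp string into a timezone-aware UTC datetime."""
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)


# Values copied from the Date and Set-Cookie (__cf_bm) headers shown earlier.
session_start = parse_gmt("Sun, 13 Jun 2021 16:39:03 GMT", "%a, %d %b %Y %H:%M:%S GMT")
cookie_expires = parse_gmt("Sun, 13-Jun-21 17:09:03 GMT", "%a, %d-%b-%y %H:%M:%S GMT")

lifetime = cookie_expires - session_start
print(f"Cookie lifetime: {lifetime.total_seconds() / 60:.0f} minutes")
# prints "Cookie lifetime: 30 minutes"
```

Substituting the session start time and the cf_clearance expiry read from Developer Tools gives the lifetime for that cookie.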

So far I haven't found a way to extract this cookie using Python.
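
If the cookie value is copied manually from Developer Tools, the reuse the OP describes could be sketched as below. This is untested against the target site. It assumes that the clearance is only honoured when the request carries the same User-Agent as the browser that solved the challenge, and both values are placeholders. The helper name build_request_kwargs is hypothetical.

```python
def build_request_kwargs(cf_clearance, user_agent):
    """Return headers and cookies to pass to requests.get(url, **kwargs)."""
    return {
        "headers": {"User-Agent": user_agent},
        "cookies": {"cf_clearance": cf_clearance},
    }


# Placeholder values: copy both from the browser session that solved the challenge.
kwargs = build_request_kwargs(
    cf_clearance="PASTE_COOKIE_VALUE_HERE",
    user_agent="PASTE_BROWSER_USER_AGENT_HERE",
)
print(kwargs)
# Usage (untested against this site): requests.get(URL, **kwargs)
```

The cookie would need to be refreshed by hand once it expires, so this is a stopgap for experimentation rather than an automated solution.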
