Scraping elements rendered using React JS with BeautifulSoup

Problem description

I want to scrape anchor links with class="_1UoZlX" from the search results from this particular page - https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4io

When I created a soup from the page I realised that the search results are being rendered using React JS and hence I can't find them in the page source (or in the soup).

Here is my code:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


listUrls = ['https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4iof']
PHANTOMJS_PATH = './phantomjs'
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
urls=[]

for url in listUrls:
    browser.get(url)
    wait = WebDriverWait(browser, 20)
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "_1UoZlX")))
    soup = BeautifulSoup(browser.page_source,"html.parser")
    results = soup.findAll('a',{'class':"_1UoZlX"})
    for result in results:
        link = result["href"]
        print link
        urls.append(link)
    print urls

This is the error I get:

Traceback (most recent call last):
  File "fetch_urls.py", line 19, in <module>
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "_1UoZlX")))
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: 
Screenshot: available via screen

Someone mentioned in this answer that there is a way to use Selenium to process the JavaScript on a page. Can someone elaborate on that? I did some Googling but couldn't find an approach that works for this particular case.

Recommended answer

There is no problem with your code as such, but the website you are scraping never stops loading for some reason, which prevents the page from being parsed and the rest of your code from running.
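
Separately, one detail in the wait call is worth double-checking: By.CSS_SELECTOR expects a full CSS selector, so matching by class needs a leading dot, and the bare string _1UoZlX would be read as a tag name that never appears on the page, which on its own would make the wait time out. A minimal sketch of the corrected call, keeping your class name:

wait = WebDriverWait(browser, 20)
# a CSS class selector needs a leading dot; without it the string is
# treated as a tag name and no element ever matches
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "._1UoZlX")))
# equivalently, By.CLASS_NAME takes the bare class name:
# wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "_1UoZlX")))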

To confirm that the code pattern itself works, I tried it against Wikipedia:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

listUrls = ["https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"]
# browser = webdriver.PhantomJS('/usr/local/bin/phantomjs')
browser = webdriver.Chrome("./chromedriver")
urls=[]

for url in listUrls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source,"html.parser")
    results = soup.findAll('a',{'class':"mw-redirect"})
    for result in results:
        link = result["href"]
        urls.append(link)
    print urls

Output:

[u'/wiki/List_of_states_and_territories_of_India_by_area', u'/wiki/List_of_Indian_states_by_GDP_per_capita', u'/wiki/Constitutional_republic', u'/wiki/States_and_territories_of_India', u'/wiki/National_Capital_Territory_of_Delhi', u'/wiki/States_Reorganisation_Act', u'/wiki/High_Courts_of_India', u'/wiki/Delhi_NCT', u'/wiki/Bengaluru', u'/wiki/Madras', u'/wiki/Andhra_Pradesh_Capital_City', u'/wiki/States_and_territories_of_India', u'/wiki/Jammu_(city)']
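
Note that the href values in that output are relative paths. If absolute URLs are needed, a small follow-up sketch (using only the standard library and the browser object from above) would be to join each one against the current page URL:

from urlparse import urljoin  # Python 2; in Python 3 this lives in urllib.parse

for result in results:
    # resolve the relative href against the URL the browser is on
    link = urljoin(browser.current_url, result["href"])
    urls.append(link)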

P.S. I'm using ChromeDriver in order to run the script against a real Chrome browser for debugging purposes. Download ChromeDriver from https://chromedriver.storage.googleapis.com/index.html?path=2.27/
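
If a visible browser window is unwanted, Chrome itself can also run headless instead of falling back to PhantomJS. A minimal sketch, assuming Chrome 59+ and the same ./chromedriver path as above:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without opening a window
# newer Selenium releases take options= instead of chrome_options=
browser = webdriver.Chrome("./chromedriver", chrome_options=options)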
