Scrapy/Splash: Click on a button, then get content from the new page in a new window


Question

I'm facing a problem: when I click on a button, JavaScript handles the action and then redirects to a new page in a new window (similar to clicking an <a> with target="_blank"). In Scrapy/Splash I don't know how to get content from the new page (that is, I don't know how to control that new page).

Can anyone help?

script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(0.5)
        local element = splash:select('div.result-content-columns div.result-title')
        local bounds = element:bounds()
        element:mouse_click{x=bounds.width/2, y=bounds.height/2}
        return splash:html()
    end
"""

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': self.script})

Recommended answer

Issue:

The problem is that you can't scrape HTML that is outside your selection scope. When a link is clicked and the new content involves an iframe, that content is rarely brought into scope for scraping.

Choose a method of selecting the new iframe, and then proceed to parse the new HTML.

(This is an adaptation of Mikhail Korobov's solution from this answer.)

If you are able to get the src link of the new page that pops up, that may be the most reliable approach. However, you can also try selecting the iframe this way:

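import parsel
from scrapy_splash import SplashRequest

# ... (inside your spider's start_requests)
    yield SplashRequest(url, self.parse_result, endpoint='render.json',
                        args={'html': 1, 'iframes': 1})

def parse_result(self, response):
    # With 'iframes': 1, the JSON response carries the child frames under 'childFrames'
    iframe_html = response.data['childFrames'][0]['html']
    sel = parsel.Selector(iframe_html)
    item = {
        'my_field': sel.xpath(...),
        # ...
    }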
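
If the src of the frame is present in the rendered HTML, one option is to pull it out and request that URL directly. The following is a minimal sketch that uses only the standard library so it runs on its own. The helper name `iframe_srcs`, the sample HTML, and the URLs are hypothetical; inside a spider you would more likely use the response's own selectors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every <iframe> tag it sees."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def iframe_srcs(html, base_url):
    """Return absolute URLs for all iframes found in html (hypothetical helper)."""
    collector = _IframeSrcCollector()
    collector.feed(html)
    # Resolve relative src values against the page URL.
    return [urljoin(base_url, src) for src in collector.srcs]


# Hypothetical page and base URL, for illustration only.
sample_html = '<div><iframe src="/results/frame?id=1"></iframe></div>'
print(iframe_srcs(sample_html, "https://example.com/search"))
```

Each returned URL could then be passed to another SplashRequest (or a plain Scrapy request) with its own callback.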

The Selenium method

(Requires pip install selenium bs4, and possibly a ChromeDriver download for your OS: Selenium Chromedrivers.) Supports JavaScript parsing! Woohoo!

The following code switches scope to the new frame:

# Goes at the top
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Your paths depend on where Chrome is installed and where you put chromedriver.exe
# (raw string so the backslashes are kept literally)
CHROME_PATH = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
CHROMEDRIVER_PATH = 'chromedriver.exe'
WINDOW_SIZE = "1920,1080"

chrome_options = Options()
chrome_options.add_argument("--log-level=3")
chrome_options.add_argument("--headless")  # Speeds things up if you don't need a GUI
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)

chrome_options.binary_location = CHROME_PATH

# Selenium 4 takes the driver path via Service and the options via `options=`
browser = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH), options=chrome_options)

url = "http://example_js_site.com"  # Your site goes here (include the scheme)
browser.get(url)
time.sleep(3)  # An unsophisticated way to wait for the new page to load.
browser.switch_to.frame(0)  # Switch to the first frame on the page

# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(browser.page_source.encode('utf-8').strip(), 'lxml')

# This will return any content found in tags called '<table>'
table = soup.find_all('table')

browser.quit()  # Close the browser when you are done
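
If the click opens a genuinely new window or tab rather than an iframe, Selenium exposes the open windows through `window_handles` and lets you move between them with `switch_to.window`. Below is a rough sketch, assuming `driver` is a Selenium WebDriver such as `browser` above. The helper name `switch_to_new_window` and its timeout values are hypothetical.

```python
import time


def switch_to_new_window(driver, known_handles, timeout=10.0, poll=0.25):
    """Wait until a window handle not in known_handles appears, then switch to it.

    Returns the new handle, or raises TimeoutError if none shows up in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Handles that were not open before the click belong to the new window.
        new_handles = [h for h in driver.window_handles if h not in known_handles]
        if new_handles:
            driver.switch_to.window(new_handles[0])
            return new_handles[0]
        if time.monotonic() >= deadline:
            raise TimeoutError("no new window appeared")
        time.sleep(poll)
```

A typical use would be to record `before = set(browser.window_handles)` ahead of the click, perform the click, call `switch_to_new_window(browser, before)`, and then read `browser.page_source` as usual.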

My favorite of the two options is Selenium, but try the first solution if you are more comfortable with it!
