Issue scraping JavaScript-generated content with Selenium and Python


Question

I'm trying to scrape real estate data off of this website: example. As you can see, the relevant content is placed inside article tags.

I'm running Selenium with PhantomJS:

driver = webdriver.PhantomJS(executable_path=PJSpath)

Then I generate the URL in Python. All search parameters are part of the link, so I can build the search I'm looking for in the program without needing to fill out the form.
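For illustration, here is a minimal sketch of how such a search URL could be assembled with the standard library. The query keys are copied from the URL that appears in the answer further down; the Python argument names (`postcode`, `price_min`, and so on) are only guesses at what those keys mean, and `build_search_url` is a hypothetical helper, not the asker's code.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.seloger.com/list.htm"


def build_search_url(postcode, price_min, price_max,
                     surface_min, surface_max, property_types):
    """Assemble the search URL; argument names are guesses at the key meanings."""
    # A list of pairs (rather than a dict) keeps the key order and lets
    # the same key, idtypebien, appear several times.
    params = [
        ("cp", postcode),
        ("org", "advanced_search"),
        ("idtt", 2),
        ("pxmin", price_min),
        ("pxmax", price_max),
        ("surfacemin", surface_min),
        ("surfacemax", surface_max),
    ]
    params += [("idtypebien", type_id) for type_id in property_types]
    return BASE_URL + "?" + urlencode(params)


engine_link = build_search_url(40250, 50000, 200000, 20, 100, [2, 1, 11])
print(engine_link)
```

Using `urlencode` instead of string concatenation also takes care of escaping any value that contains spaces or other reserved characters.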

Before calling

driver.get(engine_link)

I copy engine_link to the clipboard, and it opens fine in Chrome. Next, I wait for all possible redirects to happen:

def wait_for_redirect(wdriver):
    elem = wdriver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 5:
            print("Waited for redirect for 5 seconds!")
            return
        time.sleep(1)
        try:
            elem = wdriver.find_element_by_tag_name("html")
        except StaleElementReferenceException:
            return
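One thing to note about the helper above: it looks up a fresh html element on every pass, and a fresh lookup does not raise StaleElementReferenceException, so the function most likely just sleeps for the full five seconds. A staleness check normally keeps the original reference and touches it again. The sketch below shows that idea in plain Python; `wait_until_stale` and `is_stale` are hypothetical names, and the exception type is passed in so the snippet runs without Selenium installed.

```python
import time


def is_stale(element, stale_exceptions=(Exception,)):
    """Return True if touching `element` raises one of `stale_exceptions`."""
    try:
        element.is_enabled()  # any call that talks to the browser would do
        return False
    except stale_exceptions:
        return True


def wait_until_stale(old_element, timeout=5.0, poll=0.5,
                     stale_exceptions=(Exception,)):
    """Poll the ORIGINAL element until it goes stale.

    Returns True if the page was replaced (redirect), False on timeout.
    """
    deadline = time.monotonic() + timeout
    while not is_stale(old_element, stale_exceptions):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True


# Assumed usage with Selenium (not executed here):
#   old_html = driver.find_element_by_tag_name("html")  # grab once, after get()
#   wait_until_stale(old_html,
#                    stale_exceptions=(StaleElementReferenceException,))
```

Selenium's own `expected_conditions.staleness_of` performs the same kind of check and can be combined with `WebDriverWait` instead of a hand-written loop.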

Now, at last, I want to iterate over all <article> tags on the current page:

for article in driver.find_elements_by_tag_name("article"):

But this loop never yields anything. The program doesn't find any article tags; I've also tried XPath and CSS selectors. Moreover, the articles are enclosed in a section tag, which can't be found either.

Is there a problem with this specific type of tag in Selenium, or am I missing something JS-related here? At the bottom of the page there are JavaScript templates whose naming suggests that they generate the search results.

Any help is appreciated!

Recommended answer

Pretend not to be PhantomJS and add an explicit wait (worked for me):

from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# set a custom user-agent
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36"
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = user_agent

driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get("http://www.seloger.com/list.htm?cp=40250&org=advanced_search&idtt=2&pxmin=50000&pxmax=200000&surfacemin=20&surfacemax=100&idtypebien=2&idtypebien=1&idtypebien=11")

# wait for articles to be present
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.TAG_NAME, "article")))

# get articles
for article in driver.find_elements_by_tag_name("article"):
    print(article.text)
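A note for anyone trying this on a current setup: as far as I know, Selenium 4 releases no longer ship `webdriver.PhantomJS` or the `find_elements_by_tag_name` helpers, and elements are located with `driver.find_elements(by, value)` instead. The sketch below shows the polling idea behind the explicit wait using that call style. `collect_article_texts` is a hypothetical helper; it accepts any driver-like object, and the literal "tag name" is the string value of Selenium's `By.TAG_NAME`.

```python
import time


def collect_article_texts(driver, timeout=10.0, poll=0.5):
    """Poll for <article> elements until some exist or `timeout` seconds pass.

    Returns the text of each article, or an empty list on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "tag name" is the string value of selenium's By.TAG_NAME
        articles = driver.find_elements("tag name", "article")
        if articles:
            return [article.text for article in articles]
        if time.monotonic() >= deadline:
            return []
        time.sleep(poll)


# Assumed usage with a real browser session (not executed here):
#   for text in collect_article_texts(driver):
#       print(text)
```

In real code, `WebDriverWait` with `EC.presence_of_element_located`, as in the answer, is the more idiomatic way to do this; the loop is only meant to show why waiting helps when the articles are rendered by JavaScript after the initial page load.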
