Issue scraping JavaScript-generated content with Selenium and Python

Views: 116 Posted: 2019/6/8 19:20:54 Tags: javascript python selenium web-scraping


Question

I'm trying to scrape real estate data from this website: example. As you can see, the relevant content is placed inside <article> tags.

I'm running Selenium with PhantomJS:

driver = webdriver.PhantomJS(executable_path=PJSpath)

Then I generate the URL in Python. Because all search criteria are part of the link, I can search for what I'm looking for from the program without filling out the form.
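
A minimal sketch of how such a search URL could be assembled from parameters, assuming the query-string layout of the example URL shown later in this thread. The helper name `build_search_url` and the meaning attributed to each parameter (postal code, price range, surface range, property types) are guesses for illustration, not something the site documents here.

```python
from urllib.parse import urlencode

# Base address taken from the example URL later in this thread.
BASE_URL = "http://www.seloger.com/list.htm"


def build_search_url(postal_code, price_min, price_max,
                     surface_min, surface_max, property_types):
    """Build a search URL; parameter meanings are assumptions (see lead-in)."""
    # A list of pairs (rather than a dict) keeps the order stable and
    # allows the same key to be repeated.
    params = [
        ("cp", postal_code),
        ("org", "advanced_search"),
        ("idtt", 2),
        ("pxmin", price_min),
        ("pxmax", price_max),
        ("surfacemin", surface_min),
        ("surfacemax", surface_max),
    ]
    # "idtypebien" appears once per selected property type.
    params += [("idtypebien", type_id) for type_id in property_types]
    return f"{BASE_URL}?{urlencode(params)}"


engine_link = build_search_url(40250, 50000, 200000, 20, 100, [2, 1, 11])
print(engine_link)
```

Passing a list of pairs to `urlencode` is what lets `idtypebien` occur several times in the query string; a plain dict would keep only one value per key.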

Before calling

driver.get(engine_link)

I copy engine_link to the clipboard, and it opens fine in Chrome. Next, I wait for all possible redirects to happen:

def wait_for_redirect(wdriver):
    elem = wdriver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 5:
            print("Waited for redirect for 5 seconds!")
            return
        time.sleep(1)
        try:
            elem = wdriver.find_element_by_tag_name("html")
        except StaleElementReferenceException:
            return

Now, at last, I want to iterate over all <article> tags on the current page:

for article in driver.find_elements_by_tag_name("article"):

But this loop never yields anything. The program doesn't find any <article> tags; I've also tried XPath and CSS selectors. Moreover, the articles are enclosed in a <section> tag, which can't be found either.

Is there a problem with this specific type of tag in Selenium, or am I missing something JavaScript-related here? At the bottom of the page there are JavaScript templates whose naming suggests that they generate the search results.

Any help is appreciated!

Recommended answer

Pretend not to be PhantomJS by setting a custom user agent, and add an explicit wait (this worked for me):

from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# set a custom user-agent
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36"
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = user_agent

driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get("http://www.seloger.com/list.htm?cp=40250&org=advanced_search&idtt=2&pxmin=50000&pxmax=200000&surfacemin=20&surfacemax=100&idtypebien=2&idtypebien=1&idtypebien=11")

# wait for articles to be present
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.TAG_NAME, "article")))

# get articles
for article in driver.find_elements_by_tag_name("article"):
    print(article.text)
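
To illustrate what an explicit wait does conceptually, here is a minimal, stdlib-only polling sketch. It is not Selenium's implementation: `wait_until` and `FakePage` are hypothetical names introduced only for this example, and `FakePage` merely stands in for a page whose articles appear after JavaScript has run.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in: articles show up only after a few polls."""

    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready

    def find_articles(self):
        if self._remaining > 0:
            self._remaining -= 1
            return []  # content not rendered yet
        return ["article 1", "article 2"]


page = FakePage(polls_until_ready=3)
articles = wait_until(page.find_articles, timeout=2.0, poll_interval=0.01)
print(articles)
```

The point of polling for a condition, rather than sleeping for a fixed time, is that the script continues as soon as the content exists and fails with a clear error if it never appears.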
