How to fetch / grab a Polymer SPA webpage using Python on a headless server with no GUI

Problem description

I'm trying to grab the content of the following URL: https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html

My goal is to grab the content (source code) of the webpage as the visitor sees it, that is, after all of its JavaScript has run and the page has been rendered.

To do so, I used the example described here: http://techstonia.com/scraping-with-phantomjs-and-python.html

That example works on my server. The challenge is to make it work for Polymer-based SPA sites like the one mentioned above as well. Those sites are rendered entirely by JavaScript.

My code is as follows:

import platform
from bs4 import BeautifulSoup
from selenium import webdriver

# PhantomJS files have different extensions
# under different operating systems
if platform.system() == 'Windows':
    PHANTOMJS_PATH = './phantomjs.exe'
else:
    PHANTOMJS_PATH = './phantomjs'


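# Here we use the headless pseudo-browser PhantomJS,
# but it can be replaced with browser = webdriver.Firefox(),
# which is good for debugging.
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get('https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html')
# Parse the page source as loaded by PhantomJS and print it
# (the 'html.parser' backend is an assumption; the original parser is not shown)
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup)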

The problem is that it produces the following result:

<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<meta content="width=device-width, minimum-scale=1.0, initial-scale=1.0, user-scalable=yes" name="viewport">
<title>Single page app using Polymer</title>
<script async="" src="//www.google-analytics.com/analytics.js"></script><script src="/webcomponents.min.js"></script>
<!-- vulcanized version of imported elements --
       see "elements.html" for unvulcanized list of imports. -->
<link href="vulcanized.html" rel="import">
<link href="styles.css" rel="stylesheet" shim-shadowdom="">
</link></link></meta></meta></head>
<body fullbleed="" unresolved="">
<template id="t" is="auto-binding">
<!-- Route controller. -->
<flatiron-director autohash="" route="{{route}}"></flatiron-director>
<!-- Keyboard nav controller. -->
<core-a11y-keys id="keys" keys="up down left right space space+shift" on-keys-pressed="{{keyHandler}}" target="{{parentElement}}"></core-a11y-keys>
<core-scaffold id="scaffold">
<nav>
<core-toolbar>
<span>Single Page Polymer</span>
</core-toolbar>
<core-menu on-core-select="{{menuItemSelected}}" selected="{{route}}" selectedmodel="{{selectedPage}}" valueattr="hash">
<template repeat="{{page, i in pages}}">
<paper-item hash="{{page.hash}}" noink="">
<core-icon icon="label{{route != page.hash ? '-outline' : ''}}"></core-icon>
<a href="#{{page.hash}}">{{page.name}}</a>
</paper-item>
</template>
</core-menu>
</nav>
<core-toolbar flex="" tool="">
<div flex="">{{selectedPage.page.name}}</div>
<core-icon-button icon="refresh"></core-icon-button>
<core-icon-button icon="add"></core-icon-button>
</core-toolbar>
<div center-center="" fit="" horizontal="" layout="">
<core-animated-pages id="pages" on-tap="{{cyclePages}}" selected="{{route}}" transitions="slide-from-right" valueattr="hash">
<template repeat="{{page, i in pages}}">
<section center-center="" hash="{{page.hash}}" layout="" vertical="">
<div>{{page.name}}</div>
</section>
</template>
</core-animated-pages>
</div>
</core-scaffold>
</template>
<script src="app.js"></script>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-43475701-2', 'auto'); // ebidel's
  ga('create', 'UA-39334307-1', 'auto'); // pp.org
  ga('send', 'pageview');
</script>
</body></html>

As you can see, this is far from what you get when viewing the page in a browser. My questions are: what am I doing wrong, and, if possible, where should I look for a solution?

Recommended answer

I think you are missing something from the Selenium WebDriver docs. You can get the content of a dynamic page, but you have to make sure that the element you are searching for is present and visible on the page:

import platform
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get('https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html')

# Getting content of the first slide
res1 = browser.find_element_by_xpath('//*[@id="pages"]/section[1]/div')

# Save a screenshot so you can see why it is failing (if it is)
browser.save_screenshot('screen_test.png')

# Print the text within the div
print (res1.text)
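
If the lookup above runs before Polymer has stamped the template, the element may not exist yet. One way to handle that is to poll until the element is displayed and has text. The helper below is only a sketch: `wait_for_text`, its `timeout` and `poll` parameters, and the default values are hypothetical names chosen for illustration, and it assumes a driver exposing `find_element_by_xpath()` as in the code above. Selenium also ships its own explicit-wait support (`WebDriverWait` with `expected_conditions`), which is likely the better choice in real code.

```python
import time


def wait_for_text(driver, xpath, timeout=10.0, poll=0.25):
    """Poll until the element at `xpath` is displayed and has non-empty text.

    Hypothetical helper: `driver` is any object exposing
    find_element_by_xpath(), like the PhantomJS driver used above.
    Returns the element text, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    last_error = None
    while True:
        try:
            element = driver.find_element_by_xpath(xpath)
            text = element.text
            if element.is_displayed() and text.strip():
                return text
        except Exception as exc:
            # The element is not in the DOM yet (or was replaced); retry.
            last_error = exc
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "no visible text at %r after %s seconds" % (xpath, timeout)
            ) from last_error
        time.sleep(poll)
```

With the `browser` from the answer, a call could look like `print(wait_for_text(browser, '//*[@id="pages"]/section[1]/div'))`.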

If you also need the text of the other slides, you need to click (using the webdriver) wherever is needed to make the second slide visible before getting the text from it.
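
As a rough sketch of that idea: the markup in the question shows `on-tap="{{cyclePages}}"` on the element with `id="pages"`, so clicking that element presumably advances to the next slide. The function below assumes exactly that, plus a fixed pause for the slide transition. `collect_slide_texts`, `count`, and `pause` are hypothetical names, the number of slides has to be supplied by the caller, and whether PhantomJS delivers the click as a Polymer tap is not verified here.

```python
import time


def collect_slide_texts(driver, count, pause=1.0):
    """Read the text of `count` slides, clicking #pages between reads.

    Hypothetical helper; assumes a click on the #pages element cycles
    to the next slide and that `pause` seconds cover the transition.
    """
    texts = []
    pages = driver.find_element_by_xpath('//*[@id="pages"]')
    for index in range(1, count + 1):
        xpath = '//*[@id="pages"]/section[{}]/div'.format(index)
        # .text only returns text that is currently visible on the page
        texts.append(driver.find_element_by_xpath(xpath).text)
        if index < count:
            pages.click()      # advance to the next slide
            time.sleep(pause)  # wait for the slide transition to finish
    return texts
```

For example, `collect_slide_texts(browser, 2)` would try to read the first two slides; saving a screenshot after each click is a simple way to check that the slide really changed.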
