Tried Python BeautifulSoup and PhantomJS: STILL can't scrape websites

Question

You may have seen my desperate frustration on here over the past few weeks. I've been scraping some wait-time data and am still unable to grab data from these two sites:

http://www.centura.org/erwait

http://hcavirginia.com/home/

At first I tried BS4 for Python. Sample code below for HCA Virginia:

from BeautifulSoup import BeautifulSoup
import requests

url = 'http://hcavirginia.com/home/'
r = requests.get(url)

soup = BeautifulSoup(r.text)
wait_times = [span.text for span in soup.findAll('span', attrs={'class': 'ehc-er-digits'})]

fd = open('HCA_Virginia.csv', 'a')

for w in wait_times:
    fd.write(w + '\n')

fd.close()

All this does is print blanks to the console or the CSV. So I tried PhantomJS, since someone told me the data may be loaded by JS. Yet, same result! It prints blanks to the console or the CSV. Sample code below.

var page = require('webpage').create(),
url = 'http://hcavirginia.com/home/';

page.open(url, function(status) {
if (status !== "success") {
    console.log("Can't access network");
} else {
    var result = page.evaluate(function() {

        var list = document.querySelectorAll('span.ehc-er-digits'), time = [], i;
        for (i = 0; i < list.length; i++) {
            time.push(list[i].innerText);
        }
        return time;

    });
    console.log (result.join('\n'));
    var fs = require('fs');
    try 
    {                   
        fs.write("HCA_Virginia.csv", '\n' + result.join('\n'), 'a');
    } 
    catch(e) 
    {
        console.log(e); 
    } 
}

phantom.exit();
});

Same issues with Centura Health :(

What am I doing wrong?

Recommended answer

The problem you're facing is that the elements are created by JS, and it might take some time to load them. You need a scraper which handles JS, and can wait until the required elements are created.
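The "wait until the required elements are created" part can be sketched on its own, independent of any browser library. This is only a minimal Python 3 sketch under stated assumptions: `wait_until` and `FakePage` are hypothetical names made up for illustration, and `FakePage` merely simulates a page whose values show up on the third poll. In a real scraper the `poll` callable would query the rendered page instead.

```python
import time


def wait_until(poll, timeout=10.0, interval=0.5):
    """Call poll() repeatedly until it returns a non-empty result.

    Raises TimeoutError if nothing shows up within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = poll()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("elements did not appear within %.1f s" % timeout)
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a JS-driven page: values appear on the 3rd poll."""

    def __init__(self):
        self.polls = 0

    def read_digits(self):
        self.polls += 1
        return ["21", "23", "47"] if self.polls >= 3 else []


page = FakePage()
digits = wait_until(page.read_digits, timeout=5.0, interval=0.01)
print(digits)
```

A plain `requests.get` corresponds to polling exactly once, right after the initial HTML arrives, which would explain why the spans come back blank.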

You can use PyQt4. Adapting this recipe from webscraping.com and an HTML parser like BeautifulSoup, this is pretty easy:

(After writing this, I found the webscraping library for Python. It might be worth a look.)

import sys
from bs4 import BeautifulSoup
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import * 

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()   

url = 'http://hcavirginia.com/home/'
r = Render(url)
soup = BeautifulSoup(unicode(r.frame.toHtml()))
# In Python 3.x, don't unicode the output from .toHtml(): 
#soup = BeautifulSoup(r.frame.toHtml()) 
# int() needs the tag's text, not the Tag object itself
nums = [int(span.text) for span in soup.find_all('span', class_='ehc-er-digits')]
print nums

Output:

[21, 23, 47, 11, 10, 8, 68, 56, 19, 15, 7]
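To get these numbers into the CSV file mentioned in the question, something along these lines should work. It is a minimal Python 3 sketch using only the standard library: `append_wait_times` is a hypothetical helper name, the file is written to a temporary directory purely for demonstration, and the numbers are just the sample output above.

```python
import csv
import os
import tempfile


def append_wait_times(path, wait_times):
    """Append one wait time per row to the CSV file at `path`."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="") as fd:
        writer = csv.writer(fd)
        for minutes in wait_times:
            writer.writerow([minutes])


nums = [21, 23, 47, 11, 10, 8, 68, 56, 19, 15, 7]
csv_path = os.path.join(tempfile.mkdtemp(), "HCA_Virginia.csv")
append_wait_times(csv_path, nums)

# Read the file back to check what was written.
with open(csv_path, newline="") as fd:
    rows = [int(row[0]) for row in csv.reader(fd)]
print(rows)
```

Using `with open(...)` closes the file even if a write fails, which the manual `fd.close()` in the question does not guarantee.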


This is my original answer, using ghost.py:

I managed to hack something together for you using ghost.py (tested on Python 2.7, ghost.py 0.1b3 and PyQt4-4 32-bit). I wouldn't recommend using this in production code, though!

from ghost import Ghost
from time import sleep

ghost = Ghost(wait_timeout=50, download_images=False)
page, extra_resources = ghost.open('http://hcavirginia.com/home/',
                                   headers={'User-Agent': 'Mozilla/4.0'})

# Halt execution of the script until a span.ehc-er-digits is found in 
# the document
page, resources = ghost.wait_for_selector("span.ehc-er-digits")

# It should be possible to simply evaluate
# "document.getElementsByClassName('ehc-er-digits');" and extract the data from
# the returned dictionary, but I didn't quite understand the
# data structure - hence this inline javascript.
nums, resources = ghost.evaluate(
    """
    elems = document.getElementsByClassName('ehc-er-digits');
    nums = []
    for (i = 0; i < elems.length; ++i) {
        nums[i] = elems[i].innerHTML;
    }
    nums;
    """)

wt_data = [int(x) for x in nums]
print wt_data
sleep(30) # Sleep a while to avoid the crashing of the script. Weird issue!

Some remarks:


  • As you can see from my comments, I didn't quite figure out the structure of the dict returned by Ghost.evaluate("document.getElementsByClassName('ehc-er-digits');") - it's probably possible to find the information needed using such a query, though.

  • I also had some problems with the script crashing at the end. Sleeping for 30 seconds fixed the issue.
