BeautifulSoup returning incorrect text


Question

I'm trying to scrape the site below for live tennis scores. Once a match is over, the elements I'm scraping change and I can get the score. During a match, however, searching for the 'span' class that holds the score returns the elements, but the score is blank (see below).

http://www.scoreboard.com/game/6LeqhPJd/#game-summary

score = score.findAll('span',attrs={'class':'scoreboard'})

Output:

[<span class="scoreboard">-</span>, <span class="scoreboard">-</span>]

Expected output:

[<span class="scoreboard">1</span>, <span class="scoreboard">0</span>]

Using Firebug I can see the score inside these elements, but I can't seem to retrieve it. Does anyone know why this happens?

NOTE: Once the match at the above URL has finished, the element holding the score changes. This is only a problem for LIVE matches.

Recommended answer

The webpage uses JavaScript. If you download the URL with urllib, the JavaScript is never executed, so much of the HTML you see in the browser is never generated.
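To make the symptom concrete, here is a minimal sketch that uses only the standard library's html.parser. The STATIC_HTML string is a hypothetical stand-in for what a plain HTTP download might return during a live match (it is not the site's real markup), and ScoreboardParser and scoreboard_texts are illustrative names. It shows that parsing un-rendered HTML can only ever return the placeholder the server sent.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the HTML a plain HTTP download returns during
# a live match: the score spans exist but hold only a placeholder.
STATIC_HTML = (
    '<div><span class="scoreboard">-</span>'
    '<span class="scoreboard">-</span></div>'
)


class ScoreboardParser(HTMLParser):
    """Collect the text of every <span class="scoreboard"> element."""

    def __init__(self):
        super().__init__()
        self.scores = []
        self._in_score = False

    def handle_starttag(self, tag, attrs):
        # Start collecting text when a scoreboard span opens.
        if tag == "span" and dict(attrs).get("class") == "scoreboard":
            self._in_score = True
            self.scores.append("")

    def handle_data(self, data):
        if self._in_score:
            self.scores[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_score = False


def scoreboard_texts(html):
    """Return the text content of each scoreboard span in html."""
    parser = ScoreboardParser()
    parser.feed(html)
    parser.close()
    return parser.scores


print(scoreboard_texts(STATIC_HTML))  # prints ['-', '-']
```

The parser is doing its job correctly here. The real scores are simply not in the downloaded document until a script writes them in, which is why a JavaScript-capable renderer is needed.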

One way to execute the JavaScript is to use Selenium. Another way is to use PyQt4:

import sys
from PyQt4 import QtWebKit
from PyQt4 import QtCore
from PyQt4 import QtGui

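class Render(QtWebKit.QWebPage):
    """Load a URL in a QWebPage (no window is shown) and keep the rendered frame."""

    def __init__(self, url):
        # A QApplication must exist before any Qt web object is created.
        self.app = QtGui.QApplication(sys.argv)
        QtWebKit.QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QtCore.QUrl(url))
        # Block in the Qt event loop until _loadFinished calls quit().
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep the frame so its HTML can be read after the event loop exits.
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://www.scoreboard.com/game/6LeqhPJd/#game-summary'
r = Render(url)
# This code targets Python 2; the unicode() built-in does not exist on Python 3.
content = unicode(r.frame.toHtml())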

Once you have content (after the JavaScript has been executed), you can parse it with an HTML parser such as BeautifulSoup or lxml.

For example, using lxml:

import lxml.html as LH

def clean(text):
    return text.replace(u'\xa0', u'')

doc = LH.fromstring(content)   
result = []
for tr in doc.xpath('//tr[td[@class="left summary-horizontal"]]'):
    row = []
    for elt in tr.xpath('td'):
        row.append(clean(elt.text_content()))
    result.append(u', '.join(row[1:]))
print(u'\n'.join(result))

This yields:

Chardy J. (Fra), 2, 6, 77, , , , 
Zeballos H. (Arg), 0, 4, 63, , , , 
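If the comma-joined rows need further processing, one option is to split each row back into a player name and its non-empty cells. The sketch below assumes the player name contains no comma. parse_row is a hypothetical helper, not part of the original answer, and the cells stay as strings because the meaning of each column is not interpreted here.

```python
def parse_row(line):
    """Split one comma-separated output row into (player, scores)."""
    cells = [cell.strip() for cell in line.split(",")]
    player, rest = cells[0], cells[1:]
    # Drop the empty cells left by columns that have no value yet.
    scores = [cell for cell in rest if cell]
    return player, scores


rows = [
    "Chardy J. (Fra), 2, 6, 77, , , , ",
    "Zeballos H. (Arg), 0, 4, 63, , , , ",
]
for row in rows:
    print(parse_row(row))
# prints ('Chardy J. (Fra)', ['2', '6', '77'])
# prints ('Zeballos H. (Arg)', ['0', '4', '63'])
```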

Using Selenium and PhantomJS (so that a GUI browser doesn't pop up), the equivalent code would look like this:

import selenium.webdriver as webdriver
import contextlib
import os
import lxml.html as LH

# define path to the phantomjs binary
phantomjs = os.path.expanduser('~/bin/phantomjs')
url = 'http://www.scoreboard.com/game/6LeqhPJd/#game-summary'
with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    content = driver.page_source
    doc = LH.fromstring(content)   
    result = []
    for tr in doc.xpath('//tr[td[@class="left summary-horizontal"]]'):
        row = []
        for elt in tr.xpath('td'):
            row.append(elt.text_content())
        result.append(u', '.join(row[1:]))
    print(u'\n'.join(result))

Both the Selenium/PhantomJS solution and the PyQt4 solution take about the same amount of time to run.
