BeautifulSoup returning incorrect text
Question
I'm trying to scrape the site below for live tennis scores. Once a match is over, the elements I'm scraping change and I can get the score. During a match, however, when I search for the 'span' class that holds the score, I get the element back but the score is blank (see below).
http://www.scoreboard.com/game/6LeqhPJd/#game-summary
score = score.findAll('span',attrs={'class':'scoreboard'})
Output:
[<span class="scoreboard">-</span>, <span class="scoreboard">-</span>]
Expected output:
[<span class="scoreboard">1</span>, <span class="scoreboard">0</span>]
Using Firebug I can see the score inside these elements, but I can't seem to retrieve it. Does anyone know why this happens?
NOTE: Once the match at the above URL has finished, the element holding the score changes. This is only a problem for LIVE matches.
Recommended answer
The webpage uses JavaScript. If you download the URL with urllib, the JavaScript is not executed, so much of the HTML you see in the browser is never generated.
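To make the difference concrete, here is a minimal sketch that uses only the standard library's html.parser (Python 3). The two HTML strings are hypothetical stand-ins, not markup captured from the site: one imitates what a plain HTTP download might return during a live match, the other imitates the DOM after JavaScript has filled in the scores. The class name ScoreboardParser and the helper extract_scores are made up for this illustration.

```python
from html.parser import HTMLParser


class ScoreboardParser(HTMLParser):
    """Collect the text inside every <span class="scoreboard">."""

    def __init__(self):
        super().__init__()
        self.scores = []
        self._in_score = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "scoreboard") in attrs:
            self._in_score = True
            self.scores.append("")

    def handle_data(self, data):
        if self._in_score:
            self.scores[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_score = False


def extract_scores(html):
    """Return the text of each scoreboard span found in html."""
    parser = ScoreboardParser()
    parser.feed(html)
    parser.close()
    return parser.scores


# Hypothetical markup: what the server sends before any JavaScript runs.
STATIC_HTML = '<p><span class="scoreboard">-</span>:<span class="scoreboard">-</span></p>'
# Hypothetical markup: the same fragment after JavaScript has updated it.
RENDERED_HTML = '<p><span class="scoreboard">1</span>:<span class="scoreboard">0</span></p>'

print(extract_scores(STATIC_HTML))    # ['-', '-']
print(extract_scores(RENDERED_HTML))  # ['1', '0']
```

The parser behaves the same in both cases; only the input differs. That is why switching HTML parsers does not help here, while obtaining the rendered markup does.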
One way to execute the JavaScript is to use Selenium. Another way is to use PyQt4:
import sys
from PyQt4 import QtWebKit
from PyQt4 import QtCore
from PyQt4 import QtGui


class Render(QtWebKit.QWebPage):
    """Load a URL in QtWebKit and block until the page has finished loading."""

    def __init__(self, url):
        self.app = QtGui.QApplication(sys.argv)
        QtWebKit.QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QtCore.QUrl(url))
        self.app.exec_()  # run the event loop until _loadFinished quits it

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()


url = 'http://www.scoreboard.com/game/6LeqhPJd/#game-summary'
r = Render(url)
content = unicode(r.frame.toHtml())  # Python 2; the HTML after JavaScript has run
Once you have content (after the JavaScript has been executed), you can parse it with an HTML parser (like BeautifulSoup or lxml).
For example, using lxml:
import lxml.html as LH


def clean(text):
    # drop non-breaking spaces
    return text.replace(u'\xa0', u'')


doc = LH.fromstring(content)
result = []
# select every table row that contains a cell with the summary class
for tr in doc.xpath('//tr[td[@class="left summary-horizontal"]]'):
    row = []
    for elt in tr.xpath('td'):
        row.append(clean(elt.text_content()))
    result.append(u', '.join(row[1:]))  # skip the first cell of each row
print(u'\n'.join(result))
yields
Chardy J. (Fra), 2, 6, 77, , , ,
Zeballos H. (Arg), 0, 4, 63, , , ,
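The row-selection logic can be tried without a browser or lxml. The sketch below (Python 3, standard library only) reimplements it with xml.etree.ElementTree on a small table. The table in SAMPLE is a hypothetical, well-formed stand-in for the rendered page, not the site's real markup, and summary_rows is a name invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rendered markup; the real page structure may differ.
SAMPLE = """
<table>
  <tr><td class="left summary-horizontal">&#160;</td><td>Chardy J. (Fra)</td><td>2</td><td>6</td></tr>
  <tr><td class="left summary-horizontal">&#160;</td><td>Zeballos H. (Arg)</td><td>0</td><td>4</td></tr>
  <tr><td class="other">ignored row</td></tr>
</table>
"""


def clean(text):
    """Drop non-breaking spaces, as in the lxml version."""
    return text.replace("\xa0", "")


def summary_rows(markup):
    """Return one comma-joined string per row that has a summary cell."""
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.iter("tr"):
        cells = tr.findall("td")
        # keep only rows with at least one cell carrying the summary class
        if not any(td.get("class") == "left summary-horizontal" for td in cells):
            continue
        texts = [clean("".join(td.itertext())) for td in cells]
        rows.append(", ".join(texts[1:]))  # skip the first cell of each row
    return rows


print("\n".join(summary_rows(SAMPLE)))
```

ElementTree only accepts well-formed XML, so for real-world HTML a forgiving parser such as lxml.html or BeautifulSoup remains the practical choice.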
Using Selenium and PhantomJS (so that a GUI browser doesn't pop up), this is what the equivalent code would look like:
import selenium.webdriver as webdriver
import contextlib
import os
import lxml.html as LH

# define path to the phantomjs binary
phantomjs = os.path.expanduser('~/bin/phantomjs')
url = 'http://www.scoreboard.com/game/6LeqhPJd/#game-summary'
# contextlib.closing calls driver.close() when the block exits
with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    content = driver.page_source  # the HTML after JavaScript has run

doc = LH.fromstring(content)
result = []
for tr in doc.xpath('//tr[td[@class="left summary-horizontal"]]'):
    row = []
    for elt in tr.xpath('td'):
        row.append(elt.text_content())
    result.append(u', '.join(row[1:]))  # skip the first cell of each row
print(u'\n'.join(result))
Both the Selenium/PhantomJS solution and the PyQt4 solution take about the same amount of time to run.