使用机械化和美丽的汤,在HTML中使用原始HTML与DOM进行刮刮 [英] Raw HTML vs. DOM scraping in python using mechanize and beautiful soup
问题描述
我正在尝试编写一个程序,例如,将刮掉这个网页的最高价格:
I am attempting to write a program that, as an example, will scrape the top price off of this web page:
http://www.kayak.com/#/flights/JFK -PAR / 2012-06-01 / 2012-07-01 / 1adults
首先,我很容易通过执行以下操作来检索HTML:
First, I am easily able to retrieve the HTML by doing the following:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import mechanize
webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'
br = mechanize.Browser()
data = br.open(webpage).get_data()
soup = BeautifulSoup(data)
print soup
但是,原始HTML不包含价格。浏览器是...这是事情(澄清这里也可能帮助我)...并从其他地方检索价格,而构建DOM树。
However, the raw HTML does not contain the price. The browser does...it's thing (clarification here might help me also)...and retrieves the price from elsewhere while it constructs the DOM tree.
我被领导相信机械化将像我的浏览器一样运行,并返回DOM树,我也相信我会看到,例如,Chrome的开发者工具视图(如果我对此不正确) ,我该如何去获取任何价格信息?)有没有什么我需要告诉机械化去做DOM树?
I was led to believe that mechanize would act just like my browser and return the DOM tree, which I am also led to believe is what I see when I look at, for example, Chrome's Developer Tools view of the page (if I'm incorrect about this, how do I go about getting whatever that price information is stored in?) Is there something that I need to tell mechanize to do in order to see the DOM tree?
一旦我可以将DOM树变成python,我需要做的其他事情应该是一个很好的选择。谢谢!
Once I can get the DOM tree into python, everything else I need to do should be a snap. Thanks!
推荐答案
机械化和美丽的汤是在python中无人值守的工具web-scrapping。
Mechanize and Beautiful soup are un-beatable tools web-scrapping in python.
但是您需要了解什么是什么:
But you need to understand what is meant for what:
机械化
:它模仿
BeautifulSoup
:HTML解析器,即使HTML格式不正确,效果很好
BeautifulSoup
: HTML parser, works well even when HTML is not well-formed.
您的问题似乎是 javascript
。通过使用 javascript
通过ajax调用收集价格。 Mechanize
,但是,并不执行javascript,所以任何从javascript生成的内容将保持不可见的机械化。
Your problem seems to be javascript
. The price is getting populated via an ajax call using javascript
. Mechanize
, however, does not do javascript, so any content that results from javascript will remain invisible to mechanize.
看看这个: http://github.com/davisp/python-spidermonkey/tree / master
这是一个机械化包装器,美丽的汤用js执行。
This does a wrapper on mechanize and Beautiful soup with js execution.
这篇关于使用机械化和美丽的汤,在HTML中使用原始HTML与DOM进行刮刮的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!