使用机械化和美丽的汤,在HTML中使用原始HTML与DOM进行刮刮 [英] Raw HTML vs. DOM scraping in python using mechanize and beautiful soup

查看:123
本文介绍了使用机械化和美丽的汤,在HTML中使用原始HTML与DOM进行刮刮的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试编写一个程序,例如,将刮掉这个网页的最高价格:

I am attempting to write a program that, as an example, will scrape the top price off of this web page:

http://www.kayak.com/#/flights/JFK -PAR / 2012-06-01 / 2012-07-01 / 1adults

首先,我很容易通过执行以下操作来检索HTML:

First, I am easily able to retrieve the HTML by doing the following:

from urllib import urlopen 
from BeautifulSoup import BeautifulSoup
import mechanize

webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'
br = mechanize.Browser()
data = br.open(webpage).get_data()

soup = BeautifulSoup(data)
print soup

但是,原始HTML不包含价格。浏览器是...这是事情(澄清这里也可能帮助我)...并从其他地方检索价格,而构建DOM树。

However, the raw HTML does not contain the price. The browser does...it's thing (clarification here might help me also)...and retrieves the price from elsewhere while it constructs the DOM tree.

我被领导相信机械化将像我的浏览器一样运行,并返回DOM树,我也相信我会看到,例如,Chrome的开发者工具视图(如果我对此不正确) ,我该如何去获取任何价格信息?)有没有什么我需要告诉机械化去做DOM树?

I was led to believe that mechanize would act just like my browser and return the DOM tree, which I am also led to believe is what I see when I look at, for example, Chrome's Developer Tools view of the page (if I'm incorrect about this, how do I go about getting whatever that price information is stored in?) Is there something that I need to tell mechanize to do in order to see the DOM tree?

一旦我可以将DOM树变成python,我需要做的其他事情应该是一个很好的选择。谢谢!

Once I can get the DOM tree into python, everything else I need to do should be a snap. Thanks!

推荐答案

机械化和美丽的汤是在python中无人值守的工具web-scrapping。

Mechanize and Beautiful soup are un-beatable tools web-scrapping in python.

但是您需要了解什么是什么:

But you need to understand what is meant for what:

机械化:它模仿

BeautifulSoup :HTML解析器,即使HTML格式不正确,效果很好

BeautifulSoup : HTML parser, works well even when HTML is not well-formed.

您的问题似乎是 javascript 。通过使用 javascript 通过ajax调用收集价格。 Mechanize ,但是,并不执行javascript,所以任何从javascript生成的内容将保持不可见的机械化。

Your problem seems to be javascript. The price is getting populated via an ajax call using javascript. Mechanize, however, does not do javascript, so any content that results from javascript will remain invisible to mechanize.

看看这个: http://github.com/davisp/python-spidermonkey/tree / master

这是一个机械化包装器,美丽的汤用js执行。

This does a wrapper on mechanize and Beautiful soup with js execution.

这篇关于使用机械化和美丽的汤,在HTML中使用原始HTML与DOM进行刮刮的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆