Going through HTML DOM in Python

Question

I'm looking to write a Python script (using 3.4.3) that grabs an HTML page from a URL and can go through the DOM to try to find a specific element.

I currently have this:

#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)

When I print content, it prints the entire HTML page, which is close to what I want, although ideally I would like to navigate through the DOM rather than treating it as a giant string.

I'm still fairly new to Python but have experience with multiple other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar with Java before but wanted to try it out in Python.

Any help is appreciated. Cheers!

Recommended answer

There are many different modules you could use. For example, lxml or BeautifulSoup.
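
For comparison, the standard library also ships html.parser, which needs no installation. It is event-driven rather than a navigable DOM, so it suits simple lookups more than free traversal. The following is only a minimal sketch: the class name MetaDescriptionFinder, the helper find_description and the inline SAMPLE_HTML page are made up for illustration, and the sample stands in for a fetched page so the snippet runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical sample page, standing in for the bytes returned by urlopen(url).read()
SAMPLE_HTML = b"""<html><head>
<meta charset="utf-8">
<meta name="description" content="An example page.">
<title>Example</title>
</head><body><p>Hello</p></body></html>"""


class MetaDescriptionFinder(HTMLParser):
    """Records the content attribute of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased
        if tag == "meta" and self.description is None:
            attr_map = dict(attrs)
            if attr_map.get("name") == "description":
                self.description = attr_map.get("content")


def find_description(raw_html, encoding="utf-8"):
    """Decode the raw bytes (the encoding is an assumption) and return the description or None."""
    finder = MetaDescriptionFinder()
    finder.feed(raw_html.decode(encoding, errors="replace"))
    finder.close()
    return finder.description


if __name__ == "__main__":
    print(find_description(SAMPLE_HTML))
```

Once the lookup needs parent/child navigation, XPath or CSS selectors, a tree-building library such as the two named above is usually the more comfortable choice.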

Here is an lxml example:

import urllib.request

import lxml.html

# Fetch the page as bytes and parse it into an element tree
mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)

# xpath() returns a list of matches; take the first <meta name="description"> tag
description = lxml_mysite.xpath("//meta[@name='description']")[0]
text = description.get('content')  # content attribute of the tag

print(text)
# Output reported in the original answer:
# "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

A BeautifulSoup example:

import urllib.request

from bs4 import BeautifulSoup

mysite = urllib.request.urlopen('http://www.google.com').read()
# Name the parser explicitly; otherwise bs4 picks the best installed one and warns
soup_mysite = BeautifulSoup(mysite, 'html.parser')

# find() returns the first matching tag, or None if nothing matches
description = soup_mysite.find("meta", {"name": "description"})
text = description['content']  # text of content attribute

print(text)
# Output reported in the original answer:
# u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."



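Notice that in the output above BeautifulSoup returns a unicode string (the u prefix), while lxml does not. This can help or hurt depending on what you need. On Python 3, as used in the question, both libraries return str, so the difference only shows up on Python 2.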
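
A related string-type detail on Python 3: urlopen(url).read() returns bytes rather than str, which is why printing it shows a b'...' literal. Below is a minimal sketch of decoding such a body, assuming the charset comes from the response headers when available and falling back to UTF-8 otherwise. decode_body is a hypothetical helper, and the sample bytes replace a real response so the snippet runs offline.

```python
def decode_body(raw, declared_charset=None, fallback="utf-8"):
    """Turn response bytes into str.

    With urllib, declared_charset would typically come from
    response.headers.get_content_charset(), which may return None.
    """
    charset = declared_charset or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: use the fallback and replace bad bytes
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    raw = "<p>caf\u00e9</p>".encode("utf-8")  # stand-in for urlopen(url).read()
    text = decode_body(raw)
    print(type(raw).__name__, type(text).__name__)
    print(text)
```

Both lxml and BeautifulSoup accept the raw bytes directly and detect the encoding themselves, so explicit decoding like this mainly matters when the page is handled as plain text.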
