Going through HTML DOM in Python
Question
I'm looking to write a Python script (using 3.4.3) that grabs an HTML page from a URL and can go through the DOM to try to find a specific element.
This is what I currently have:
#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)
When I print content it prints out the entire HTML page, which is close to what I want... although I would ideally like to be able to navigate through the DOM rather than treating it as a giant string.
I'm still fairly new to Python but have experience with multiple other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar with Java before but wanted to try it out in Python.
Any help is appreciated. Cheers!
Recommended answer
There are many different modules you could use. For example, lxml or BeautifulSoup.
Here is an lxml example:
import urllib.request
import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)
# xpath() returns a list of matches; take the first <meta name="description"> element
description = lxml_mysite.xpath("//meta[@name='description']")[0]
text = description.get('content')  # value of the tag's content attribute
>>> print(text)
Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.
Here is a BeautifulSoup example:
import urllib.request
from bs4 import BeautifulSoup

mysite = urllib.request.urlopen('http://www.google.com').read()
# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning
soup_mysite = BeautifulSoup(mysite, 'html.parser')
# find() returns the first <meta> tag whose name attribute is "description"
description = soup_mysite.find("meta", {"name": "description"})
text = description['content']  # value of the tag's content attribute
>>> print(text)
Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.
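Note that on Python 2, BeautifulSoup returns a unicode string, while lxml returns a plain byte string when the value is ASCII-only. This can be useful or hurtful depending on what is needed. On Python 3, both calls return a str.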
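If installing a third-party module is not an option, the same "find one specific element" task can be roughly sketched with the standard library's html.parser. This is only an illustration, not part of the original answer: the class name MetaDescriptionFinder and the SAMPLE_HTML page are made up for the example, and an inline string stands in for the downloaded page so the snippet runs without network access. Unlike lxml or BeautifulSoup, HTMLParser does not build a tree; it only reports tags as it encounters them.

```python
from html.parser import HTMLParser


class MetaDescriptionFinder(HTMLParser):
    """Hypothetical helper: collects the content of <meta name="description"> tags."""

    def __init__(self):
        super().__init__()
        self.descriptions = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("name") == "description":
            self.descriptions.append(attributes.get("content"))


# Made-up sample page, used here instead of fetching a live URL.
SAMPLE_HTML = """
<html>
  <head>
    <title>Example</title>
    <meta name="description" content="A sample page.">
  </head>
  <body><p>Hello</p></body>
</html>
"""

finder = MetaDescriptionFinder()
finder.feed(SAMPLE_HTML)
print(finder.descriptions)
```

To use it on a real page, the bytes returned by urllib.request.urlopen(url).read() would need to be decoded to a str before being passed to feed(). For anything beyond simple lookups, lxml or BeautifulSoup remain the more convenient choice because they let you navigate an actual tree.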