How to get an HTML file using Python?
Problem description
I am not very familiar with Python. I am trying to extract the artist names (for a start :)) from the following page: http://www.infolanka.com/miyuru_gee/art/art.html.
How do I retrieve the page? My two main concerns are: which functions to use, and how to filter out useless links from the page?
Solution
Example using urllib and lxml.html:
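# Python 2 code: urllib.urlopen and the print statement do not exist in Python 3.
import urllib
from lxml import html

url = "http://www.infolanka.com/miyuru_gee/art/art.html"
# Download the page and parse it into an lxml element tree.
page = html.fromstring(urllib.urlopen(url).read())
# Visit every <a> element and print its text and href attribute.
for link in page.xpath("//a"):
    print "Name", link.text, "URL", link.get("href")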
output >>
[('Aathma Liyanage', 'athma.html'),
('Abewardhana Balasuriya', 'abewardhana.html'),
('Aelian Thilakeratne', 'aelian_thi.html'),
('Ahamed Mohideen', 'ahamed.html'),
]
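The answer above uses Python 2 calls and does not address the second half of the question, filtering out useless links. The following is a minimal Python 3 sketch of both steps, not the answerer's own method. It swaps lxml for the standard library's html.parser so that it runs without extra packages. The names LinkCollector, extract_links, looks_like_artist and fetch_artists are hypothetical, and the filter rule (a non-empty link text plus a relative href ending in ".html") is only an assumption based on the sample output shown above. The page encoding is also assumed to be UTF-8 compatible.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def extract_links(markup):
    """Return all (text, href) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


def looks_like_artist(text, href):
    """Assumed heuristic: artist links have text and a bare 'name.html' href."""
    return bool(text) and href.endswith(".html") and "/" not in href


def fetch_artists(url):
    """Download a page and return the links that pass the heuristic."""
    with urlopen(url) as response:
        # Assumption: the page decodes acceptably as UTF-8.
        markup = response.read().decode("utf-8", errors="replace")
    return [(t, h) for t, h in extract_links(markup) if looks_like_artist(t, h)]
```

Calling fetch_artists("http://www.infolanka.com/miyuru_gee/art/art.html") would then return a list of (name, href) tuples. The heuristic will likely need adjusting after inspecting the real page, for example if navigation links also use bare ".html" targets.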