Regex pattern in Python for parsing HTML title tags
Question
I am learning to use both the re module and the urllib module in Python and am attempting to write a simple web scraper. Here's the code I've written to scrape just the title of each website:
#!/usr/bin/python
import urllib
import re

urls=["http://google.com","https://facebook.com","http://reddit.com"]
i=0
these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)

while(i<len(urls)):
    htmlfile=urllib.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext)
    print titles
    i+=1
This gives the correct output for Google and Reddit, but not for Facebook:
['Google']
[]
['reddit: the front page of the internet']
This is because, as I found, the title tag on Facebook's page looks like this: <title id="pageTitle">. To accommodate the additional id= attribute, I changed the these_regex variable to these_regex="<title.+?>(.+?)</title>". But this gives the following output:
[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]
How would I combine both so that I can account for any additional attributes inside the title tag?
Recommended answer
You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.
BeautifulSoup example:
import urllib2
from bs4 import BeautifulSoup

# url is the address of the page to fetch (Python 2, like the question's code)
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text
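If installing a third-party library is not an option, the standard library's html.parser module is another of the parsers Python offers. The following is only a minimal sketch, written for Python 3; the names TitleParser and extract_title are made up for illustration, and the HTML is passed in as a string so the example runs without network access.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are between <title ...> and </title>
        self.done = False       # True once the first title has been closed
        self.parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; attributes such as id="pageTitle" are ignored.
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the stripped title text of an HTML string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = ''.join(parser.parts).strip()
    return title or None


print(extract_title('<html><head><title id="pageTitle">Welcome to Facebook</title></head></html>'))
```

Because the parser hands over the tag name separately from its attributes, the plain <title> form and the <title id="pageTitle"> form are handled by the same code path.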
Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
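To illustrate the nesting problem, here is a small sketch (Python 3) that applies two naive patterns to made-up HTML strings. The div snippets and the helper name naive_div_contents are assumptions chosen for the demonstration; they are not taken from any real page.

```python
import re

# Two naive attempts at "give me the contents of a <div>".
NON_GREEDY = re.compile(r'<div>(.+?)</div>')
GREEDY = re.compile(r'<div>(.+)</div>')


def naive_div_contents(pattern, html):
    """Return whatever the given pattern captures between <div> and </div>."""
    return pattern.findall(html)


nested = '<div>outer <div>inner</div> tail</div>'
siblings = '<div>a</div><div>b</div>'

# The non-greedy pattern stops at the first </div>, cutting the outer element short.
print(naive_div_contents(NON_GREEDY, nested))    # ['outer <div>inner']

# The greedy pattern runs to the last </div>, gluing two sibling elements together.
print(naive_div_contents(GREEDY, siblings))      # ['a</div><div>b']
```

Neither pattern can pair each opening tag with its own closing tag, which is the kind of bookkeeping an HTML parser does for you.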
Your specific problem can be solved by optionally matching additional characters within the title tag:
r'<title[^>]*>([^<]+)</title>'
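This matches 0 or more characters that are not the closing > bracket. The '0 or more' here lets you match both extra attributes and the plain <title> tag. By contrast, the .+? in your modified pattern requires at least one character after <title, which is why the plain <title> tags on Google and Reddit stopped matching.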
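As a quick check, the pattern can be tried against a few HTML strings. This is a sketch for Python 3; the sample strings and the helper name find_titles are invented for illustration and stand in for pages you would download.

```python
import re

# Pattern from the answer: optional attributes after "<title", then the title text.
TITLE_RE = re.compile(r'<title[^>]*>([^<]+)</title>')

# Hypothetical HTML snippets standing in for downloaded pages.
samples = [
    '<html><head><title>Google</title></head></html>',
    '<html><head><title id="pageTitle">Welcome to Facebook</title></head></html>',
    '<html><head><meta charset="utf-8"></head></html>',
]


def find_titles(html):
    """Return every title text that TITLE_RE matches in the given HTML string."""
    return TITLE_RE.findall(html)


for html in samples:
    print(find_titles(html))
# ['Google']
# ['Welcome to Facebook']
# []
```

Note that the pattern is case-sensitive as written, so a page using <TITLE> would need the re.IGNORECASE flag.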