Regex pattern in Python for parsing HTML title tags
Question
I am learning to use both the re module and the urllib module in Python and am attempting to write a simple web scraper. Here's the code I've written to scrape just the title of websites:
#!/usr/bin/python
import urllib
import re

urls=["http://google.com","https://facebook.com","http://reddit.com"]
i=0
these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)
while(i<len(urls)):
    htmlfile=urllib.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext)
    print titles
    i+=1
This gives the correct output for Google and Reddit but not for Facebook - like so:
['Google']
[]
['reddit: the front page of the internet']
This is because, as I found, the title tag on Facebook's page looks like this: <title id="pageTitle">. To accommodate the additional id= attribute, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:
[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]
How would I combine both so that I can take into account any additional attributes passed within the title tag?
Recommended answer
You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.
BeautifulSoup example:
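import urllib2
from bs4 import BeautifulSoup

# Python 2, like the question's code; url can be any page address, e.g. one from the urls list.
url = "https://facebook.com"
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text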
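For comparison, here is a minimal sketch that uses only the standard library's html.parser module instead of BeautifulSoup. It is written for Python 3, unlike the snippets above, and the names TitleParser and extract_title are hypothetical helpers made up for this illustration, not part of any library.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes such as id="pageTitle" arrive in attrs and are simply ignored.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the stripped title text, or None if the page has no title."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip() or None


print(extract_title('<head><title id="pageTitle">Welcome to Facebook</title></head>'))
```

Because the parser hands over the tag name separately from its attributes, the plain <title> and the <title id="pageTitle"> cases need no special handling.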
Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
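A small illustration of the nesting problem, using made-up div snippets rather than anything from the question (Python 3). Neither the lazy nor the greedy quantifier can pair opening and closing tags correctly in every case:

```python
import re

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

# Lazy matching stops at the first </div>, cutting the outer element short.
lazy_nested = re.findall(r"<div>(.+?)</div>", nested)
print(lazy_nested)      # ['outer <div>inner']

# Greedy matching captures the outer element here...
greedy_nested = re.findall(r"<div>(.+)</div>", nested)
print(greedy_nested)    # ['outer <div>inner</div> tail']

# ...but merges two separate sibling elements into one match.
greedy_siblings = re.findall(r"<div>(.+)</div>", siblings)
print(greedy_siblings)  # ['a</div><div>b']
```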
Your specific problem can be solved by optionally matching additional characters within the title tag:
r'<title[^>]*>([^<]+)</title>'
The [^>]* part matches 0 or more characters that are not the closing > bracket. The '0 or more' here lets you match both extra attributes and the plain <title> tag.
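To see the difference, the sketch below runs the three patterns from this page against two shortened sample strings (Python 3). The sample strings are assumptions standing in for the real pages, which are much longer.

```python
import re

samples = {
    "plain": "<title>Google</title>",
    "with id": '<title id="pageTitle">Welcome to Facebook</title>',
}

patterns = {
    "original": r"<title>(.+?)</title>",         # no room for attributes
    "modified": r"<title.+?>(.+?)</title>",      # .+? demands at least one character
    "answer": r"<title[^>]*>([^<]+)</title>",    # [^>]* also accepts zero characters
}

results = {
    name: {label: re.findall(pattern, text) for label, text in samples.items()}
    for name, pattern in patterns.items()
}

for name, by_sample in results.items():
    print(name, by_sample)
# original {'plain': ['Google'], 'with id': []}
# modified {'plain': [], 'with id': ['Welcome to Facebook']}
# answer {'plain': ['Google'], 'with id': ['Welcome to Facebook']}
```

The modified pattern fails on the plain tag because .+? must consume at least one character: it swallows the > of <title> itself and then finds no later > that is still followed by a title. If a page writes the tag in uppercase, passing flags=re.IGNORECASE to re.findall would cover that as well.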