Regex pattern in Python for parsing HTML title tags

Question

I am learning to use both the re module and the urllib module in Python and am attempting to write a simple web scraper. Here's the code I've written to scrape just the titles of websites:

#!/usr/bin/python

import urllib
import re

urls=["http://google.com","https://facebook.com","http://reddit.com"]

i=0

these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)

while(i<len(urls)):
        htmlfile=urllib.urlopen(urls[i])
        htmltext=htmlfile.read()
        titles=re.findall(pattern,htmltext)
        print titles
        i+=1

This gives the correct output for Google and Reddit but not for Facebook - like so:

['Google']
[]
['reddit: the front page of the internet']

This is because, as I found, the title tag on Facebook's page looks like this: <title id="pageTitle">. To accommodate the additional id= attribute, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:

[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]

How would I combine both so that I can take into account any additional parameters passed within the title tag?

Recommended answer

You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.

BeautifulSoup example:

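import urllib2

from bs4 import BeautifulSoup

# url is any one of the entries in the question's urls list
response = urllib2.urlopen(url)
# Pass the charset from the HTTP headers so the bytes are decoded correctly
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text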
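
BeautifulSoup is only one of the available parsers. As a rough sketch of the same idea using nothing but the standard library, the following Python 3 snippet uses html.parser to collect the text of the first title element. The TitleParser class, the extract_title helper, and the sample markup are made up for illustration; they are not part of the original answer.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Attributes such as id="pageTitle" arrive in attrs and are simply ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        return "".join(self._chunks).strip()


def extract_title(html_text):
    """Return the text of the first <title> element, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


# Sample markup invented for this sketch (not fetched from the real sites).
print(extract_title('<html><head><title>Google</title></head></html>'))
print(extract_title('<head><title id="pageTitle">Welcome to Facebook</title></head>'))
```

Because the parser hands the tag name and its attributes to handle_starttag separately, an extra id= attribute needs no special handling.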

Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
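
To make the nesting problem concrete, here is a small Python 3 sketch. The div markup and the NAIVE_DIV_RE pattern are invented for illustration; the point is only that a non-greedy pattern stops at the first closing tag it sees, with no notion of which opening tag that closing tag belongs to.

```python
import re

# A naive "match everything between the tags" pattern (hypothetical example).
NAIVE_DIV_RE = re.compile(r'<div>(.*?)</div>')

nested = '<div>outer <div>inner</div> tail</div>'

# The match ends at the first </div>, which actually closes the inner div,
# so the captured text is cut off in the middle of the outer element.
matches = NAIVE_DIV_RE.findall(nested)
print(matches)  # ['outer <div>inner']
```

Making the pattern greedy does not help in general either, because with two sibling elements it would then run from the first opening tag to the last closing tag.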

Your specific problem can be solved by optionally matching additional characters within the title tag:

r'<title[^>]*>([^<]+)</title>'

The [^>]* part matches 0 or more characters that are not the closing > bracket. The '0 or more' lets you match both a tag with extra attributes and the plain <title> tag. The ([^<]+) group then captures the title text up to the next < character.
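
As a quick check of this pattern, the Python 3 sketch below runs it against three hand-written snippets that imitate the title tags described in the question. The samples list and the find_titles helper are assumptions made for the example; the real pages contain far more markup.

```python
import re

# The pattern from the answer: optional attributes, then the title text.
TITLE_RE = re.compile(r'<title[^>]*>([^<]+)</title>')

# Hand-written stand-ins for the three pages (not the real page sources).
samples = [
    '<head><title>Google</title></head>',
    '<head><title id="pageTitle">Welcome to Facebook</title></head>',
    '<head><title>reddit: the front page of the internet</title></head>',
]


def find_titles(html_text):
    """Return every title text found in html_text (illustrative helper)."""
    return TITLE_RE.findall(html_text)


for html_text in samples:
    print(find_titles(html_text))
```

Each sample yields a one-element list containing its title. For comparison, the question's second pattern, <title.+?>(.+?)</title>, finds nothing in the plain <title> samples: .+? must consume at least one character before the >, so it swallows the > that closes the opening tag and the match can no longer be completed.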