Python Regex can't find substring but it should

Problem description

I am trying to parse HTML using BeautifulSoup to extract the webpage title. Sometimes this does not work because the website is badly written, for example when it has a bad end tag. When that happens, I fall back to a manual regex.

I have this text:

<html xmlns="http://www.w3.org/1999/xhtml"\n      xmlns:og="http://ogp.me/ns#"\n      xmlns:fb="https://www.facebook.com/2008/fbml">\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>\n    <title>\n                    .@wolfblitzercnn prepping questions for the Cheney intvw. @CNNSitRoom today. 5p. \n            </title>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />...

I am trying to grab the value between the <title> and </title> tags. It should be fairly simple, but it is not working. Here's my Python code for it.

result = re.search('\<title\>(.+?)\</title\>', html)
if result is not None:
    title = result.group(0)

This does not work on this text for whatever reason. Either the result comes back as None, or I get an AttributeError: AttributeError: 'NoneType' object has no attribute 'groups'

I've copied and pasted this text into online Python regex testers and tried all the options (re.match, re.findall, re.search), and they work there, but for whatever reason my script cannot find anything between these tags. I even tried other regexes, such as

<title>(.*?)</title>

Recommended answer

You should use the DOTALL flag so that . matches newline characters as well. In your text there are newlines between <title> and </title>, so without the flag the pattern cannot match across them.

result = re.search(r'<title>(.+?)</title>', html, re.DOTALL)

As the documentation says:

...without this flag, '.' will match anything except a newline
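
To see the difference the flag makes, here is a minimal sketch. The html string is a shortened, made-up stand-in for the page in the question, and the variable names are only illustrative. It also uses group(1) rather than group(0), on the assumption that only the text between the tags is wanted, and strips the surrounding whitespace.

```python
import re

# Hypothetical sample input: a <title> whose text spans several lines.
html = "<head>\n    <title>\n        Example page title \n    </title>\n</head>"

pattern = r"<title>(.+?)</title>"

# Default behaviour: '.' does not match '\n', so the search finds nothing.
no_flag = re.search(pattern, html)

# With re.DOTALL, '.' also matches '\n', so the match can span lines.
with_flag = re.search(pattern, html, re.DOTALL)

# group(1) is only the captured text; group(0) would include the tags.
title = with_flag.group(1).strip() if with_flag else None

print(no_flag)
print(repr(title))
```

Checking the result for None before calling group() is what avoids the AttributeError on 'NoneType' when the pattern does not match.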
