Parsing an invalid anchor tag with BeautifulSoup or regex


Problem description

I want to parse a raw document containing HTML anchor tags, but unfortunately it contains invalid tags such as:

<a href="A 4"drive bay">some text here</a>

I know the href value may not be an actual link, but let's leave it that way. What I need is to retrieve the href value 'A 4"drive bay' and the link text 'some text here'.

I am using Python and have tried the BeautifulSoup library, which works well for retrieving all the anchor tags. The problem is that it flags an error when it encounters the invalid anchor tag shown above, whose href value contains a double quote ("). Such cases exist in the original data I am parsing, and modifying that data is not an option.

A section of my Python code using BeautifulSoup is:

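from bs4 import BeautifulSoup

# `line` holds one line of the raw document
sub_s = BeautifulSoup(line)
for l in sub_s.find_all('a'):
    # replace each anchor tag with its text content
    l.replace_with(l.string)
print(str(sub_s), end=' ')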

The code just replaces each anchor tag with its plain text.

If someone could help me with this problem I would really appreciate it. A regex would also do.

Recommended answer

SELFHTML 8.1.2 (HTML documentation used very frequently in Germany) recommends the following for anchor names:


  1. First position: a Latin letter (a-z, A-Z)

  2. Later positions: Latin letters, digits (0-9), -, _ or .

I use the following regex to check the first requirement. It matches any name whose first character is not a Latin letter:

name="[^a-zA-Z]

(N.B. The leading space does not seem to be important. The pattern works in most regex implementations, e.g. the TextPad editor from Helios.)
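As a rough illustration, the first-character check above could be run from Python roughly as follows. This is only a sketch: the function name find_bad_first_char and the sample markup are made up for the example, and the pattern is the one quoted above, unchanged.

```python
import re

# Pattern from the answer: a name attribute whose first character
# is not a Latin letter.
FIRST_CHAR_INVALID = re.compile(r'name="[^a-zA-Z]')


def find_bad_first_char(html):
    """Return the lines of `html` that contain a name breaking rule 1.

    Hypothetical helper, written only for this illustration.
    """
    return [line for line in html.splitlines()
            if FIRST_CHAR_INVALID.search(line)]


# Made-up sample input: one valid name and two invalid ones.
sample = '''<a name="intro">ok</a>
<a name="1st">starts with a digit</a>
<a name="_x">starts with an underscore</a>'''

for bad_line in find_bad_first_char(sample):
    print(bad_line)
```

Note that an empty attribute (name="") is also reported, because the closing quote itself counts as a non-letter.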

To ease the work I also have a regex for the second requirement. It also catches one-character anchors (which are valid), but it helps to identify possible problems:

name=".?[^a-zA-Z0-9_\.-][^"]*"
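A small Python sketch of how this second pattern behaves may help. The helper name is_suspect and the sample tags are assumptions made for the example. As written, the pattern appears to inspect only the first two characters of the name, so an invalid character further along is not reported.

```python
import re

# Pattern from the answer, unchanged.
SUSPECT_NAME = re.compile(r'name=".?[^a-zA-Z0-9_\.-][^"]*"')


def is_suspect(tag):
    """True if the pattern flags a name attribute somewhere in `tag`.

    Hypothetical helper, written only for this illustration.
    """
    return SUSPECT_NAME.search(tag) is not None


# Made-up sample tags.
samples = (
    '<a name="a b">x</a>',            # space as the second character
    '<a name="ab">x</a>',             # valid name
    '<a name="a" href="#top">x</a>',  # valid one-character name
    '<a name="sec 2">x</a>',          # invalid character in fourth position
)

for tag in samples:
    print(tag, is_suspect(tag))
```

The third sample is the false positive mentioned above: the closing quote of the one-character name is taken for an invalid character, and the match then runs on to the opening quote of the next attribute.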

Most other problems I find with a syntax checker.
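For the anchor in the question itself, a regex along the following lines might recover the href value and the link text. This is a minimal sketch, not a general HTML parser: it assumes href is the only attribute of the tag and that the value ends at the last double quote before the closing ">". The names ANCHOR, extract_anchors and strip_anchors are made up for the example.

```python
import re

# Assumption: the tag looks like <a href="...">text</a> with no other
# attributes. The lazy group keeps growing until it reaches a double
# quote that is directly followed by ">", so quotes inside the value
# are kept.
ANCHOR = re.compile(r'<a\s+href="(.*?)"\s*>(.*?)</a>',
                    re.IGNORECASE | re.DOTALL)


def extract_anchors(text):
    """Return a list of (href, link_text) pairs found in `text`."""
    return ANCHOR.findall(text)


def strip_anchors(text):
    """Replace every matched anchor with its link text."""
    return ANCHOR.sub(lambda m: m.group(2), text)


line = 'see <a href="A 4"drive bay">some text here</a> now'
print(extract_anchors(line))
print(strip_anchors(line))
```

The second helper mirrors what the BeautifulSoup loop in the question does, replacing each anchor with its plain text, without going through an HTML parser.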
