Parsing a non-escaped apostrophe in a single-quoted attribute value with BeautifulSoup


Question

From a webpage, I want to get all the links and their title strings. I use BeautifulSoup 4 for scraping. The links on the webpage look like this:

<a href='http://www.example1.com' title='A small secret for better estimates #4/16/2014 8:10:30 AM'> Example 1 </a>
<a href='http://www.example2.com' title='Don't make me think #4/9/2014 4:36:07 AM'> Example 2</a>

The scraping solution works well:

# Import (Python 3; urllib.urlopen from Python 2 is now urllib.request.urlopen)
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Parse
url = "http://www.website-to-scrape.com"
with urlopen(url) as sock:
    htmlsrc = sock.read()
html = BeautifulSoup(htmlsrc, "html.parser")

# Keep only the <a> tags that carry both an href and a title attribute
alllinks = html.find_all('a', href=True, title=True)

# Iterate over the tags directly so the last link is not skipped
for tag in alllinks:
    link = tag['href']
    title = tag['title']
    print(title)

Problem: BeautifulSoup does not seem to know how to handle a single quote (') inside a string properly.

So for Example 2, it only outputs Don:

A small secret for better estimates #4/16/2014 8:10:30 AM
Don

Recommended answer

The problem is not BeautifulSoup but your HTML, which is invalid. According to the HTML specification, a single-quoted attribute value has the following syntax:

The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single U+0027 APOSTROPHE character ('), followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0027 APOSTROPHE characters ('), and finally followed by a second single U+0027 APOSTROPHE character (').

All of the parsers supported by BeautifulSoup will try to parse the invalid HTML in your question, but none of them will do what you want:

>>> BeautifulSoup(src, "html.parser")

<a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a #4="" 2014="" 4:36:07="" 9="" am'="" href="http://www.example2.com" make="" me="" t="" think="" title="Don"> Example 2</a>

>>> BeautifulSoup(src, "lxml")

<html><body><a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a am="" href="http://www.example2.com" make="" me="" t="" think="" title="Don"> Example 2</a>
</body></html>

>>> BeautifulSoup(src, "html5lib")

<html><head></head><body><a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a #4="" 2014="" 4:36:07="" 9="" am'="" href="http://www.example2.com" make="" me="" t="" think="" title="Don"> Example 2</a>
</body></html>
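
The same splitting can be reproduced without BeautifulSoup. The sketch below uses Python's standard-library HTMLParser, which BeautifulSoup's "html.parser" option builds on. AttrCollector is a hypothetical helper written for this illustration, not part of either library, and the exact set of stray attributes may differ between Python versions.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when absent
        self.tags.append((tag, dict(attrs)))


# The invalid markup from the question, unchanged
src = """
<a href='http://www.example1.com' title='A small secret for better estimates #4/16/2014 8:10:30 AM'> Example 1 </a>
<a href='http://www.example2.com' title='Don't make me think #4/9/2014 4:36:07 AM'> Example 2</a>
"""

collector = AttrCollector()
collector.feed(src)
collector.close()

# Show the title each tag ended up with, plus all attribute names found
for tag, attrs in collector.tags:
    print(tag, repr(attrs.get("title")), sorted(attrs))
```

For the second link the title stops at the first apostrophe, and the rest of the intended value is read as a series of separate attribute names, which matches the output above.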

Neither will any modern browser, including Firefox, Chrome, and IE 11.

If you want to represent an apostrophe inside a single-quoted attribute value, you need to use the &apos; character entity reference (the numeric reference &#39; works as well):

>>> BeautifulSoup("""
... <a href='http://www.example1.com' title='A small secret for better estimates #4/16/2014 8:10:30 AM'> Example 1 </a>
... <a href='http://www.example2.com' title='Don&apos;t make me think #4/9/2014 4:36:07 AM'> Example 2</a>
... """)

<html><body><a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a href="http://www.example2.com" title="Don't make me think #4/9/2014 4:36:07 AM"> Example 2</a>
</body></html>
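
If you generate the HTML yourself, one way to get this escaping right is to let the standard library do it. The sketch below assumes Python 3, where html.escape with quote=True replaces the apostrophe with the numeric reference &#x27;. It then parses the result with the standard-library HTMLParser to check that the title survives. AttrCollector and the variable names are hypothetical, chosen only for this illustration.

```python
from html import escape
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The parser has already decoded character references in the values
        self.tags.append(dict(attrs))


title = "Don't make me think #4/9/2014 4:36:07 AM"

# quote=True also escapes " and ', so the value is safe inside either quote style
safe_title = escape(title, quote=True)

markup = "<a href='http://www.example2.com' title='{}'> Example 2</a>".format(safe_title)

collector = AttrCollector()
collector.feed(markup)
collector.close()

parsed_title = collector.tags[0]["title"]
print(safe_title)
print(parsed_title)
```

The escaped value contains no literal apostrophe, so the attribute satisfies the syntax quoted above, and the parser hands back the original title.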

Alternatively, you can use a double-quoted attribute value:

>>> BeautifulSoup("""
... <a href='http://www.example1.com' title='A small secret for better estimates #4/16/2014 8:10:30 AM'> Example 1 </a>
... <a href='http://www.example2.com' title="Don't make me think #4/9/2014 4:36:07 AM"> Example 2</a>
... """)

<html><body><a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a href="http://www.example2.com" title="Don't make me think #4/9/2014 4:36:07 AM"> Example 2</a>
</body></html>
