How to extract text before "br"?

Views: 368 | Published: 2016/8/5 19:07:35 | Tags: python, html, beautifulsoup, html-parsing


Question

I have a small question. I am using Python 2.7.8. I am trying to extract the text that comes before each <br>. My HTML looks like this:

<html>
<body>
<div class="entry-content" >
<p>Here is a listing of C interview questions on "Variable Names" along with answers, explanations and/or solutions:
</p>

<p>1. C99 standard guarantees uniqueness of ____ characters for internal names.<br>
a) 31<br>
b) 63<br>
c) 12<br>
d) 14</p>
<p> more </p>
<p>2. C99 standard guarantess uniqueness of _____ characters for external names.<br>
a) 31<br>
b) 6<br>
c) 12<br>
d) 14</p>
 </div>
</body>
</html>

The code I have tried currently gets the text after each <br>, not before it. Here is the code:

from BeautifulSoup import BeautifulSoup, NavigableString, Tag
soup2 = BeautifulSoup(htmls)

for br2 in soup2.findAll('br'):
    next = br2.previousSibling
    if not (next and isinstance(next,NavigableString)):
        continue
    next2 = next.previousSibling
    if next2 and isinstance(next2,Tag) and next2.name == 'br':
        text = str(next).strip()
        if text:

            print "Found:", next.encode('utf-8')

And the output it gives me is:

Found: 
a) 31
Found: 
b) 63
Found: 
c) 12
Found:
d) 14 
a) 31
Found: 
b) 6
Found: 
c) 12
Found:
d) 14 
Found:

Any idea what I am doing wrong?

Recommended answer

First of all, I would switch to BeautifulSoup version 4. BeautifulSoup 3 is very old and is no longer maintained:

Beautiful Soup 3 has been replaced by Beautiful Soup 4.

Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib. Once the beta period is over, you should use Beautiful Soup 4 for all new projects.

Run:

pip install beautifulsoup4

And change your import statement from:

from BeautifulSoup import BeautifulSoup

to:

from bs4 import BeautifulSoup

Now, what I would do here is locate the question text and get the following br siblings with find_next_siblings() (see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-next-siblings-and-find-next-sibling). For every such sibling, take its next_sibling, which is the answer option. Working code:

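import re

from bs4 import BeautifulSoup

# `data` is assumed to hold the HTML string from the question
soup = BeautifulSoup(data, "html5lib")  # using "html5lib" parser here

# every question is a text node that starts with a number and a dot
for question in soup.find_all(text=re.compile(r"^\d+\.")):
    # the text node after each following <br> sibling is an answer option
    answers = [br.next_sibling.strip() for br in question.find_next_siblings("br")]

    print(question)
    print(answers)
    print("------")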

For the sample HTML provided in the question, it prints:

1. C99 standard guarantees uniqueness of ____ characters for internal names.
[u'a) 31', u'b) 63', u'c) 12', u'd) 14']
------
2. C99 standard guarantess uniqueness of _____ characters for external names.
[u'a) 31', u'b) 6', u'c) 12', u'd) 14']
------
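
For comparison, here is a minimal sketch of reading the text literally "before each <br>" with previous_sibling. It is not part of the original answer. The sample markup is a shortened, hypothetical version of the HTML in the question, and the built-in "html.parser" is used only so that the sketch needs no extra parser library.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample, trimmed from the HTML in the question.
html = (
    "<p>1. Sample question text.<br>\n"
    "a) 31<br>\n"
    "b) 63<br>\n"
    "c) 12<br>\n"
    "d) 14</p>"
)

# "html.parser" ships with Python, so no extra parser is needed here.
soup = BeautifulSoup(html, "html.parser")

before_br = []
for br in soup.find_all("br"):
    prev = br.previous_sibling
    # Keep only plain text nodes; skip tags and missing siblings.
    if isinstance(prev, NavigableString):
        text = prev.strip()
        if text:
            before_br.append(text)

print(before_br)
```

With this sample the list holds the question line and options a) to c). The last option is not followed by a <br>, so it is not picked up, which is one reason the answer starts from the question text and walks forward with next_sibling instead.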

Note that you might need to install the html5lib library:

pip install html5lib
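
If installing html5lib is not an option, the same approach can be sketched with a fallback to Python's built-in parser. This is an illustration under assumptions and not part of the original answer: make_soup and extract_questions are hypothetical helper names, the sample markup is made up, and the string= argument assumes a reasonably recent Beautiful Soup 4 release (older releases use text=, as in the answer's code).

```python
import re

from bs4 import BeautifulSoup, FeatureNotFound, NavigableString


def make_soup(markup):
    """Parse with html5lib when it is installed, else with html.parser."""
    try:
        return BeautifulSoup(markup, "html5lib")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


def extract_questions(markup):
    """Return (question, options) pairs, following the answer's approach."""
    soup = make_soup(markup)
    results = []
    # A question is a text node that starts with a number and a dot.
    for question in soup.find_all(string=re.compile(r"^\d+\.")):
        options = [
            br.next_sibling.strip()
            for br in question.find_next_siblings("br")
            # Guard against a <br> that is not followed by a text node.
            if isinstance(br.next_sibling, NavigableString)
        ]
        results.append((question.strip(), options))
    return results


# Hypothetical sample in the same shape as the HTML in the question.
sample = (
    "<p>1. First sample question.<br>\na) 31<br>\nb) 63</p>"
    "<p> more </p>"
    "<p>2. Second sample question.<br>\na) 31<br>\nb) 6</p>"
)

for question, options in extract_questions(sample):
    print(question, options)
```

Returning pairs instead of printing makes the result easier to reuse, for example to write the questions to a file. For well-formed markup like this sample, both parsers should produce the same pairs; on broken HTML the two parsers can build different trees.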
