How to extract text before "br"?
Question
I have a small question. I am using Python 2.7.8 and I am trying to extract the text that comes before each <br>. I have HTML like this:
<html>
<body>
<div class="entry-content" >
<p>Here is a listing of C interview questions on "Variable Names" along with answers, explanations and/or solutions:
</p>
<p>1. C99 standard guarantees uniqueness of ____ characters for internal names.<br>
a) 31<br>
b) 63<br>
c) 12<br>
d) 14</p>
<p> more </p>
<p>2. C99 standard guarantess uniqueness of _____ characters for external names.<br>
a) 31<br>
b) 6<br>
c) 12<br>
d) 14</p>
</div>
</body>
</html>
The code I have tried currently gets the text after <br>, not the text before it. Here is the code:
from BeautifulSoup import BeautifulSoup, NavigableString, Tag
soup2 = BeautifulSoup(htmls)
for br2 in soup2.findAll('br'):
    next = br2.previousSibling
    if not (next and isinstance(next,NavigableString)):
        continue
    next2 = next.previousSibling
    if next2 and isinstance(next2,Tag) and next2.name == 'br':
        text = str(next).strip()
        if text:
            print "Found:", next.encode('utf-8')
And the output it gives me is:
Found:
a) 31
Found:
b) 63
Found:
c) 12
Found:
d) 14
a) 31
Found:
b) 6
Found:
c) 12
Found:
d) 14
Found:
Any idea where I am going wrong?
Recommended answer
First of all, I would switch to BeautifulSoup version 4. BeautifulSoup 3 is very old and is no longer maintained:
Beautiful Soup 3 has been replaced by Beautiful Soup 4.
Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib. Once the beta period is over, you should use Beautiful Soup 4 for all new projects.
Run:
pip install beautifulsoup4
And change your import statement from:
from BeautifulSoup import BeautifulSoup
to:
from bs4 import BeautifulSoup
Now, what I would do here is locate the question text and get the br siblings that follow it, using find_next_siblings (http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-next-siblings-and-find-next-sibling). For every br sibling, get its next_sibling, which is the answer option. Working code:
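import re
from bs4 import BeautifulSoup

# data is assumed to hold the HTML from the question
soup = BeautifulSoup(data, "html5lib")  # using "html5lib" parser here
# each question is a text node that starts with a number and a dot, e.g. "1."
for question in soup.find_all(text=re.compile(r"^\d+\.")):
    # the text node right after each following <br> is an answer option
    answers = [br.next_sibling.strip() for br in question.find_next_siblings("br")]
    print(question)
    print(answers)
    print("------")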
有关问题中提供的样本HTML,它打印:
For the sample HTML provided in the question, it prints:
1. C99 standard guarantees uniqueness of ____ characters for internal names.
[u'a) 31', u'b) 63', u'c) 12', u'd) 14']
------
2. C99 standard guarantess uniqueness of _____ characters for external names.
[u'a) 31', u'b) 6', u'c) 12', u'd) 14']
------
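If what you want is literally the text in front of each br, as in the question's title, a minimal sketch along these lines may help. It is not part of the original answer. The SAMPLE_HTML string is a shortened, hypothetical copy of the question's markup, text_before_each_br is a name made up for illustration, and the stdlib "html.parser" backend is used here only so the sketch runs without html5lib.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample, shortened from the HTML in the question.
SAMPLE_HTML = """
<div class="entry-content">
<p>1. C99 standard guarantees uniqueness of ____ characters for internal names.<br>
a) 31<br>
b) 63<br>
c) 12<br>
d) 14</p>
</div>
"""


def text_before_each_br(html):
    """Return the stripped text node that directly precedes each <br>."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for br in soup.find_all("br"):
        prev = br.previous_sibling
        # Keep only non-empty text nodes; skip tags and missing siblings.
        if isinstance(prev, NavigableString) and prev.strip():
            found.append(prev.strip())
    return found


lines_before_br = text_before_each_br(SAMPLE_HTML)
for line in lines_before_br:
    print(line)
```

Unlike the code in the question, this does not also require the text node's own previous sibling to be a br, so the question line itself is kept. In this shortened sample the last option, "d) 14", is not followed by a br, so it is not returned; the next_sibling approach above picks it up.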
Note that you might need to install the html5lib library:
pip install html5lib
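If installing html5lib is not an option, one possible workaround is to fall back to the parser that ships with Python. The sketch below is an assumption on my part, not something from the original answer: make_soup is a hypothetical helper name, and it relies on bs4 raising FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with html5lib when it is installed, else with the stdlib parser."""
    try:
        return BeautifulSoup(html, "html5lib")
    except FeatureNotFound:
        # html5lib is not installed; "html.parser" ships with Python.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>1. Question<br>a) 31<br>b) 63</p>")
br_count = len(soup.find_all("br"))
print(br_count)
```

The two parsers can build slightly different trees for malformed markup, so it is worth checking the result against your real pages before relying on the fallback.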