How can I find a table after a text string using BeautifulSoup in Python?
Problem description
I am trying to extract data from several web pages which are not uniform in how they display their tables. I need to write code that will search for a text string and then go to the table immediately following that specific text string. Then I want to extract the contents of that table. Here's what I've got so far:
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re
html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile('Table 1',re.IGNORECASE) # Also need to figure out how to ignore space
foundtext = soup.findAll('p',text=searchtext)
soupafter = foundtext.findAllNext()
table = soupafter.find('table') # find the next table after the search string is found
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print
However, I get the following error:
soupafter = foundtext.findAllNext()
AttributeError: 'ResultSet' object has no attribute 'findAllNext'
Is there an easy way to do this using BeautifulSoup?
Recommended answer
The error occurs because findAllNext is a method of Tag objects, but foundtext is a ResultSet object, which is a list of matching tags or strings. You could iterate through each of the tags in foundtext, but depending on your needs it might be sufficient to use find, which returns only the first matching tag.
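The iteration option can be sketched as follows. This is a minimal illustration, not the answerer's code: it assumes Python 3 and the newer bs4 package (where the methods are spelled find_all and find_next) rather than the BeautifulSoup 3 import used in the question, and the shortened HTML and the name first_cells are made up for the example.

```python
import re
from bs4 import BeautifulSoup  # assumption: the bs4 package, not BeautifulSoup 3

# Shortened, hypothetical markup with two labelled tables.
html = (
    '<html><body>'
    '<p><b>Table 1</b></p><table><tr><td>a</td></tr></table>'
    '<p><b>Table 2</b></p><table><tr><td>b</td></tr></table>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet (a list), so loop over it and call find_next
# on each individual match instead of on the list itself.
matches = soup.find_all(string=re.compile(r"Table\s+\d+", re.IGNORECASE))

first_cells = []
for match in matches:
    table = match.find_next("table")  # first <table> after this match
    if table is not None:
        first_cells.append(table.find("td").get_text())

print(first_cells)
```

Each element of the ResultSet supports the navigation methods, which is why the call works inside the loop but not on the list as a whole.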
Here's a modified version of your code. After changing foundtext to use soup.find, I found and fixed the same problem with table. I also modified your regex so that it matches any amount of whitespace between the words:
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re
html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile(r'Table\s+1',re.IGNORECASE)
foundtext = soup.find('p',text=searchtext) # Find the first <p> tag with the search text
table = foundtext.findNext('table') # Find the first <table> tag that follows it
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print
This outputs:
1. row 1, cell 1| 1. row 1, cell 2|
1. row 2, cell 1| 1. row 2, cell 2|
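The code in this answer targets Python 2 and BeautifulSoup 3. For readers on Python 3, a rough equivalent might look like the sketch below. It assumes the bs4 package and the built-in "html.parser" backend; the helper name table_after_text is hypothetical and not part of any library. It searches for the matching text node directly (rather than for a <p> tag) because the unclosed <p> tags in the sample markup can be parsed differently by different backends.

```python
import re
from bs4 import BeautifulSoup  # assumption: the bs4 package on Python 3

html = (
    '<html><body><p align="center"><b><font size="2">Table 1</font></b>'
    '<table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr>'
    '<tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table>'
    '<p align="center"><b><font size="2">Table 2</font></b>'
    '<table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr>'
    '<tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>'
)


def table_after_text(markup, pattern):
    """Return the cell texts of the first table that follows matching text.

    Hypothetical helper for illustration: the result is a list of rows,
    each row a list of cell strings, or an empty list if nothing matches.
    """
    soup = BeautifulSoup(markup, "html.parser")
    found = soup.find(string=pattern)   # first text node matching the regex
    if found is None:
        return []
    table = found.find_next("table")    # first <table> after that text node
    if table is None:
        return []
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]


rows = table_after_text(html, re.compile(r"Table\s+1", re.IGNORECASE))
for row in rows:
    print("|".join(row))
```

Returning the rows as nested lists instead of printing inside the loop keeps the extraction reusable, for example for writing the cells to a CSV file.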