How can I find a table after a text string using BeautifulSoup in Python?
Question
I am trying to extract data from several web pages that are not uniform in how they display their tables. I need to write code that searches for a text string and then goes to the table immediately following that string, so that I can extract the contents of that table. Here's what I've got so far:
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile('Table 1',re.IGNORECASE) # Also need to figure out how to ignore space
foundtext = soup.findAll('p',text=searchtext)
soupafter = foundtext.findAllNext()
table = soupafter.find('table') # find the next table after the search string is found
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print
However, I get the following error:
soupafter = foundtext.findAllNext()
AttributeError: 'ResultSet' object has no attribute 'findAllNext'
Is there an easy way to do this using BeautifulSoup?
Recommended answer
The error occurs because findAllNext is a method of Tag objects, but foundtext is a ResultSet object, which is a list of matching tags or strings. You could iterate through each of the tags in foundtext, but depending on your needs it might be sufficient to use find, which returns only the first matching tag.
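If every match matters rather than just the first, the iteration option can be sketched roughly as follows. This is only an illustration under assumptions that are not in the original post: it targets Python 3 and BeautifulSoup 4 (the bs4 package, with find_all and find_next instead of findAll and findNext), the markup is a shortened version of the sample HTML, and the helper names tables_after and table_rows are hypothetical.

```python
import re
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4, not the BeautifulSoup 3 used in the post

# Shortened version of the sample markup from the question.
html = (
    '<html><body>'
    '<p align="center"><b><font size="2">Table 1</font></b></p>'
    '<table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr></table>'
    '<p align="center"><b><font size="2">Table 2</font></b></p>'
    '<table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr></table>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")


def tables_after(soup, pattern):
    """Hypothetical helper: first <table> following each text node that matches pattern."""
    tables = []
    # find_all returns a ResultSet (a list), so loop over it and call
    # find_next on each individual element instead of on the list itself.
    for text_node in soup.find_all(string=pattern):
        table = text_node.find_next("table")
        if table is not None:
            tables.append(table)
    return tables


def table_rows(table):
    """Hypothetical helper: cell texts of a table as a list of rows."""
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]


heading = re.compile(r"Table\s+\d+", re.IGNORECASE)
results = [table_rows(t) for t in tables_after(soup, heading)]
for rows in results:
    for row in rows:
        print("|".join(row))
```

Searching by string alone, without a tag name, sidesteps the question of which element wraps the heading text, which may help when the pages are not uniform.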
Here's a modified version of your code. After changing foundtext to use soup.find, I found and fixed the same problem with table. I also modified your regex to ignore whitespace between the words:
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile(r'Table\s+1',re.IGNORECASE) # \s+ allows any run of whitespace between the words
foundtext = soup.find('p',text=searchtext) # Find the first match for the search text
table = foundtext.findNext('table') # Find the first <table> tag that follows it
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print
Output:
1. row 1, cell 1| 1. row 1, cell 2|
1. row 2, cell 1| 1. row 2, cell 2|
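To see what the whitespace-tolerant pattern accepts, here is a small standalone check that uses only the standard re module. The sample strings are made up for illustration and do not come from the pages in the question.

```python
import re

# \s+ matches one or more whitespace characters (spaces, tabs, newlines),
# and IGNORECASE makes the match case-insensitive.
searchtext = re.compile(r"Table\s+1", re.IGNORECASE)

samples = ["Table 1", "table   1", "TABLE\n1", "Table1", "Table 2"]
matches = [bool(searchtext.search(s)) for s in samples]

for sample, matched in zip(samples, matches):
    print(repr(sample), matched)
```

Note that "Table1" is rejected because \s+ requires at least one whitespace character; using \s* instead would also accept headings with no space at all.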