How can I find a table after a text string using BeautifulSoup in Python?

Problem description

I am trying to extract data from several web pages which are not uniform in how they display their tables. I need to write code that will search for a text string and then go to the table immediately following that specific text string. Then I want to extract the contents of that table. Here's what I've got so far:

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile('Table 1',re.IGNORECASE) # Also need to figure out how to ignore space
foundtext = soup.findAll('p',text=searchtext)
soupafter = foundtext.findAllNext()
table = soupafter.find('table') # find the next table after the search string is found
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
print

However, I get the following error:

    soupafter = foundtext.findAllNext()
AttributeError: 'ResultSet' object has no attribute 'findAllNext'

Is there an easy way to do this using BeautifulSoup?

Recommended answer

The error occurs because findAllNext is a method of Tag objects, but foundtext is a ResultSet, which is a list of matching tags or strings. You could iterate through each of the tags in foundtext, but depending on your needs it might be sufficient to use find, which returns only the first matching tag.
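If you do want every match rather than just the first, a loop over the result set might look like the sketch below. Note the assumptions: it is written for Python 3 and the newer bs4 package (where the methods are spelled find_all and find_next), not the BeautifulSoup 3 import used in the question, and the HTML snippet and the first_cells name are made up for illustration.

```python
import re
from bs4 import BeautifulSoup  # assumes the bs4 package is installed

# Simplified, made-up markup: two labelled tables.
html = (
    '<html><body>'
    '<p>Table 1</p><table><tr><td>a</td></tr></table>'
    '<p>Table 2</p><table><tr><td>b</td></tr></table>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet (a list subclass), so find_next has to be
# called on each element, not on the list itself.
first_cells = []
for match in soup.find_all(string=re.compile(r"Table\s+\d+", re.IGNORECASE)):
    table = match.find_next("table")  # first <table> after this text node
    if table is not None:
        first_cells.append(table.find("td").get_text())

print(first_cells)
```

Each element of the result set is a single text node, so the per-element call works where the call on the whole list raised AttributeError.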

Here's a modified version of your code. After changing foundtext to use soup.find, I found and fixed the same problem with table. I modified your regex to ignore whitespace between the words:

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile(r'Table\s+1',re.IGNORECASE) # \s+ matches any run of whitespace between the words
foundtext = soup.find('p',text=searchtext) # Find the first <p> tag with the search text
table = foundtext.findNext('table') # Find the first <table> tag that follows it
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print 

Output:

1. row 1, cell 1| 1. row 1, cell 2|
1. row 2, cell 1| 1. row 2, cell 2|
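
The code above targets Python 2 and BeautifulSoup 3. For anyone on Python 3, here is a rough sketch of the same idea using the bs4 package. This is my own adaptation, not part of the original answer: the helper name table_after_text is hypothetical, and it searches for the text node directly (string=...) instead of filtering on the p tag, because with the unclosed p tags in this sample the tag-plus-text filter may not match under html.parser.

```python
import re
from bs4 import BeautifulSoup  # assumes the bs4 package is installed


def table_after_text(html, pattern):
    """Return the rows of the first <table> that follows text matching pattern.

    Hypothetical helper for illustration; returns a list of rows, each a
    list of cell strings, or an empty list if nothing is found.
    """
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find(string=re.compile(pattern, re.IGNORECASE))
    if label is None:
        return []
    table = label.find_next("table")  # first <table> after the matched text
    if table is None:
        return []
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]


html = ('<html><body><p align="center"><b><font size="2">Table 1</font></b>'
        '<table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr>'
        '<tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table>'
        '<p align="center"><b><font size="2">Table 2</font></b>'
        '<table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr>'
        '<tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>')

rows = table_after_text(html, r"Table\s+2")
for row in rows:
    print("|".join(row))
```

Older bs4 releases spell the string= argument as text=, so adjust if your installed version complains.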
