How can I find a table after a text string using BeautifulSoup in Python?

Question

I am trying to extract data from several web pages which are not uniform in how they display their tables. I need to write code that will search for a text string and then go to the table immediately following that specific text string. Then I want to extract the contents of that table. Here's what I've got so far:

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile('Table 1',re.IGNORECASE) # Also need to figure out how to ignore space
foundtext = soup.findAll('p',text=searchtext)
soupafter = foundtext.findAllNext()
table = soupafter.find('table') # find the next table after the search string is found
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
print

However, I get the following error:

    soupafter = foundtext.findAllNext()
AttributeError: 'ResultSet' object has no attribute 'findAllNext'

Is there an easy way to do this using BeautifulSoup?

Recommended answer

The error occurs because findAllNext is a method of Tag objects, but foundtext is a ResultSet object, which is a list of matching tags or strings. You could iterate through each of the tags in foundtext, but depending on your needs it might be sufficient to use find, which returns only the first matching tag.
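
The iteration option can be sketched as follows. This is a minimal sketch, assuming the newer bs4 package (BeautifulSoup 4) instead of the BeautifulSoup 3 import used in the question; the helper name tables_after_labels and the sample HTML are hypothetical and not part of the original post. In bs4, find_all returns a ResultSet (a list), so the code loops over it and calls find_next on each individual match rather than on the list.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page: two labelled tables.
SAMPLE_HTML = (
    "<html><body>"
    "<p><b>Table 1</b></p><table><tr><td>a</td><td>b</td></tr></table>"
    "<p><b>Table 2</b></p><table><tr><td>c</td><td>d</td></tr></table>"
    "</body></html>"
)

def tables_after_labels(html, pattern):
    """Map each matching text label to the rows of the next <table> after it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    # find_all returns a ResultSet (a list of matches), so iterate over it
    # and call find_next on each element, not on the list itself.
    for label in soup.find_all(string=re.compile(pattern, re.IGNORECASE)):
        table = label.find_next("table")
        if table is None:
            continue  # no table follows this label
        result[str(label).strip()] = [
            [td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")
        ]
    return result

tables = tables_after_labels(SAMPLE_HTML, r"Table\s+\d+")
```

Searching with string= alone matches the text nodes themselves, which avoids depending on how the parser nests the surrounding p, b, and font tags.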

Here's a modified version of your code. After changing foundtext to use soup.find, I found and fixed the same problem with table. I modified your regex to ignore whitespace between the words:
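
To illustrate the regex change on its own, here is a small sketch using only the standard re module; the sample strings are made up for the demonstration. The \s+ in the pattern accepts one or more whitespace characters (spaces, tabs, newlines) between the two words, while still requiring at least one.

```python
import re

# Same pattern as in the modified code: any run of whitespace between the words.
searchtext = re.compile(r'Table\s+1', re.IGNORECASE)

# Hypothetical inputs showing what the pattern accepts and rejects.
samples = ["Table 1", "table   1", "TABLE\n1", "Table1", "Table 2"]
matches = [s for s in samples if searchtext.search(s)]
```

Here matches contains the first three samples. "Table1" is rejected because \s+ needs at least one whitespace character; use \s* if a missing space should also match. Because search looks anywhere in the string, the pattern would also match "Table 10"; appending \b to the pattern prevents that.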

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile(r'Table\s+1',re.IGNORECASE)
foundtext = soup.find('p',text=searchtext) # Find the first <p> tag with the search text
table = foundtext.findNext('table') # Find the first <table> tag that follows it
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print 

This outputs:

1. row 1, cell 1| 1. row 1, cell 2|
1. row 2, cell 1| 1. row 2, cell 2|
