How do you get all the rows from a particular table using BeautifulSoup?
Problem description
I am learning Python and BeautifulSoup to scrape data from the web, and I want to read an HTML table. I can read the page into OpenOffice, which says the table is Table #11.
It seems like BeautifulSoup is the preferred choice, but can anyone tell me how to grab a particular table and all of its rows? I have looked at the module documentation, but can't get my head around it. Many of the examples that I have found online appear to do more than I need.
This should be pretty straightforward if you have a chunk of HTML to parse with BeautifulSoup. The general idea is to navigate to your table using the find_all method (findChildren is an older alias for the same method), then read the text inside each cell with the string property.
>>> # BeautifulSoup 4 is imported from the bs4 package (pip install beautifulsoup4)
>>> from bs4 import BeautifulSoup
>>>
>>> html = """
... <html>
... <body>
... <table>
... <tr><th>column 1</th><th>column 2</th></tr>
... <tr><td>value 1</td><td>value 2</td></tr>
... </table>
... </body>
... </html>
... """
>>>
>>> # Name the parser explicitly; html.parser ships with Python
>>> soup = BeautifulSoup(html, "html.parser")
>>> # find_all is the bs4 name; findChildren is an older alias for it
>>> tables = soup.find_all('table')
>>>
>>> # This will get the first (and only) table. Your page may have more.
>>> my_table = tables[0]
>>>
>>> # Every row, including the header row, is a <tr>
>>> rows = my_table.find_all('tr')
>>>
>>> for row in rows:
...     # You can find children with multiple tags by passing a list of strings
...     cells = row.find_all(['th', 'td'])
...     for cell in cells:
...         value = cell.string
...         print("The value in this cell is %s" % value)
...
The value in this cell is column 1
The value in this cell is column 2
The value in this cell is value 1
The value in this cell is value 2
>>>
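To pick out a specific table such as "Table #11", the same idea can be wrapped in a small helper. The following is only a sketch: the function name table_rows, its table_number parameter, and the sample HTML are made up for illustration, and it assumes that the number OpenOffice reports is 1-based and follows document order, which may not hold for every page (nested tables in particular can shift the count). It uses get_text rather than string because string returns None when a cell contains more than one child node, for example text mixed with a <b> tag.

```python
from bs4 import BeautifulSoup


def table_rows(html, table_number):
    """Return the rows of the table_number-th <table> (1-based) as lists of cell text.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")
    if not 1 <= table_number <= len(tables):
        raise IndexError(
            "page has %d table(s), asked for #%d" % (len(tables), table_number)
        )
    table = tables[table_number - 1]

    rows = []
    for row in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are both collected.
        cells = row.find_all(["th", "td"])
        # Join the text fragments of each cell with a space and trim whitespace.
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


# Made-up sample page with two tables; the second one is the one we want.
sample = """
<table><tr><td>skip me</td></tr></table>
<table>
<tr><th>column 1</th><th>column 2</th></tr>
<tr><td>value 1</td><td><b>value</b> 2</td></tr>
</table>
"""

for row in table_rows(sample, 2):
    print(row)
```

For the page in the question, the call would be table_rows(page_html, 11), where page_html is the downloaded HTML as a string. Note that find_all("tr") searches recursively, so rows of any table nested inside the chosen one are included as well.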