How do you get all the rows from a particular table using BeautifulSoup?

Question

I am learning Python and BeautifulSoup to scrape data from the web and read an HTML table. I can read the page into Open Office, which reports the table as Table #11.

It seems like BeautifulSoup is the preferred choice, but can anyone tell me how to grab a particular table and all the rows? I have looked at the module documentation, but can't get my head around it. Many of the examples that I have found online appear to do more than I need.

Solution

This should be pretty straightforward if you have a chunk of HTML to parse with BeautifulSoup. The general idea is to navigate to your table using the findChildren method (named find_all in the current bs4 package), then read the text inside each cell with the string property.

>>> # find_all is the current bs4 name for the older findChildren method
>>> from bs4 import BeautifulSoup
>>>
>>> html = """
... <html>
... <body>
...     <table>
...         <tr><th>column 1</th><th>column 2</th></tr>
...         <tr><td>value 1</td><td>value 2</td></tr>
...     </table>
... </body>
... </html>
... """
>>>
>>> # Name the parser explicitly so the result does not depend on what is installed
>>> soup = BeautifulSoup(html, "html.parser")
>>> tables = soup.find_all('table')
>>>
>>> # This will get the first (and only) table. Your page may have more.
>>> my_table = tables[0]
>>>
>>> rows = my_table.find_all('tr')
>>>
>>> for row in rows:
...     # You can match multiple tags by passing a list of strings
...     cells = row.find_all(['th', 'td'])
...     for cell in cells:
...         value = cell.string
...         print("The value in this cell is %s" % value)
...
The value in this cell is column 1
The value in this cell is column 2
The value in this cell is value 1
The value in this cell is value 2
>>>
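
Since the question asks about one specific table (Table #11), here is a minimal sketch of picking a table by position and collecting its rows as lists of strings. It assumes the bs4 package is installed and that the tables on the page are not nested. The helper name table_rows and the sample HTML are made up for illustration. It also assumes Open Office counts tables from 1, in which case Table #11 would be index 10; that is worth checking against your page. The sketch uses get_text rather than the string property because string returns None when a cell contains nested tags alongside other text.

```python
from bs4 import BeautifulSoup


def table_rows(html, table_index):
    """Return the rows of the table at table_index as lists of cell text.

    Hypothetical helper for illustration; table_index is zero-based.
    """
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")
    if not 0 <= table_index < len(tables):
        raise IndexError(
            "page has %d table(s), no index %d" % (len(tables), table_index)
        )
    table = tables[table_index]
    rows = []
    for row in table.find_all("tr"):
        # Header cells (th) and data cells (td) are both collected
        cells = row.find_all(["th", "td"])
        # A space separator keeps words apart when a cell has nested tags
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


# Made-up sample page with two tables
sample = """
<table><tr><td>first table</td></tr></table>
<table>
  <tr><th>column 1</th><th>column 2</th></tr>
  <tr><td>value 1</td><td><b>value</b> 2</td></tr>
</table>
"""

for row in table_rows(sample, 1):
    print(row)
```

For a real page you would pass the downloaded HTML and something like table_rows(page_html, 10), adjusting the index if the count from Open Office turns out not to match the order of table tags in the document.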
