BeautifulSoup HTML table parsing: only able to get the last row?

Problem description

I have a simple HTML table to parse, but somehow BeautifulSoup only gives me results from the last row. Could someone take a look and see what's wrong? This is the HTML table I am working from:

 <table class='participants-table'>
    <thead>
      <tr>
          <th data-field="name" class="sort-direction-toggle name">Name</th>
          <th data-field="type" class="sort-direction-toggle type active-sort asc">Type</th>
          <th data-field="sector" class="sort-direction-toggle sector">Sector</th>
          <th data-field="country" class="sort-direction-toggle country">Country</th>
          <th data-field="joined_on" class="sort-direction-toggle joined-on">Joined On</th>
      </tr>
    </thead>
    <tbody>
        <tr>
          <th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
          <td class='type'>Company</td>
          <td class='sector'>General Industrials</td>
          <td class='country'>Netherlands</td>
          <td class='joined-on'>2000-09-20</td>
        </tr>
        <tr>
          <th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
          <td class='type'>Company</td>
          <td class='sector'>Pharmaceuticals &amp; Biotechnology</td>
          <td class='country'>Portugal</td>
          <td class='joined-on'>2004-02-19</td>
        </tr>
    </tbody>
  </table>

I use the following code to get the rows:

table=soup.find_all("table", class_="participants-table")
table1=table[0]
rows=table1.find_all('tr')
rows=rows[1:]

which gives:

rows=[<tr>
 <th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
 <td class="type">Company</td>
 <td class="sector">General Industrials</td>
 <td class="country">Netherlands</td>
 <td class="joined-on">2000-09-20</td>
 </tr>, <tr>
 <th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
 <td class="type">Company</td>
 <td class="sector">Pharmaceuticals &amp; Biotechnology</td>
 <td class="country">Portugal</td>
 <td class="joined-on">2004-02-19</td>
 </tr>]

That looks as expected. However, if I continue:

for row in rows:
    cells = row.find_all('th')

I only get the last entry!

cells=[<th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>]

What is going on? This is my first time using BeautifulSoup, and what I'd like to do is export this table to CSV. Any help is greatly appreciated! Thanks.

Recommended answer

If you want all the th tags in a single list, you need to extend a list as you go. As written, the loop keeps reassigning cells = row.find_all('th'), so when you print cells after the loop you only see the value it was last assigned, i.e. the th from the last tr:

cells = []
for row in rows:
    cells.extend(row.find_all('th'))
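
To make the difference concrete, here is a minimal self-contained sketch. The html string is a trimmed copy of the table from the question, and the names last_only and all_cells are purely illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (assumed input).
html = """
<table class='participants-table'>
  <thead>
    <tr><th>Name</th><th>Type</th></tr>
  </thead>
  <tbody>
    <tr><th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th><td class='type'>Company</td></tr>
    <tr><th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th><td class='type'>Company</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Skip the header row, as in the question.
rows = soup.find("table", class_="participants-table").find_all("tr")[1:]

# Reassigning: each iteration overwrites the previous result.
for row in rows:
    last_only = row.find_all("th")

# Accumulating: extend() keeps the cells from every row.
all_cells = []
for row in rows:
    all_cells.extend(row.find_all("th"))

print([th.get_text() for th in last_only])   # only the last row's cell
print([th.get_text() for th in all_cells])   # one cell per row
```

The first print shows a single name and the second shows both, which is exactly the behaviour described in the question.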

Also, since there is only one table, you can just use find:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", class_="participants-table")
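
One thing worth keeping in mind when switching from find_all to find: find returns the first matching tag directly, or None when nothing matches, whereas find_all always returns a list-like result. A small sketch, where the one-row html string and the class name no-such-table are made up for illustration:

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (assumed input, not the full table from the question).
html = "<table class='participants-table'><tr><th>Name</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag, so no [0] indexing is needed.
table = soup.find("table", class_="participants-table")

# When nothing matches, find() returns None rather than an empty list.
missing = soup.find("table", class_="no-such-table")  # hypothetical class name

# Guard against None before calling methods on the result.
rows = table.find_all("tr") if table is not None else []

print(table.name)
print(missing)
print(len(rows))
```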

If you want to skip the thead row, you can use a CSS selector:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# The data rows sit inside <tbody>, so select those and leave out the <thead> row
rows = soup.select("table.participants-table tbody tr")

cells = [tr.th for tr in rows]
print(cells)

cells will then contain:

[<th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>, <th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>]
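
Once you have the th tags, you will usually want their text and link rather than the tags themselves. A possible sketch, assuming the data rows are wrapped in tbody as in the question's HTML; the html string is a trimmed copy and names_and_links is an illustrative name:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (assumed input).
html = """
<table class='participants-table'>
  <thead><tr><th>Name</th></tr></thead>
  <tbody>
    <tr><th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th></tr>
    <tr><th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every tr inside the table's tbody.
rows = soup.select("table.participants-table tbody tr")

# For each row, take the visible name and the href of the nested link.
names_and_links = [(tr.th.get_text(strip=True), tr.th.a["href"]) for tr in rows]
print(names_and_links)
```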

To write the whole table to CSV:

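import csv

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

rows = soup.select("table.participants-table tr")

# newline="" lets the csv module control line endings (avoids blank lines on Windows)
with open("data.csv", "w", newline="") as out:
    wr = csv.writer(out)
    # Header row: the column names plus an extra URL column
    wr.writerow([th.text for th in rows[0].find_all("th")] + ["URL"])

    for row in rows[1:]:
        # Take only the th/td cells; a bare find_all() would also return the nested <a>
        wr.writerow([tag.text for tag in row.find_all(["th", "td"])] + [row.th.a["href"]])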

which for your sample will give you:

Name,Type,Sector,Country,Joined On,URL
Grontmij,Company,General Industrials,Netherlands,2000-09-20,/what-is-gc/participants/4479-Grontmij
Groupe Bial,Company,Pharmaceuticals & Biotechnology,Portugal,2004-02-19,/what-is-gc/participants/4492-Groupe-Bial
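
If you would like to check the CSV logic without creating a file, one option is to write into an in-memory buffer instead. This is only a sketch under a few assumptions: the html string is a copy of the question's table with the header attributes trimmed, and io.StringIO stands in for the output file:

```python
import csv
import io

from bs4 import BeautifulSoup

# Copy of the table from the question with header attributes trimmed (assumed input).
html = """
<table class='participants-table'>
  <thead>
    <tr><th>Name</th><th>Type</th><th>Sector</th><th>Country</th><th>Joined On</th></tr>
  </thead>
  <tbody>
    <tr>
      <th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
      <td class='type'>Company</td>
      <td class='sector'>General Industrials</td>
      <td class='country'>Netherlands</td>
      <td class='joined-on'>2000-09-20</td>
    </tr>
    <tr>
      <th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
      <td class='type'>Company</td>
      <td class='sector'>Pharmaceuticals &amp; Biotechnology</td>
      <td class='country'>Portugal</td>
      <td class='joined-on'>2004-02-19</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tr")

# In-memory text buffer used in place of a real file.
buf = io.StringIO()
wr = csv.writer(buf)

# Header row: column names plus an extra URL column.
wr.writerow([th.get_text() for th in rows[0].find_all("th")] + ["URL"])

for row in rows[1:]:
    # Take only the th/td cells so the text of the nested <a> is not repeated.
    cells = [tag.get_text() for tag in row.find_all(["th", "td"])]
    wr.writerow(cells + [row.th.a["href"]])

csv_lines = buf.getvalue().splitlines()
for line in csv_lines:
    print(line)
```

Swapping the buffer for open("data.csv", "w", newline="") writes the same rows to disk.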
