Scrape table with BeautifulSoup

Question

I have a table structure that looks like this:

<tr><td>
<td>
<td bgcolor="#E6E6E6" valign="top" align="left">testtestestes</td>
</tr>
<tr nowrap="nowrap" valign="top" align="left">
<td nowrap="nowrap">8-K</td>
<td class="small">Current report, items 1.01, 3.02, and 9.01
<br>Accession Number: 0001283140-16-000129 &nbsp;Act: 34 &nbsp;Size:&nbsp;520 KB
</td>
<td nowrap="nowrap">2016-09-19<br>17:30:01</td>
 <td nowrap="nowrap">2016-09-19</td><td align="left" nowrap="nowrap"><a href="/cgi-bin/browse-edgar?action=getcompany&amp;filenum=001-03473&amp;owner=include&amp;count=100">001-03473</a>
<br/>161891888</td></tr>

That is one row of data. This is my script using BeautifulSoup. I can get the <tr> and <td> elements just fine, but the cells of each <tr> end up in a separate list.

for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    print(tds)

My problem is how to get the data from the two separate <tr> elements and treat it as one row of data. I am trying to get the text between the <td> tags.

Recommended answer

If you want to pair them up, create an iterator from soup.find_all('tr') and zip it with itself, so that consecutive rows come out as pairs:

it = iter(soup.find_all('tr'))
for tr1, tr2 in zip(it, it):
    tds = tr1.find_all('td') + tr2.find_all('td')
    print(tds)
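
Since the question asks for the text between the <td> tags rather than the tag objects, the paired cells can be turned into plain strings with get_text. The following is a minimal sketch, not the answerer's code: the html string is a simplified, well-formed stand-in for the markup in the question, and the names rows and records are illustrative.

```python
from bs4 import BeautifulSoup

# Assumption: a simplified, well-formed version of the two rows from the question.
html = """
<table>
<tr><td>testtestestes</td></tr>
<tr><td>8-K</td><td>Current report<br>Accession Number: 0001283140-16-000129</td><td>2016-09-19</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The same iterator is passed twice, so each loop step consumes two consecutive rows.
rows = iter(soup.find_all("tr"))
records = []
for tr1, tr2 in zip(rows, rows):
    cells = tr1.find_all("td") + tr2.find_all("td")
    # get_text(" ", strip=True) joins the text fragments of a cell with a space
    # and strips surrounding whitespace from each fragment.
    records.append([td.get_text(" ", strip=True) for td in cells])

for record in records:
    print(record)
```

Each entry in records is then one logical row: the cell text of the first <tr> followed by the cell text of the second.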

The equivalent with slicing is to take two slices with a step of 2, one starting at index 0 and one starting at index 1:

it = soup.find_all('tr')
for tr1, tr2 in zip(it[::2], it[1::2]):
    tds = tr1.find_all('td') + tr2.find_all('td')
    print(tds)

Using iter means you don't need to shallow copy the list, which is what each of the two slices does.

It is not clear how an odd number of trs would fit into this logic, since the last row would have nothing to pair with, but if that can happen you can use zip_longest (izip_longest on Python 2):

from itertools import zip_longest  # Python 2: izip_longest

it = iter(soup.find_all('tr'))
for tr1, tr2 in zip_longest(it, it):
    # tr2 is None when the last row has no partner; keep tr1's cells anyway
    tds = tr1.find_all('td') + (tr2.find_all('td') if tr2 is not None else [])
    print(tds)
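
To compare how the three pairing approaches behave, the sketch below runs them on a plain list of five strings standing in for the result of soup.find_all('tr'). The list and the variable names are illustrative only; nothing here depends on BeautifulSoup.

```python
from itertools import zip_longest

# Stand-in for soup.find_all('tr'); an odd length exposes the edge case.
items = ["a", "b", "c", "d", "e"]

# Shared iterator: both zip arguments advance the same iterator,
# so each tuple holds two consecutive items and no copy is made.
it = iter(items)
pairs_iter = list(zip(it, it))

# Slicing: builds two new lists (shallow copies) before zipping.
pairs_slice = list(zip(items[::2], items[1::2]))

# zip silently drops the unpaired last item; zip_longest keeps it,
# filling the missing partner with None.
it = iter(items)
pairs_longest = list(zip_longest(it, it))

print(pairs_iter)
print(pairs_slice)
print(pairs_longest)
```

With an odd number of items, plain zip discards the trailing element in both the iterator and the slicing variant, which is why the zip_longest version has to handle a None partner.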
