使用BeautifulSoup刮擦桌子 [英] Scraping a table using BeautifulSoup

查看:84
本文介绍了使用BeautifulSoup刮擦桌子的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个我很直截了当的问题.我有以下类型的页面,我希望从该页面中收集上一张表中的信息(如果一直向下滚动,则是过程"框中的那个页面):

I have a question which i suspect is fairly straight forward. I have the following type of page from which I want to collect the information in the last table (if you scroll all the way down it is the one in the box labelled "Procedure"):

我要抓取的表格的html如下所示:

The html for the table I want to scrape looks like this:

<tbody><tr class="doc_title">
<td style="background-image: url(&quot;/img/struct/navigation/gradient_blue.gif&quot;);" align="left" valign="top"><img src="/img/struct/functional/arrow_title_doc.gif" alt="" align="absmiddle" border="0" height="14" width="8"> <span style="font-weight: bold;">PROCEDURE</span></td><td style="background-image: url(&quot;/img/struct/navigation/gradient_blue.gif&quot;);" align="right" valign="top">
<table border="0" cellpadding="3" cellspacing="0" width="50">
<tbody><tr><td align="center"><a href="#top"><img src="/img/struct/functional/top_doc.gif" alt="" border="0" height="16" width="16"></a></td><td align="center"><img src="/img/struct/navigation/spacer.gif" alt="" border="0" height="10" width="15"></td><td align="center"><a href="#title2"><img src="/img/struct/functional/sort_up.gif" alt="" border="0" height="10" width="15"></a></td></tr></tbody></table></td></tr>

<tr class="contents" valign="top"><td colspan="2">
<p></p><table style="border-collapse: collapse; width: 481.85pt;" align="center" cellspacing="0">
<tbody><tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Title</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">Mutual assistance for the recovery of claims relating to taxes, duties and other measures</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">References</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style=""><a href="http://ec.europa.eu/prelex/liste_resultats.cfm?CL=en&amp;ReqId=0&amp;DocType=COM&amp;DocYear=2009&amp;DocNum=0028">COM(2009)0028</a> – C6-0061/2009 – <a href="/oeil/FindByProcnum.do?lang=en&amp;procnum=CNS/2009/0007">2009/0007(CNS)</a></p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Date of consulting Parliament</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">16.2.2009</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Committee responsible</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date announced in plenary</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">ECON</p>

<p style="">19.10.2009</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0pt 0.75pt; border-style: solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Committee(s) asked for opinion(s)</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date announced in plenary</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">CONT</p>

<p style="">19.10.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">JURI</p>

<p style="">19.10.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0pt 0pt; border-style: solid solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Not delivering opinions</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date of decision</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">CONT</p>

<p style="">1.10.2009</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">JURI</p>

<p style="">5.10.2009</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Rapporteur(s)</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date appointed</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="3">
<p style="">Theodor Dumitru Stolojan</p>

<p style="">21.7.2009</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0pt 0.75pt; border-style: solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Discussed in committee</span></p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">10.11.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">1.12.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">21.1.2010</p>
</td>
<td style="border-width: 0.75pt 1pt 0pt 0pt; border-style: solid solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Date adopted</span></p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">27.1.2010</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Result of final vote</span></p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 12.94%;" rowspan="1" colspan="1">
<p style="">+:</p>

<p style="">–:</p>

<p style="">0:</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 48.82%;" rowspan="1" colspan="6">
<p style="">39</p>

<p style="">0</p>

<p style="">1</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Members present for the final vote</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">Burkhard Balz, Sharon Bowles, Udo Bullmann, Pascal Canfin, Nikolaos Chountis, George Sabin Cutaş, Leonardo Domenici, Derk Jan Eppink, Markus Ferber, Elisa Ferreira, Vicky Ford, José Manuel García-Margallo y Marfil, Jean-Paul Gauzès, Sylvie Goulard, Enikő Győri, Liem Hoang Ngoc, Eva Joly, Othmar Karas, Wolf Klinz, Jürgen Klute, Werner Langen, Astrid Lulling, Arlene McCarthy, Ivari Padar, Alfredo Pallone, Anni Podimata, Antolín Sánchez Presedo, Olle Schmidt, Edward Scicluna, Peter Simon, Peter Skinner, Theodor Dumitru Stolojan, Ivo Strejček, Kay Swinburne, Marianne Thyssen, Ramon Tremosa i Balcells</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-left: 0.75pt solid rgb(0, 0, 0); border-right: 1pt solid rgb(0, 0, 0); border-top: 0.75pt solid rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Substitute(s) present for the final vote</span></p>
</td>
<td style="border-left: 0.75pt solid rgb(0, 0, 0); border-right: 1pt solid rgb(0, 0, 0); border-top: 0.75pt solid rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">Marta Andreasen, Sophie Briard Auconie, David Casa, Danuta Jazłowiecka, Arturs Krišjānis Kariņš, Philippe Lamberts, Andreas Schwab</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 38.24%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 12.94%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 2.94%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 4.71%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 10.58%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 10%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 5.29%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 15.3%;" rowspan="1" colspan="1"></td>
<td style="" rowspan="1" colspan="1"></td></tr>
</tbody></table>
</td></tr>
</tbody>

我面临的问题是表格的标签没有标识符(据我所知),所以我不知道如何选择该表格并从中获取信息.到目前为止,我一直在使用BeautifilSoup来从网站上获取其他信息,但是我对如何刮擦这张桌子一无所知.

The problem I am facing is that the tags for the tables do not have identifiers (as far as I can tell), so I dont know how to select this one table and scrape the information from it. I have been using BeautifilSoup so far to get other information from the website, but I am at a loss for how to scrape this one table.

如果有人可以告诉我如何进行操作,我将不胜感激!

If anyone can show me how to proceed I would ne most grateful!

以诚挚的问候,

托马斯

推荐答案

如果您比较聪明,可以通过其他属性查找元素.我是在抓取您的数据时拍摄的,这可能不是最好的–但是,它可以让您靠近.

You can find elements by other attributes if you're a bit clever. I took this shot at scraping your data, and it probably isn't the best – but, it gets you close.

我注意到的第一件事是您肯定在第二次出现"PROCEDURE"一词之后才想要数据(第一个是链接,第二个是标题).因此,我对此进行了划分:

The first thing I noticed was you definitely wanted data after the second appearance of the word "PROCEDURE" (first being the link, second being the header). So, I split on that:

data = html.split("PROCEDURE", 2)[2]

然后,我用rowspan=1寻找<td>标记:

bs = BeautifulSoup.BeautifulSoup(data)
tds = bs.findAll("td", { "rowspan": 1 })

越来越近...

>>> tds[0].text
u'Title'
>>> tds[1].text
u'Mutual assistance for the recovery of claims relating to taxes, duties and other measures'
>>> tds[3].text
u'References'
>>> tds[4].text
u'COM(2009)00282009/0007(CNS)2009 a>'

请注意,我跳过了tds中的索引2,因为它们使用了分隔符或其他内容(它是空的).无论如何,这是一个开始.我在BeautifulSoup中发现的真正诀窍是仅将数据提供给您知道要查找的区域,因为这样一来,浏览的内容就更少了.它也以接受糟糕的输入而感到自豪,因此不要害怕将其作为垃圾.

Note that I skipped index 2 in tds, since they use a spacer or something (it's empty). Anyway, that's a start. The real trick I found with BeautifulSoup was to only feed it the data in the area you know you're looking for, because then there's less to skim through. It prides itself on accepting bad-looking input, too, so don't be afraid to feed it garbage.

我在元素列表中走得更远,但这并不完美.您需要优化搜索,因为它们在<td> s内具有<td>个元素作为值.

I went a bit further in the list of elements, and it isn't perfect. You'll need to refine the search, since they have <td> elements within <td>s for the values.

这篇关于使用BeautifulSoup刮擦桌子的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆