beautifulSoup html csv


Question

Good evening, I have used BeautifulSoup to extract some data from a website as follows:

from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen

soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))

table = soup.findAll('table', attrs={ "class" : "table-horizontal-line"})

print table

This gives the following output:

[<table width="70%" class="table-horizontal-line">
<tr>
<th>Amount</th>
<th>Company or person fined</th>
<th>Date</th>
<th>What was the fine for?</th>
<th>Compensation</th>
</tr>
<tr>
<td><a name="_Hlk74714257" id="_Hlk74714257">&#160;</a>£4,000,000</td>
<td><a href="/pages/library/communication/pr/2002/124.shtml">Credit Suisse First Boston International </a></td>
<td>19/12/02</td>
<td>Attempting to mislead the Japanese regulatory and tax authorities</td>
<td>&#160;</td>
</tr>
<tr>
<td>£750,000</td>
<td><a href="/pages/library/communication/pr/2002/123.shtml">Royal Bank of Scotland plc</a></td>
<td>17/12/02</td>
<td>Breaches of money laundering rules</td>
<td>&#160;</td>
</tr>
<tr>
<td>£1,000,000</td>
<td><a href="/pages/library/communication/pr/2002/118.shtml">Abbey Life Assurance Company ltd</a></td>
<td>04/12/02</td>
<td>Mortgage endowment mis-selling and other failings</td>
<td>Compensation estimated to be between £120 and £160 million</td>
</tr>
<tr>
<td>£1,350,000</td>
<td><a href="/pages/library/communication/pr/2002/087.shtml">Royal &#38; Sun Alliance Group</a></td>
<td>27/08/02</td>
<td>Pension review failings</td>
<td>Redress exceeding £32 million</td>
</tr>
<tr>
<td>£4,000</td>
<td><a href="/pubs/final/ft-inv-ins_7aug02.pdf" target="_blank">F T Investment &#38; Insurance Consultants</a></td>
<td>07/08/02</td>
<td>Pensions review failings</td>
<td>&#160;</td>
</tr>
<tr>
<td>£75,000</td>
<td><a href="/pubs/final/spe_18jun02.pdf" target="_blank">Seymour Pierce Ellis ltd</a></td>
<td>18/06/02</td>
<td>Breaches of FSA Principles ("skill, care and diligence" and "internal organization")</td>
<td>&#160;</td>
</tr>
<tr>
<td>£120,000</td>
<td><a href="/pages/library/communication/pr/2002/051.shtml">Ward Consultancy plc</a></td>
<td>14/05/02</td>
<td>Pension review failings</td>
<td>&#160;</td>
</tr>
<tr>
<td>£140,000</td>
<td><a href="/pages/library/communication/pr/2002/036.shtml">Shawlands Financial Services ltd</a> - formerly Frizzell Life &#38; Financial Planning ltd)</td>
<td>11/04/02</td>
<td>Record keeping and associated compliance breaches</td>
<td>&#160;</td>
</tr>
<tr>
<td>£5,000</td>
<td><a href="/pubs/final/woodwards_4apr02.pdf" target="_blank">Woodward's Independent Financial Advisers</a></td>
<td>04/04/02</td>
<td>Pensions review failings</td>
<td>&#160;</td>
</tr>
</table>]

I would like to export this to CSV while keeping the table structure as displayed on the website. Is this possible, and if so, how?

Thanks in advance for your help.

Recommended answer

Here is a basic approach you can try. It assumes that the headers are all in <th> tags and that all subsequent data is in <td> tags. This works in the single case you provided, but I'm sure adjustments will be necessary in other cases :) The general idea is that once you find your table (here using find to pull the first one), we get the headers by iterating through all th elements and storing their text in a list. Then we create a rows list that will contain lists representing the contents of each row. This is populated by finding all td elements under each tr tag, taking the text, and encoding it in UTF-8 (from Unicode). You then open a CSV file, writing the headers first and then all of the rows, but using (row for row in rows if row) to eliminate any blank rows (the header row has no td cells, so it produces an empty list):

In [117]: import csv

In [118]: from bs4 import BeautifulSoup

In [119]: from urllib2 import urlopen

In [120]: soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))

In [121]: table = soup.find('table', attrs={ "class" : "table-horizontal-line"})

In [122]: headers = [header.text for header in table.find_all('th')]

In [123]: rows = []

In [124]: for row in table.find_all('tr'):
   .....:     rows.append([val.text.encode('utf8') for val in row.find_all('td')])
   .....: 

In [125]: with open('output_file.csv', 'wb') as f:
   .....:     writer = csv.writer(f)
   .....:     writer.writerow(headers)
   .....:     writer.writerows(row for row in rows if row)
   .....: 

In [126]: cat output_file.csv
Amount,Company or person fined,Date,What was the fine for?,Compensation
" £4,000,000",Credit Suisse First Boston International ,19/12/02,Attempting to mislead the Japanese regulatory and tax authorities, 
"£750,000",Royal Bank of Scotland plc,17/12/02,Breaches of money laundering rules, 
"£1,000,000",Abbey Life Assurance Company ltd,04/12/02,Mortgage endowment mis-selling and other failings,Compensation estimated to be between £120 and £160 million
"£1,350,000",Royal & Sun Alliance Group,27/08/02,Pension review failings,Redress exceeding £32 million
"£4,000",F T Investment & Insurance Consultants,07/08/02,Pensions review failings, 
"£75,000",Seymour Pierce Ellis ltd,18/06/02,"Breaches of FSA Principles (""skill, care and diligence"" and ""internal organization"")", 
"£120,000",Ward Consultancy plc,14/05/02,Pension review failings, 
"£140,000",Shawlands Financial Services ltd - formerly Frizzell Life & Financial Planning ltd),11/04/02,Record keeping and associated compliance breaches, 
"£5,000",Woodward's Independent Financial Advisers,04/04/02,Pensions review failings, 
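
The session above is Python 2 (urllib2, a file opened in 'wb' mode, and manual UTF-8 encoding). For readers on Python 3, the same idea might look roughly like the sketch below. It is not part of the original answer: the function names table_to_rows and write_csv are made up for illustration, the HTML string is an abbreviated copy of the table from the question so that no network access is needed, and stripping the surrounding whitespace (including the non-breaking spaces that show up as leading or trailing blanks in the output above) is an extra step the original does not perform.

```python
import csv

from bs4 import BeautifulSoup

# Abbreviated copy of the table from the question (two data rows only).
HTML = """
<table width="70%" class="table-horizontal-line">
<tr>
<th>Amount</th>
<th>Company or person fined</th>
<th>Date</th>
<th>What was the fine for?</th>
<th>Compensation</th>
</tr>
<tr>
<td><a name="_Hlk74714257" id="_Hlk74714257">&#160;</a>£4,000,000</td>
<td><a href="/pages/library/communication/pr/2002/124.shtml">Credit Suisse First Boston International </a></td>
<td>19/12/02</td>
<td>Attempting to mislead the Japanese regulatory and tax authorities</td>
<td>&#160;</td>
</tr>
<tr>
<td>£75,000</td>
<td><a href="/pubs/final/spe_18jun02.pdf" target="_blank">Seymour Pierce Ellis ltd</a></td>
<td>18/06/02</td>
<td>Breaches of FSA Principles ("skill, care and diligence" and "internal organization")</td>
<td>&#160;</td>
</tr>
</table>
"""


def table_to_rows(html):
    """Return (headers, rows) for the first 'table-horizontal-line' table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "table-horizontal-line"})
    headers = [th.get_text().strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        # str.strip() also removes non-breaking spaces (U+00A0).
        cells = [td.get_text().strip() for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return headers, rows


def write_csv(path, headers, rows):
    """Write the header line followed by the data rows to a CSV file."""
    # In Python 3 the csv module wants a text-mode file opened with
    # newline=''; the encoding is given here instead of encoding each cell.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
```

Calling write_csv('output_file.csv', *table_to_rows(HTML)) would then write the header line followed by the two data rows. To work from the live page instead of the inline string, urllib.request.urlopen is the Python 3 counterpart of urllib2.urlopen.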
