beautifulSoup HTML CSV
Question
Good evening, I have used BeautifulSoup to extract some data from a website as follows:
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))
table = soup.findAll('table', attrs={ "class" : "table-horizontal-line"})
print table
This gives the following output:
[<table width="70%" class="table-horizontal-line">
<tr>
<th>Amount</th>
<th>Company or person fined</th>
<th>Date</th>
<th>What was the fine for?</th>
<th>Compensation</th>
</tr>
<tr>
<td><a name="_Hlk74714257" id="_Hlk74714257"> </a>£4,000,000</td>
<td><a href="/pages/library/communication/pr/2002/124.shtml">Credit Suisse First Boston International </a></td>
<td>19/12/02</td>
<td>Attempting to mislead the Japanese regulatory and tax authorities</td>
<td> </td>
</tr>
<tr>
<td>£750,000</td>
<td><a href="/pages/library/communication/pr/2002/123.shtml">Royal Bank of Scotland plc</a></td>
<td>17/12/02</td>
<td>Breaches of money laundering rules</td>
<td> </td>
</tr>
<tr>
<td>£1,000,000</td>
<td><a href="/pages/library/communication/pr/2002/118.shtml">Abbey Life Assurance Company ltd</a></td>
<td>04/12/02</td>
<td>Mortgage endowment mis-selling and other failings</td>
<td>Compensation estimated to be between £120 and £160 million</td>
</tr>
<tr>
<td>£1,350,000</td>
<td><a href="/pages/library/communication/pr/2002/087.shtml">Royal & Sun Alliance Group</a></td>
<td>27/08/02</td>
<td>Pension review failings</td>
<td>Redress exceeding £32 million</td>
</tr>
<tr>
<td>£4,000</td>
<td><a href="/pubs/final/ft-inv-ins_7aug02.pdf" target="_blank">F T Investment & Insurance Consultants</a></td>
<td>07/08/02</td>
<td>Pensions review failings</td>
<td> </td>
</tr>
<tr>
<td>£75,000</td>
<td><a href="/pubs/final/spe_18jun02.pdf" target="_blank">Seymour Pierce Ellis ltd</a></td>
<td>18/06/02</td>
<td>Breaches of FSA Principles ("skill, care and diligence" and "internal organization")</td>
<td> </td>
</tr>
<tr>
<td>£120,000</td>
<td><a href="/pages/library/communication/pr/2002/051.shtml">Ward Consultancy plc</a></td>
<td>14/05/02</td>
<td>Pension review failings</td>
<td> </td>
</tr>
<tr>
<td>£140,000</td>
<td><a href="/pages/library/communication/pr/2002/036.shtml">Shawlands Financial Services ltd</a> - formerly Frizzell Life & Financial Planning ltd)</td>
<td>11/04/02</td>
<td>Record keeping and associated compliance breaches</td>
<td> </td>
</tr>
<tr>
<td>£5,000</td>
<td><a href="/pubs/final/woodwards_4apr02.pdf" target="_blank">Woodward's Independent Financial Advisers</a></td>
<td>04/04/02</td>
<td>Pensions review failings</td>
<td> </td>
</tr>
</table>]
I would like to export this into CSV whilst keeping the table structure as displayed on the website. Is this possible, and if so, how?
Thanks in advance for your help.
Answer
Here is a basic approach you can try. It assumes that the headers are all in `<th>` tags and that all subsequent data is in `<td>` tags. This works for the single case you provided, but I'm sure adjustments will be necessary for other cases :) The general idea is that once you find your table (here using `find` to pull out the first one), we build the headers by iterating through all the `th` elements and storing their text in a list. Then we create a `rows` list that will contain one inner list per row; it is populated by finding all the `td` elements under each `tr` tag, taking their `text` and encoding it to UTF-8 (from Unicode). You then open a CSV file, write the headers first, and then write all of the rows, using `(row for row in rows if row)` to eliminate any blank rows (the header row yields an empty list because it contains no `td` elements):
In [117]: import csv
In [118]: from bs4 import BeautifulSoup
In [119]: from urllib2 import urlopen
In [120]: soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))
In [121]: table = soup.find('table', attrs={ "class" : "table-horizontal-line"})
In [122]: headers = [header.text for header in table.find_all('th')]
In [123]: rows = []
In [124]: for row in table.find_all('tr'):
.....: rows.append([val.text.encode('utf8') for val in row.find_all('td')])
.....:
In [125]: with open('output_file.csv', 'wb') as f:
.....: writer = csv.writer(f)
.....: writer.writerow(headers)
.....: writer.writerows(row for row in rows if row)
.....:
In [126]: cat output_file.csv
Amount,Company or person fined,Date,What was the fine for?,Compensation
" £4,000,000",Credit Suisse First Boston International ,19/12/02,Attempting to mislead the Japanese regulatory and tax authorities,
"£750,000",Royal Bank of Scotland plc,17/12/02,Breaches of money laundering rules,
"£1,000,000",Abbey Life Assurance Company ltd,04/12/02,Mortgage endowment mis-selling and other failings,Compensation estimated to be between £120 and £160 million
"£1,350,000",Royal & Sun Alliance Group,27/08/02,Pension review failings,Redress exceeding £32 million
"£4,000",F T Investment & Insurance Consultants,07/08/02,Pensions review failings,
"£75,000",Seymour Pierce Ellis ltd,18/06/02,"Breaches of FSA Principles (""skill, care and diligence"" and ""internal organization"")",
"£120,000",Ward Consultancy plc,14/05/02,Pension review failings,
"£140,000",Shawlands Financial Services ltd - formerly Frizzell Life & Financial Planning ltd),11/04/02,Record keeping and associated compliance breaches,
"£5,000",Woodward's Independent Financial Advisers,04/04/02,Pensions review failings,
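The session above targets Python 2 (`urllib2`, a binary-mode CSV file, and per-cell UTF-8 encoding). A minimal Python 3 sketch of the same idea, shown here against an inline HTML fragment rather than the original FSA URL (which may no longer resolve), could look like this; it assumes `bs4` is installed:

```python
import csv
from bs4 import BeautifulSoup

# A small stand-in for the page's fines table, so the sketch is self-contained.
html = """<table class="table-horizontal-line">
<tr><th>Amount</th><th>Company</th></tr>
<tr><td>£4,000,000</td><td>Credit Suisse First Boston International</td></tr>
<tr><td>£750,000</td><td>Royal Bank of Scotland plc</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "table-horizontal-line"})

# Same idea as the answer: headers from <th>, one list per <tr> from its <td>s.
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]

# Python 3: open in text mode with newline="" instead of "wb";
# no manual UTF-8 encoding of each cell is needed.
with open("output_file.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    # The header <tr> has no <td>s, so it yields an empty list and is skipped.
    writer.writerows(row for row in rows if row)
```

To fetch the live page instead of the inline fragment, you could replace `html` with `urllib.request.urlopen(url).read()`, the Python 3 counterpart of `urllib2.urlopen`.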