How to extract tables from websites in Python


Question

Here,

http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500

there is a table. My goal is to extract the table and save it to a CSV file. I wrote this code:

import urllib
import os

web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")

s = web.read()
web.close()

ff = open(r"D:\ex\python_ex\urllib\output.txt", "w")
ff.write(s)
ff.close()

I'm lost from here. Can anyone help with this? Thanks!

Recommended answer

So essentially you want to parse the HTML file to get elements out of it. You can use BeautifulSoup or lxml for this task.

You already have solutions using BeautifulSoup. I'll post a solution using lxml:

from lxml import etree
import urllib.request

web = urllib.request.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()

html = etree.HTML(s)

## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')

## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]

## Get text from all the remaining 'tr' rows
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
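
The snippet above stops once the header and cell text are in Python lists; the remaining step from the question is saving them to a CSV file. Below is a minimal sketch of that step using the standard csv module. The function name save_table, the path output.csv, and the sample values are hypothetical placeholders, not data from the report page; in practice you would pass in the header and td_content lists built above.

```python
import csv


def save_table(header, rows, path):
    """Write one header row followed by the data rows to a CSV file."""
    # newline="" lets the csv module control line endings
    # (this avoids blank lines between rows on Windows).
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in rows:
            # lxml's .text is None for an empty cell, or when the cell's
            # text sits inside a child element, so map None to "".
            writer.writerow(["" if cell is None else cell.strip() for cell in row])


# Hypothetical stand-ins for the `header` and `td_content` lists.
sample_header = ["Column A", "Column B"]
sample_rows = [["a1", " b1 "], ["a2", None]]

save_table(sample_header, sample_rows, "output.csv")
```

If some cells come back empty even though the page shows text in them, the text is probably inside a nested element; in that case, collecting the cell text with td.xpath("string()") instead of td.text may work better.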
