Extracting selected columns from a table using BeautifulSoup


Question

I am trying to extract the first and third columns of this data table using BeautifulSoup. Looking at the HTML, the first column has a <th> tag, and the other column of interest has a <td> tag. So far, all I have been able to get is a list of the column's elements with their tags still attached, but I just want the text.

table is already a list, so I can't call findAll(text=True) on it. I'm not sure how to get the contents of the first column in another form.

from BeautifulSoup import BeautifulSoup
from sys import argv
import re

filename = argv[1] #get HTML file as a string
html_doc = ''.join(open(filename,'r').readlines())
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody.th.findAll('th') #The relevant table is the first one

print table

Recommended answer

You can try this code:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())

for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column

As you can see, the code connects to the URL and fetches the HTML. BeautifulSoup then finds the first table and loops over all of its 'tr' rows, selecting the first column, which is the 'th' cell, and the third column, which is a 'td' cell.
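
Since the question asks for the text only, here is a minimal sketch of the same row-by-row approach that returns plain strings instead of tag contents. It is not part of the original answer and rests on several assumptions: it uses the newer bs4 package (BeautifulSoup 4) on Python 3 rather than the BeautifulSoup 3 import shown above, it parses a small made-up table (SAMPLE_HTML) instead of fetching the real page, and the helper name extract_columns is hypothetical. The index [2] simply mirrors the answer's choice of 'td' cell.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 on Python 3

# Hypothetical sample data standing in for the real page's first table.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Alabama</th><td>1.0</td><td>2.0</td><td>3.0</td></tr>
    <tr><th>Alaska</th><td>4.0</td><td>5.0</td><td>6.0</td></tr>
  </tbody>
</table>
"""


def extract_columns(html):
    """Return (th text, third td text) pairs for each row of the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # the relevant table is the first one
    pairs = []
    for row in table.find_all("tr"):
        header = row.find("th")
        cells = row.find_all("td")
        # Skip rows that lack a <th> or have fewer than three <td> cells.
        if header is None or len(cells) < 3:
            continue
        # get_text() drops the markup and keeps only the text content.
        pairs.append((header.get_text(strip=True), cells[2].get_text(strip=True)))
    return pairs


result = extract_columns(SAMPLE_HTML)
for first_column, third_column in result:
    print(first_column, third_column)
```

Calling get_text() on each element avoids the problem described in the question: the result of find_all() is a list, so text has to be extracted per element rather than from the list as a whole.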
