Extracting selected columns from a table using BeautifulSoup
Problem description
I am trying to extract the first and third columns of this data table using BeautifulSoup. Looking at the HTML, the cells in the first column use a <th> tag, and the other column of interest uses a <td> tag. So far, all I have been able to get is a list of the column's cells with the tags still attached, but I only want the text. table is already a list, so I can't call findAll(text=True) on it, and I'm not sure how to get the first column's contents in another form.
from BeautifulSoup import BeautifulSoup
from sys import argv
import re
filename = argv[1]
# Read the HTML file into a string
html_doc = ''.join(open(filename,'r').readlines())
soup = BeautifulSoup(html_doc)
# The relevant table is the first one
table = soup.findAll('table')[0].tbody.th.findAll('th')
print table
Recommended answer
You can try this code:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())
for row in soup.findAll('table')[0].tbody.findAll('tr'):
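    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column
As you can see, the code connects to the URL and fetches the HTML. BeautifulSoup then finds the first table, iterates over all of its 'tr' rows, and selects the first column, which is the 'th' cell, and the third column, which is a 'td' cell.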
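For readers on Python 3, the same row-by-row approach can be sketched with the newer bs4 package. This is only an illustration under stated assumptions: SAMPLE_HTML is hypothetical markup standing in for the real page, extract_columns is a made-up helper name, and get_text(strip=True) is used in place of .contents so that plain text is returned rather than a list of child nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page:
# one <th> row header followed by three <td> cells per row.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Row A</th><td>1.0</td><td>2.0</td><td>3.0</td></tr>
    <tr><th>Row B</th><td>4.0</td><td>5.0</td><td>6.0</td></tr>
  </tbody>
</table>
"""


def extract_columns(html):
    """Return (row header text, text of the third <td>) for each row of the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for row in table.find_all("tr"):
        header = row.find("th")
        cells = row.find_all("td")
        # Skip rows that lack a <th> or have fewer than three <td> cells
        if header is None or len(cells) < 3:
            continue
        rows.append((header.get_text(strip=True), cells[2].get_text(strip=True)))
    return rows


if __name__ == "__main__":
    for name, value in extract_columns(SAMPLE_HTML):
        print(name, value)
```

The length check is a defensive choice for tables whose header or footer rows have a different shape. Drop it if every row is known to have the same layout.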