Is there a clean way to get the n-th column of an HTML table using BeautifulSoup?

Question

Say we look at the first table in a page, so:

table = BeautifulSoup(...).table

the rows can be scanned with a clean for-loop:

for row in table:
    f(row)

But for getting a single column things get messy.

My question: is there an elegant way to extract a single column, either by its position, or by its 'name' (i.e. text that appears in the first row of this column)?

Recommended answer

lxml is many times faster than BeautifulSoup, so you might want to use that.

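# Requires lxml plus the cssselect package
from lxml.html import parse
doc = parse('http://python.org').getroot()
for row in doc.cssselect('table > tr'):
    # nth-child is 1-based, so this selects the third cell of each row
    for cell in row.cssselect('td:nth-child(3)'):
        print(cell.text_content())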

Or, instead of looping:

rows = doc.cssselect('table > tr')
# cssselect() is a method of each row element, not of the list of rows
cells = [cell.text_content() for row in rows for cell in row.cssselect('td:nth-child(3)')]
print(cells)
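
The question also asks about BeautifulSoup itself and about selecting a column by its name, so here is a minimal sketch of both lookups with bs4. It is only an illustration under some assumptions: the helper names (table_rows, column_by_index, column_by_name) and the sample HTML are made up for this example, the table has no nested tables and no colspan/rowspan cells, and the first row holds the column names.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data, not taken from any real page
HTML = """
<table>
  <tr><th>Name</th><th>Lang</th><th>Year</th></tr>
  <tr><td>Guido</td><td>Python</td><td>1991</td></tr>
  <tr><td>Matz</td><td>Ruby</td><td>1995</td></tr>
</table>
"""

def table_rows(table):
    """Return each row as a list of cell texts (th and td alike)."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]

def column_by_index(table, n):
    """Return the n-th column (0-based), skipping rows that are too short."""
    return [row[n] for row in table_rows(table) if len(row) > n]

def column_by_name(table, name):
    """Return the cells below the first-row cell whose text equals name."""
    rows = table_rows(table)
    n = rows[0].index(name)  # raises ValueError if the name is not in the first row
    return [row[n] for row in rows[1:] if len(row) > n]

table = BeautifulSoup(HTML, "html.parser").table
print(column_by_index(table, 2))      # ['Year', '1991', '1995']
print(column_by_name(table, "Lang"))  # ['Python', 'Ruby']
```

Note that column_by_index counts from 0 and includes the header row, whereas the CSS selector td:nth-child(3) counts from 1 and matches only td cells.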
