Beautiful Soup captures null values in a table
Question
For the following piece of HTML, I used Beautiful Soup to capture the table information:
<table>
<tr>
<td><b>Code</b></td>
<td><b>Display</b></td>
</tr>
<tr>
<td>min</td>
<td>Minute</td><td/>
</tr>
<tr>
<td>happy </td>
<td>Hour</td><td/>
</tr>
<tr>
<td>daily </td>
<td>Day</td><td/>
</tr>
</table>
Here is my code:
comments = [td.get_text() for td in table.findAll("td")]
Comments=[data.encode('utf-8') for data in comments]
As you can see, this table has two headers, "Code" and "Display", and some values in the rows. The expected output of my code is [Code, Display, min, Minute, happy, Hour, daily, Day]
But this is the output:
['Code', 'Display', 'min', 'Minute', '', 'happy ',
'Hour', '', 'daily ', 'Day', '']
The output has '' at the 5th, 8th, and 11th positions in comments, and these values are not defined in the table. I think it may be because of the </td><td/> sequences.
How can I change the code so that it does not capture u'' in the output?
Recommended answer
Sorry, I hadn't read your question carefully enough. You're right, the problem is the empty <td/> tags. Just adjust your list comprehension to only include cells with text:
comments = [td.get_text() for td in table.findAll('td') if td.text]
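For reference, here is a minimal self-contained sketch of that filter applied to the table from the question. It assumes the bs4 package with the built-in "html.parser" backend, and it additionally passes strip=True to get_text() (an addition not in the line above) so that values such as 'happy ' lose their trailing space and whitespace-only cells are dropped as well.

```python
from bs4 import BeautifulSoup

# The table from the question, with the closing </table> tag added.
html = """
<table>
<tr><td><b>Code</b></td><td><b>Display</b></td></tr>
<tr><td>min</td><td>Minute</td><td/></tr>
<tr><td>happy </td><td>Hour</td><td/></tr>
<tr><td>daily </td><td>Day</td><td/></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Strip surrounding whitespace from every cell, then keep only non-empty text.
texts = (td.get_text(strip=True) for td in table.find_all("td"))
comments = [text for text in texts if text]

print(comments)
```

On Python 3 the strings are already Unicode, so the extra encode('utf-8') step from the question is only needed if bytes are really wanted.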
Edit: I doubt this is the most efficient way to do it, but this will only include tds that have either text or a corresponding td in the first row.
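# Header cells from the first row
ths = table.tr.find_all('td')
# find_next_sibling('tr') skips the whitespace text node that next_sibling may return
tds_in_row = len(table.tr.find_next_sibling('tr').find_all('td'))
tds = [
    td.get_text()
    for i, td in enumerate(table.find_all('td'))
    # for this table, (i + 1) % tds_in_row is the cell's column index within a data row
    if len(ths) > (i + 1) % tds_in_row or td.text
]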
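The same rule (keep a cell if it sits under a header column or has text) can also be sketched row by row, which avoids the index arithmetic over the flattened list of cells. This is only an illustration under assumptions: it uses bs4 with the "html.parser" backend, and the sample table is a hypothetical variant of the one in the question in which the 'happy' row has an empty Display cell, to show that such a cell is kept while the trailing <td/> is dropped.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the question's table: the second data row
# has an empty cell in the "Display" column.
html = """
<table>
<tr><td><b>Code</b></td><td><b>Display</b></td></tr>
<tr><td>min</td><td>Minute</td><td/></tr>
<tr><td>happy </td><td></td><td/></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = table.find_all("tr")
# Number of header cells in the first row defines the "real" columns.
n_cols = len(rows[0].find_all("td"))

tds = [
    td.get_text(strip=True)
    for row in rows
    for col, td in enumerate(row.find_all("td"))
    # Keep cells under a header column, plus any extra cell that has text.
    if col < n_cols or td.get_text(strip=True)
]

print(tds)
```

Working per row means the column index comes straight from enumerate(), so the approach does not depend on every data row having the same number of cells.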