Extracting information from a table, excluding the header row, using bs4


Problem description


I am trying to extract information from a table using bs4 and Python. When I use the following code to extract information from the header of the table:

    tr_header=table.findAll("tr")[0]
    tds_in_header = [td.get_text()  for td in tr_header.findAll("td")]
    header_items= [data.encode('utf-8')  for data in tds_in_header]
    len_table_header = len (header_items)

It works. But when I try to extract information from the first row to the end of the table with the following code:

    tr_all=table.findAll("tr")[1:]
    tds_all = [td.get_text()  for td in tr_all.findAll("td")]
    table_info= [data.encode('utf-8')  for data in tds_all]

I get the following error:

    AttributeError: 'list' object has no attribute 'findAll'

Can anyone help me fix it?

This is the table:

    <table class="codes"><tr><td><b>Code</b>
    </td><td><b>Display</b></td><td><b>Definition</b></td>
    </tr><tr><td>active<a name="active"> </a></td>
    <td>Active</td><td>This account is active and may be used.</td></tr>
    <tr><td>inactive<a name="inactive"> </a></td>
    <td>Inactive</td><td>This account is inactive
     and should not be used to track financial information.</td></tr></table>

This is the output for tr_all:

    [<tr><td><b>Code</b></td><td><b>Display</b></td><td><b>Definition</b></td></tr>, <tr><td>active<a name="active"> </a></td><td>Active</td><td>This account is active and may be used.</td></tr>, <tr><td>inactive<a name="inactive"> </a></td><td>Inactive</td><td>This account is inactive and should not be used to track financial information.</td></tr>]

Solution

For your first question:

    import bs4

    text = """
    <table class="codes"><tr><td><b>Code</b>
    </td><td><b>Display</b></td><td><b>Definition</b></td>
    </tr><tr><td>active<a name="active"> </a></td>
    <td>Active</td><td>This account is active and may be used.</td></tr>
    <tr><td>inactive<a name="inactive"> </a></td>
    <td>Inactive</td><td>This account is inactive
     and should not be used to track financial information.</td></tr></table>"""

    # Name the parser explicitly so the result does not depend on what is installed.
    table = bs4.BeautifulSoup(text, "html.parser")
    tr_all = table.findAll("tr")[1:]  # a list of rows, header skipped

    # findAll has to be called on each row, not on the list of rows.
    tds_all = []
    for tr in tr_all:
        tds_all.append([td.get_text() for td in tr.findAll("td")])

    # Flatten the per-row lists with a double list comprehension.
    table_info = [cell.encode('utf-8') for row in tds_all for cell in row]
    print(table_info)

yields (on Python 3, where encode returns bytes)

    [b'active ', b'Active', b'This account is active and may be used.', b'inactive ', b'Inactive', b'This account is inactive\n and should not be used to track financial information.']

Regarding your second question:

"With tr_header=table.findAll("tr")[0] I do not get a list"

True. [0] is an indexing operation: it selects the first element of the list, so you get a single element, and that element has a findAll method. [1:] is the slicing operator: it returns a list, and a list has no findAll method, which is why the error appears. (Take a look at a Python tutorial on lists if you need more information.)

Actually, you build the list twice, once per call of table.findAll("tr"): once for the header and once for the rest of the rows. This is redundant. If you want to separate the header cells from the rest, you likely want something like this:
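The difference between indexing and slicing can be checked directly. Below is a minimal sketch, assuming the html.parser backend and a shortened, made-up table; the variable names are illustrative and not part of the question.

```python
import bs4

# Shortened, made-up table used only for this illustration.
html = "<table><tr><td>Code</td></tr><tr><td>active</td></tr></table>"
soup = bs4.BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")   # find_all is the current spelling of findAll

first = rows[0]              # indexing: a single tag
rest = rows[1:]              # slicing: a list of tags

print(hasattr(first, "find_all"))   # a tag can be searched further
print(hasattr(rest, "find_all"))    # a list cannot

# So the search has to be applied to each row in the slice.
cells = [td.get_text() for tr in rest for td in tr.find_all("td")]
print(cells)
```

Calling the search on each element of the slice, as in the last statement, is what avoids the AttributeError.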

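    tr_all = table.findAll("tr")   # search once, then split
    header = tr_all[0]
    tr_rest = tr_all[1:]
    tds_rest = []
    # Search the header for its <td> tags; iterating over the row itself
    # would also visit the whitespace strings between tags.
    header_data = [td.get_text().encode('utf-8') for td in header.findAll("td")]

    for tr in tr_rest:
        tds_rest.append([td.get_text() for td in tr.findAll("td")])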
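Once the header and the remaining rows are separated, one possible use is to pair each row with the header labels. The sketch below assumes the html.parser backend and a shortened two-column table; the name records is made up for the illustration.

```python
import bs4

# Shortened two-column version of the table, for illustration only.
html = """<table>
<tr><td><b>Code</b></td><td><b>Display</b></td></tr>
<tr><td>active</td><td>Active</td></tr>
<tr><td>inactive</td><td>Inactive</td></tr>
</table>"""

soup = bs4.BeautifulSoup(html, "html.parser")
tr_all = soup.find_all("tr")

# Header labels come from the first row only.
header_data = [td.get_text() for td in tr_all[0].find_all("td")]

# Each remaining row becomes a dict keyed by the header labels.
records = [
    dict(zip(header_data, (td.get_text() for td in tr.find_all("td"))))
    for tr in tr_all[1:]
]
print(records)
```

This keeps a single call to find_all("tr") and makes each cell addressable by its column name instead of by position.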

And regarding your third question:

"Is it possible to edit this code to add table information from the first row to the end of the table?"

Given your desired output in the comments below:

    rows_all = table.findAll("tr")
    header = rows_all[0]
    rows = rows_all[1:]

    data = []
    for row in rows:
        # Visit the <td> tags only; iterating over the row itself would
        # also visit the whitespace strings between tags.
        for td in row.findAll("td"):
            data.append(td.get_text())
    print(data)

    # More or less the same as above, as a one-liner:
    data = [td.get_text() for row in rows for td in row.findAll("td")]

yields (on Python 3, with the sample table above)

    ['active ', 'Active', 'This account is active and may be used.', 'inactive ', 'Inactive', 'This account is inactive\n and should not be used to track financial information.']
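Cell text taken from real markup often carries stray spaces and line breaks. One way to tidy it is sketched below, assuming the html.parser backend; clean_cell is a hypothetical helper written for this illustration and is not part of bs4.

```python
import bs4


def clean_cell(td):
    """Collapse runs of whitespace in a cell's text (hypothetical helper)."""
    return " ".join(td.get_text().split())


# One row from the table, with the line break inside the last cell kept
# and the sentence shortened.
html = """<table><tr><td>inactive<a name="inactive"> </a></td>
<td>This account is inactive
 and should not be used.</td></tr></table>"""

soup = bs4.BeautifulSoup(html, "html.parser")
data = [clean_cell(td) for td in soup.find_all("td")]
print(data)
```

Splitting on whitespace and joining with single spaces removes both the trailing space left by the empty anchor and the line break inside the sentence.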
