Using Beautiful Soup to pull dates from table

Question

I'm working with bills that have been delivered to the governor: I want to collect the date each bill was delivered and the date of the last legislative action before it was sent.

I'm doing this for a whole series of similar URLs. The problem is that my code (below) works for some URLs and not for others. I'm writing the results to a pandas dataframe and then to a csv file. When the code fails, it runs the else block even though either the if or the elif should have been triggered.

Here's a URL where it fails: https://www.nysenate.gov/legislation/bills/2011/s663 And one where it succeeds: https://www.nysenate.gov/legislation/bills/2011/s333

Take the first URL for example. Underneath the "view actions" dropdown, it says it was delivered to the governor on Jul 29, 2011. Prior to that, it was returned to assembly on Jun 20, 2011.

Using the td that contains "delivered to governor" as the reference point in the table, I'd like to collect both dates with bs4.

Here's what I have in my code:

check_list = [item.text.strip() for item in tablebody.select("td")]

dtg = "delivered to governor"
dtg_regex = re.compile(
    '/.*(\S\S\S\S\S\S\S\S\S\s\S\S\s\S\S\S\S\S\S\S\S).*'
)
        
if dtg in check_list:

    i = check_list.index(dtg)
    transfer_list.append(check_list[i+1]) ## last legislative action date (not counting dtg)
    transfer_list.append(check_list[i-1]) ## dtg date
            
elif any(dtg_regex.match(dtg_check_list) for dtg_check_list in check_list):
    transfer_list.append(check_list[4])
    transfer_list.append(check_list[2])
            
else:
    transfer_list.append("no floor vote")
    transfer_list.append("not delivered to governor")

Recommended answer

You could use :has and :contains to target the right first row, and find_next to move to the next row. In the first row, last-of-type gets the last action; in the second row, select_one gets the first action. The class on each "column" lets you move between the first and second columns.

Your mileage may vary with other pages.
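
The selector combination can be tried offline before pointing it at the live pages. The sketch below is only an illustration: the HTML is a hand-written stand-in that reuses the class names from this answer, so the real markup may differ. It uses `:-soup-contains`, the spelling that newer soupsieve releases (2.1 and later) prefer over the deprecated `:contains`; on older installs, `:contains` as written in this answer is the one to use.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the actions table. The class names are copied
# from the answer; the markup itself is an assumption, not the real page.
html = """
<table>
  <tr class="cbill--actions-table--row">
    <td class="c-bill--actions-table-col1">Jul 29, 2011</td>
    <td class="c-bill--actions-table-col2">
      <span class="c-bill--action-line-senate">delivered to governor</span>
    </td>
  </tr>
  <tr class="cbill--actions-table--row">
    <td class="c-bill--actions-table-col1">Jun 20, 2011</td>
    <td class="c-bill--actions-table-col2">
      <span class="c-bill--action-line-assembly">returned to assembly</span>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# First row that has a td whose text contains "delivered".
row = soup.select_one(
    '.cbill--actions-table--row:has(td:-soup-contains("delivered"))'
)

# Date cell of the matched row.
delivered_date = row.select_one(".c-bill--actions-table-col1").get_text(strip=True)

# The next tr in document order holds the action listed just below it.
next_row = row.find_next("tr")
previous_date = next_row.select_one(".c-bill--actions-table-col1").get_text(strip=True)
previous_action = next_row.select_one(".c-bill--actions-table-col2 span").get_text(strip=True)

print(delivered_date, "|", previous_date, "|", previous_action)
```

Because the match is on a substring of the cell text, it does not depend on the cell containing only the phrase "delivered to governor".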

import requests
from bs4 import BeautifulSoup as bs

links = ['https://www.nysenate.gov/legislation/bills/2011/s663', 'https://www.nysenate.gov/legislation/bills/2011/s333']
transfer_list = []

with requests.Session() as s:
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # first table row that has a cell whose text contains "delivered"
        target = soup.select_one('.cbill--actions-table--row:has(td:contains("delivered"))')

        if target:
            # date of the "delivered" row
            print(target.select_one('.c-bill--actions-table-col1').text)
            # transfer_list.append(target.select_one('.c-bill--actions-table-col1').text)
            # last action line in the "delivered" row
            print(target.select_one('.c-bill--action-line-assembly:last-of-type, .c-bill--action-line-senate:last-of-type').text)
            # transfer_list.append(target.select_one('.c-bill--action-line-assembly:last-of-type, .c-bill--action-line-senate:last-of-type').text)
            # date of the next row
            print(target.find_next('tr').select_one('.c-bill--actions-table-col1').text)
            # append this value to transfer_list in the same way
            # first action in the next row
            print(target.find_next('tr').select_one('.c-bill--actions-table-col2 span').text)
            # append this value to transfer_list in the same way
        else:
            # no row mentions "delivered": record placeholders
            transfer_list.append("no floor vote")
            transfer_list.append("not delivered to governor")
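
As for why the question's code can fall through to else, two things in it look like plausible contributors. `dtg in check_list` is a list membership test, so it only succeeds when a cell's entire text equals the phrase. And `re.match` anchors at the start of the string, so a pattern that begins with a literal `/` can only match a cell that starts with a slash. The sketch below shows both effects on made-up cell text (an assumption, not the content of the failing page) and a substring test as one possible alternative.

```python
import re

# Hypothetical cell texts, flattened the same way as check_list in the
# question. The second entry stands in for a td that holds more than the
# bare phrase; it is made up for illustration, not scraped.
check_list = [
    "Jul 29, 2011",
    "delivered to governor (plus other text in the same cell)",
    "Jun 20, 2011",
    "returned to assembly",
]

dtg = "delivered to governor"

# List membership compares whole strings, so extra text in the cell defeats it.
exact_hit = dtg in check_list

# The question's pattern starts with a literal "/", and re.match anchors at
# position 0, so only a cell beginning with "/" could ever match.
dtg_regex = re.compile(r'/.*(\S\S\S\S\S\S\S\S\S\s\S\S\s\S\S\S\S\S\S\S\S).*')
regex_hit = any(dtg_regex.match(cell) for cell in check_list)

# A substring test finds the cell regardless of surrounding text.
index = next((i for i, cell in enumerate(check_list) if dtg in cell), None)

delivered_date = previous_date = None
if index is not None and 0 < index < len(check_list) - 1:
    delivered_date = check_list[index - 1]  # date cell of the same row
    previous_date = check_list[index + 1]   # date cell of the next row

print(exact_hit, regex_hit, index, delivered_date, previous_date)
```

Whether this is what happens on the failing URL depends on the actual cell text there, which is worth printing out before settling on a fix.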
