Extract HTML Table Based on Specific Column Headers - Python

Question

I am trying to extract tables from the document at the URL in my code below, for example the 2019 Director Compensation Table on page 44. I believe the table has no specific id, such as 'Compensation Table'. The only approach I can think of is to match column names or keywords such as "Stock Awards" or "All Other Compensation" and then grab the associated table.

Is there an easy way to extract these tables based on column names? Or is there an even simpler approach?

Thanks!

I am relatively new to scraping HTML tables. My code so far is as follows:

from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm'
r = requests.get(url) 
soup = BeautifulSoup(r.text, 'html.parser')
rows = soup.find_all('tr')
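
The keyword-matching idea described above can be sketched with BeautifulSoup alone. This is a minimal sketch, not a tested solution for the real filing: the inline HTML, the sample values, and the helper names find_tables_with_keywords and table_to_rows are hypothetical stand-ins.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the filing: two tables, neither with an id.
SAMPLE_HTML = """
<table>
  <tr><td>Item</td><td>Page</td></tr>
  <tr><td>Proxy summary</td><td>1</td></tr>
</table>
<table>
  <tr><td>Name</td><td>Stock Awards ($)</td><td>All Other Compensation ($)</td></tr>
  <tr><td>Director A</td><td>100</td><td>5</td></tr>
</table>
"""


def find_tables_with_keywords(html, keywords):
    """Return every <table> whose text contains all of the given keywords."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for table in soup.find_all("table"):
        # Collapse whitespace so keywords broken across lines still match.
        text = " ".join(table.get_text(" ").split()).lower()
        if all(keyword.lower() in text for keyword in keywords):
            matches.append(table)
    return matches


def table_to_rows(table):
    """Convert a <table> tag into a list of rows of cell strings."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in table.find_all("tr")
    ]


tables = find_tables_with_keywords(
    SAMPLE_HTML, ["Stock Awards", "All Other Compensation"]
)
rows = table_to_rows(tables[0])
print(rows)
```

Because find_all("table") also returns nested tables, a layout table that wraps the target table would match as well, so a real page may need an extra filtering step.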

Recommended answer

Sure, you can do that with the pandas read_html function, using its match and attrs parameters as described in the documentation.

import pandas as pd

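# read_html returns a list of DataFrames, one per matching <table>.
tables = pd.read_html(
    "https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm",
    # Restrict the search to tables carrying this style attribute.
    attrs={'style': 'border-collapse: collapse; width: 100%; font: 9pt Arial, Helvetica, Sans-Serif'},
    # Keep only tables that contain this text.
    match="Non-Employee Directors",
)

print(tables)

# Save the first matching table without the index or a header row.
tables[0].to_csv("data.csv", index=False, header=False)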
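
Note that match looks for the text anywhere in a table, not only in its header row. To select tables by column headers specifically, one option is to parse every table and then filter on the column names. The sketch below assumes the header is the first row of the table; the inline HTML, the sample values, and the helper name tables_with_columns are hypothetical and stand in for the real filing.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the filing: two tables, neither with an id.
SAMPLE_HTML = """
<table>
  <tr><td>Item</td><td>Page</td></tr>
  <tr><td>Proxy summary</td><td>1</td></tr>
</table>
<table>
  <tr><td>Name</td><td>Stock Awards ($)</td><td>All Other Compensation ($)</td></tr>
  <tr><td>Director A</td><td>100</td><td>5</td></tr>
</table>
"""


def tables_with_columns(html, keywords):
    """Return the parsed tables whose column headers mention every keyword."""
    # header=0 treats the first row of each table as the column names.
    frames = pd.read_html(StringIO(html), header=0)
    selected = []
    for frame in frames:
        headers = " ".join(str(col) for col in frame.columns).lower()
        if all(keyword.lower() in headers for keyword in keywords):
            selected.append(frame)
    return selected


matching = tables_with_columns(
    SAMPLE_HTML, ["Stock Awards", "All Other Compensation"]
)
print(matching[0])
```

On a real page the header may span several rows or sit below a title row, in which case the header argument would need adjusting before the column filter can work.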

