Extract HTML Table Based on Specific Column Headers - Python
Question
I am trying to extract a table from an SEC filing, for example the 2019 Director Compensation Table on page 44. I believe the table doesn't have a specific id, such as 'Compensation Table'. To extract the table, I can only think of matching column names or keywords such as "Stock Awards" or "All Other Compensation" and then grabbing the associated table.
Is there an easy way to extract these tables based on column names? Or maybe an easier way?
Thanks!
I am relatively new to scraping HTML tables. My code is as follows:
from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
rows = soup.find_all('tr')
Recommended answer
Sure, you can do that with the pandas read_html function, using its match and attrs parameters as described in the documentation.
import pandas as pd
# read_html returns a list of DataFrames, one for each table that matches
tables = pd.read_html(
    "https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm",
    match="Non-Employee Directors",
    attrs={'style': 'border-collapse: collapse; width: 100%; font: 9pt Arial, Helvetica, Sans-Serif'},
)
print(tables)
# Save the first matching table without the index or header row
tables[0].to_csv("data.csv", index=False, header=False)
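If you would rather stay with BeautifulSoup, as in the question's code, the keyword-matching idea can also be sketched by hand. The example below is only a minimal illustration under stated assumptions: the inline sample HTML and its values are made up, and the helper name `find_tables_with_headers` is hypothetical, not part of BeautifulSoup. It looks for the keywords anywhere in a table's text, since filings like this often spread their headers over several rows.

```python
from bs4 import BeautifulSoup

# Made-up sample HTML standing in for the filing; the real page is far larger.
SAMPLE_HTML = """
<table>
  <tr><td>Item</td><td>Page</td></tr>
  <tr><td>Proxy summary</td><td>3</td></tr>
</table>
<table>
  <tr><td>Name</td><td>Stock Awards ($)</td><td>All Other Compensation ($)</td></tr>
  <tr><td>Director A</td><td>100</td><td>5</td></tr>
</table>
"""


def find_tables_with_headers(soup, keywords):
    """Return the rows (lists of cell text) of each table containing every keyword."""
    matches = []
    for table in soup.find_all("table"):
        # Skip outer layout tables; only inspect tables with no nested table.
        if table.find("table") is not None:
            continue
        # Case-insensitive search over the whole table text, not just one row.
        text = table.get_text(" ", strip=True).lower()
        if all(keyword.lower() in text for keyword in keywords):
            rows = [
                [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
                for tr in table.find_all("tr")
            ]
            matches.append(rows)
    return matches


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tables = find_tables_with_headers(soup, ["Stock Awards", "All Other Compensation"])
for row in tables[0]:
    print(row)
```

To try this on the real filing, build `soup` from `r.text` as in the question instead of `SAMPLE_HTML`. Expect to adjust the keywords, because the header text in the filing may be split across cells or lines.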