Extract HTML Tables With Similar Data from Different Sources with Different Formatting - Python
Question
I am trying to scrape HTML tables from two different HTML sources. The two are very similar: each table contains the same data, but the tables may be structured differently, with different column names and so on. One source may include all of the data in a single table, while the other may split it across two separate tables.
As an example, we can look at the insider holders of both AAPL and MMM stock.
Screenshots here: https://imgur.com/a/OihTSZR
Let's say the end goal is to extract the total number of shares held by insiders, a single number. Each table may be structured differently, but keywords such as "Beneficially" or "Stock" should appear in all of them.
Any help would be greatly appreciated. In a previous post I was able to extract some of the data, but that approach can't be looped or reused when the tables are structured differently.
import pandas as pd

df = pd.read_html("https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm", attrs={'style': 'border-collapse: collapse; width: 100%; font: 9pt Arial, Helvetica, Sans-Serif'}, match="Name/address")
df = df[0]
df = df.dropna(axis='columns')
I also tried BeautifulSoup:
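import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
tables = soup.find_all('table')
# find_all returns a list of tables, so collect the rows table by table
rows = [row for table in tables for row in table.find_all('tr')]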
Recommended answer
That was really complicated, but here we go :).
import re
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = ['https://www.sec.gov/Archives/edgar/data/320193/000119312520001450/d799303ddef14a.htm',
        'https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm']


def main(urls):
    with requests.Session() as req:
        for url in urls:
            r = req.get(url)
            soup = BeautifulSoup(r.content, 'html.parser')
            # Links whose text starts with "Security" point at the
            # ownership section through an in-page anchor.
            for item in soup.find_all("a", string=re.compile("^Security")):
                item = item.get("href")[1:]  # drop the first character of the href
                # Take the first table that follows the named anchor.
                catch = soup.find("a", {'name': item}).find_next("table")
                # Wrapping in StringIO avoids the literal-HTML deprecation in newer pandas.
                df = pd.read_html(StringIO(str(catch)))
                print(df)
                df[0].to_csv(f"{item}.csv", index=False, header=False)


main(urls)
Output:
[ 0 ... 8
0 NaN ... NaN
1 NaN ... NaN
2 Name of Beneficial Owner ... NaN
3 NaN ... NaN
4 The Vanguard Group ... %
5 NaN ... NaN
6 BlackRock, Inc. ... %
7 NaN ... NaN
8 Berkshire Hathaway Inc. / Warren E. Buffett ... %
9 NaN ... NaN
10 Kate Adams ... NaN
11 NaN ... NaN
12 Angela Ahrendts ... NaN
13 NaN ... NaN
14 James Bell ... NaN
15 NaN ... NaN
16 Tim Cook ... NaN
17 NaN ... NaN
18 Al Gore ... NaN
19 NaN ... NaN
20 Andrea Jung ... NaN
21 NaN ... NaN
22 Art Levinson ... NaN
23 NaN ... NaN
24 Luca Maestri ... NaN
25 NaN ... NaN
26 Deirdre O’Brien ... NaN
27 NaN ... NaN
28 Ron Sugar ... NaN
29 NaN ... NaN
30 Sue Wagner ... NaN
31 NaN ... NaN
32 Jeff Williams ... NaN
33 NaN ... NaN
34 All current executive officers and directors a... ... NaN
[35 rows x 9 columns]]
[ 0 1 ... 18 19
0 Name and principal position NaN ... Percent of Class NaN
1 Thomas "Tony" K. Brown, Director NaN ... (5) NaN
2 Pamela J. Craig, Director NaN ... (5) NaN
3 David B. Dillon, Director NaN ... (5) NaN
4 Michael L. Eskew, Director NaN ... (5) NaN
5 Herbert L. Henkel, Director NaN ... (5) NaN
6 Amy E. Hood, Director NaN ... (5) NaN
7 Muhtar Kent, Director NaN ... (5) NaN
8 Edward M. Liddy, Director NaN ... (5) NaN
9 Dambisa F. Moyo, Director NaN ... (5) NaN
10 Gregory R. Page, Director NaN ... (5) NaN
11 Patricia A. Woertz, Director NaN ... (5) NaN
12 Michael F. Roman, Chairman of the Board, Presi... NaN ... (5) NaN
13 Inge G. Thulin, Former Executive Chairman of t... NaN ... (5) NaN
14 Nicholas C. Gangestad, Senior Vice President a... NaN ... (5) NaN
15 Ashish K. Khandpur, Executive Vice President, ... NaN ... (5) NaN
16 Julie L. Bushman, Executive Vice President, In... NaN ... (5) NaN
17 Joaquin Delgado, Former Executive Vice Preside... NaN ... (5) NaN
18 Michael G. Vale, Executive Vice President, Saf... NaN ... (5) NaN
19 All Directors and Executive Officers as a Grou... NaN ... (5) NaN
[20 rows x 20 columns]]
[ 0 1 ... 6 7
0 Name/address NaN ... Percent of Class NaN
1 The Vanguard Group(1) 100 Vanguard Blvd. Malve... NaN ... 8.78 NaN
2 State Street Corporation(2) State Street Finan... NaN ... 7.36 NaN
3 BlackRock, Inc.(3) 55 East 52nd Street New Yor... NaN ... 7.30 NaN
[4 rows x 8 columns]]
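To get from tables like these to the single number the question asks for, one possible approach is to keep only tables that mention the expected keywords and then read the first share count from the "as a group" row. The sketch below is only an illustration under stated assumptions: the helper names `is_ownership_table` and `total_insider_shares` are hypothetical, the keyword list and the "as a group" label are guesses based on the row names printed above, and the sample rows are made up rather than taken from the filings. It works on plain lists of rows, such as what `df.values.tolist()` returns.

```python
import re

# Assumed keywords and row label; adjust them to the filings being scraped.
KEYWORDS = ("beneficial", "stock")
GROUP_ROW = re.compile(r"as a group", re.IGNORECASE)
# A whole cell that is an integer, with or without thousands separators.
NUMBER = re.compile(r"^(?:\d{1,3}(?:,\d{3})+|\d+)$")


def is_ownership_table(rows, keywords=KEYWORDS):
    """Return True if any cell in the table mentions one of the keywords."""
    text = " ".join(str(cell) for row in rows for cell in row).lower()
    return any(word in text for word in keywords)


def total_insider_shares(rows):
    """Return the first share count after the 'as a group' label, or None."""
    for row in rows:
        cells = [str(cell).strip() for cell in row]
        label_idx = next(
            (i for i, cell in enumerate(cells) if GROUP_ROW.search(cell)), None
        )
        if label_idx is None:
            continue
        # Scan the cells to the right of the label for the first plain number.
        for cell in cells[label_idx + 1:]:
            if NUMBER.match(cell):
                return int(cell.replace(",", ""))
    return None


# Made-up rows shaped like df.values.tolist(); the numbers are not real data.
sample = [
    ["Name of Beneficial Owner", float("nan"), "Shares", "Percent"],
    ["Example Director", float("nan"), "1,000", "*"],
    ["All directors and executive officers as a group", float("nan"), "12,345", "*"],
]
unrelated = [["Fee type", "2019"], ["Audit fees", "100"]]

if is_ownership_table(sample):
    print(total_insider_shares(sample))  # prints 12345
```

A table without an "as a group" row, such as the third one printed above, makes `total_insider_shares` return None, so the caller can skip it. Cells that carry footnote markers or other decorations next to the number would need a looser pattern than the one assumed here.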