Read tables from an HTML page by changing the ID using Python
Problem description
I am using the link below to read the table on the page:
http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664
The last part of the link (allbin) is an ID. Different IDs give access to different tables and records; the rest of the link stays the same and only the ID at the end changes. I have about 1000 more IDs like this. How can I use those IDs to access the different tables and join them together?
The desired output looks like this:
ID       Number       Type          FileDate
2016664  NB 14581-26  New Building  12/21/2020
4257909  NB 1481-29   New Building  3/6/2021
4138920  NB 481-29    New Building  9/4/2020
Some of the other IDs to use:
['4257909', '4138920', '4533715']
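The URL pattern described above, where only the allbin value changes, can be sketched as follows. This is a minimal illustration using only the standard library; the helper name build_url and the BASE_URL constant are hypothetical and not part of the original post.

```python
from urllib.parse import urlencode

# Base address taken from the link in the question; only the query string varies.
BASE_URL = "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet"


def build_url(bin_id, requestid=1):
    """Return the page URL for one ID (hypothetical helper).

    The ID is passed as the `allbin` query parameter, as in the question's link.
    """
    query = urlencode({"requestid": requestid, "allbin": bin_id})
    return "{}?{}".format(BASE_URL, query)


ids = ["2016664", "4257909", "4138920", "4533715"]
urls = [build_url(i) for i in ids]

for u in urls:
    print(u)
```

Each generated URL can then be fetched and parsed the same way as a single page.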
This was my attempt. I can read a single table with this code:
import requests
import pandas as pd
url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
html = requests.get(url).content
df_list = pd.read_html(html,header=0)
df = df_list[3]
df
To get all pages for a list of IDs, you can use the following example:
import requests
import pandas as pd
from io import StringIO

# allbin takes the ID; allcount takes the paging counter
url = "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin={}&allcount={}"


def get_info(ID, page=1):
    out = []
    while True:
        try:
            print("ID: {} Page: {}".format(ID, page))
            t = requests.get(url.format(ID, page), timeout=1).text
            # take the fourth table on the page and skip its first row
            df = pd.read_html(StringIO(t))[3].loc[1:, :].copy()
            if len(df) == 0:
                # no more records for this ID
                break
            df.columns = ["NUMBER", "NUMBER", "TYPE", "FILE DATE"]
            df["ID"] = ID
            out.append(df)
            page += 25  # advance the counter to the next page
        except requests.exceptions.ReadTimeout:
            print("Timeout...")
            continue  # retry the same page
    return out


list_of_ids = [2016664, 4257909, 4138920, 4533715]

dfs = []
for ID in list_of_ids:
    dfs.extend(get_info(ID))

# join the per-page tables into one DataFrame
df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=False)
Prints:
   NUMBER           NUMBER  TYPE                      FILE DATE   ID
1  ALT 1469-1890    NaN     ALTERATION                00/00/0000  2016664
2  ALT 1313-1874    NaN     ALTERATION                00/00/0000  2016664
3  BN 332-1938      NaN     BUILDING NOTICE           00/00/0000  2016664
4  BN 636-1916      NaN     BUILDING NOTICE           00/00/0000  2016664
5  CO NB 1295-1923  (PDF)   CERTIFICATE OF OCCUPANCY  00/00/0000  2016664
...
It also saves the result to data.csv (screenshot from LibreOffice).
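The answer's loop retries a timed-out page indefinitely. With around 1000 IDs, it may be safer to cap the number of attempts. Below is a minimal sketch of that idea, not part of the original answer: fetch_with_retries and its parameters are hypothetical names, and the flaky function only simulates a server that times out twice so the example runs without network access.

```python
import time


def fetch_with_retries(fetch, max_retries=3, delay=0.0, retry_on=(TimeoutError,)):
    """Call fetch() until it succeeds or max_retries attempts have failed.

    Hypothetical helper: `fetch` is any zero-argument callable, and
    `retry_on` lists the exception types that should trigger a retry.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except retry_on as exc:
            last_error = exc
            print("Attempt {} failed: {}".format(attempt, exc))
            if attempt < max_retries:
                time.sleep(delay)  # wait before the next attempt
    raise RuntimeError("giving up after {} attempts".format(max_retries)) from last_error


# Simulated download that times out twice and then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"


result = fetch_with_retries(flaky, max_retries=5)
print(result)
```

Inside get_info, the request could be wrapped in the same way, for example by passing lambda: requests.get(url.format(ID, page), timeout=1).text as fetch and retry_on=(requests.exceptions.ReadTimeout,), so that an ID that never responds raises an error instead of looping forever.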