通过使用 python 更改 ID 从 HTML 页面读取表格 [英] Read tables from HTML page by changing the ID using python

查看:30
本文介绍了通过使用 python 更改 ID 从 HTML 页面读取表格的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用下面的html链接来阅读页面中的表格:

I am using the html link below to read the table in the page:
http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664

The last part of the link(allbin) is an ID. This ID changes and by using different IDs you can access different tables and records. The link although remains the same, just the ID in the end changes every time. I have like 1000 more different IDs like this. So, How can I actually use those different IDs to access different tables and join them together?

Output Like this,

ID         Number         Type             FileDate
2016664   NB 14581-26     New Building    12/21/2020
4257909   NB 1481-29      New Building    3/6/2021
4138920   NB 481-29       New Building    9/4/2020

List of other ID for use:

['4257909', '4138920', '4533715']

This was my attempt, I can read a single table with this code.

import requests
import pandas as pd

url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
html = requests.get(url).content
df_list = pd.read_html(html,header=0)
df = df_list[3]
    
df 

解决方案

To get all pages from list of IDs you can use next example:

import requests
import pandas as pd
from io import StringIO

url = "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin={}&allcount={}"


def get_info(ID, page=1):
    out = []
    while True:
        try:
            print("ID: {} Page: {}".format(ID, page))
            t = requests.get(url.format(ID, page), timeout=1).text
            df = pd.read_html(StringIO(t))[3].loc[1:, :]
            if len(df) == 0:
                break
            df.columns = ["NUMBER", "NUMBER", "TYPE", "FILE DATE"]
            df["ID"] = ID
            out.append(df)
            page += 25
        except requests.exceptions.ReadTimeout:
            print("Timeout...")
            continue
    return out


list_of_ids = [2016664, 4257909, 4138920, 4533715]

dfs = []
for ID in list_of_ids:
    dfs.extend(get_info(ID))

df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=None)

Prints:

                                                                              NUMBER                                                                            NUMBER                                                                              TYPE                                                                         FILE DATE       ID
1                                                                      ALT 1469-1890                                                                               NaN                                                                        ALTERATION                                                                        00/00/0000  2016664
2                                                                      ALT 1313-1874                                                                               NaN                                                                        ALTERATION                                                                        00/00/0000  2016664
3                                                                        BN 332-1938                                                                               NaN                                                                   BUILDING NOTICE                                                                        00/00/0000  2016664
4                                                                        BN 636-1916                                                                               NaN                                                                   BUILDING NOTICE                                                                        00/00/0000  2016664
5                                                                    CO NB 1295-1923                                                                             (PDF)                                                          CERTIFICATE OF OCCUPANCY                                                                        00/00/0000  2016664

...

And saves data.csv (screenshot from LibreOffice):

这篇关于通过使用 python 更改 ID 从 HTML 页面读取表格的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆