Iterate and extract tables from the web, saving as an Excel file in Python


Problem description


I want to iterate over and extract the tables from the link here (http://zjj.sz.gov.cn/ztfw/gcjs/xmxx/jgysba/), then save them as an Excel file.

How can I do that? Thank you.

My code so far:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

# Fetch the listing page and parse the HTML
url = 'http://zjj.sz.gov.cn/ztfw/gcjs/xmxx/jgysba/'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')
print(soup)

New update:

import requests
import json
import pandas as pd
import numpy as np

headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
        "Referer": "http://zjj.sz.gov.cn/projreg/public/jgys/jgysList.jsp"}
dfs = []

# Request 10 pages of 100 records each from the JSON endpoint
for page in range(0, 10):
    data = {"limit": 100, "offset": page * 100, "pageNumber": page + 1}
    json_arr = requests.post("http://zjj.sz.gov.cn/projreg/public/jgys/webService/getJgysLogList.json", headers=headers, data=data).text
    d = json.loads(json_arr)
    # 'rows' holds the list of record dicts, so orient='records' matches its layout
    df = pd.read_json(json.dumps(d['rows']), orient='records')
    dfs.append(df)
    print(dfs)

dfs = pd.concat(dfs)
# https://stackoverflow.com/questions/57842073/pandas-how-to-drop-rows-when-all-float-columns-are-nan
# Drop columns whose values are all zero/NaN
dfs = dfs.loc[:, ~dfs.replace(0, np.nan).isna().all()]
dfs.to_excel('test.xlsx', index=False)

It generates 10 pages and 1,000 rows, but some column values are misplaced. Does anyone know where I went wrong? Thank you.

Solution

So, using the JSON API found in the XHR requests, you can make a simple Python POST request via requests and you have your data.

In the params there are two that you can change to get different volumes of data: limit is the number of objects you get in a request, and pageNumber is the paginated page counter.

from requests import post

url = 'http://zjj.sz.gov.cn/projreg/public/jgys/webService/getJgysLogList.json'
data = {'limit': '100', 'pageNumber': '1'}

# Post the form parameters and read back the JSON payload
response = post(url, data=data)
print(response.text)

Further, you can use pandas to create a DataFrame or write an Excel file, as you want.
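
As a minimal sketch of that last step, assuming the response JSON keeps its records under the 'rows' key (as in the question's update) and that an Excel writer such as openpyxl is installed; the output filename is arbitrary:

import json

import pandas as pd
from requests import post

url = 'http://zjj.sz.gov.cn/projreg/public/jgys/webService/getJgysLogList.json'
data = {'limit': '100', 'pageNumber': '1'}

# Parse the JSON body of the POST response
d = json.loads(post(url, data=data).text)

# Build a DataFrame from the record list and write it to Excel
# ('rows' is the key used in the question; the filename is just an example)
df = pd.DataFrame(d.get('rows', []))
df.to_excel('jgys_page1.xlsx', index=False)

Building the DataFrame directly from the list of dicts avoids the json.dumps round-trip used in the question, and repeating the request with a higher pageNumber before pd.concat gives the full table.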
