Iterate and extract tables from web saving as excel file in Python
Problem description
I want to iterate over and extract the tables from the link here, then save them as an Excel file. How can I do that? Thank you.
My code so far:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
url = 'http://zjj.sz.gov.cn/ztfw/gcjs/xmxx/jgysba/'
res = requests.get(url)
soup = BeautifulSoup(res.content,'lxml')
print(soup)
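Printing soup here shows why this first attempt finds no table: the rows are not in the static HTML but are loaded by the page's JavaScript via an XHR call (as the answer below points out). A minimal sketch with a hypothetical page skeleton, not the real site's markup, illustrates the symptom:

```python
from bs4 import BeautifulSoup

# Hypothetical skeleton standing in for res.content: pages that fetch
# their rows via XHR ship only an empty container plus a script tag.
html = '<html><body><div id="list"></div><script src="list.js"></script></body></html>'

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('table'))  # no <table> exists in the static markup
```

So scraping the rendered URL with BeautifulSoup alone cannot work; the data has to come from the JSON endpoint instead.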
New update:
import requests
import json
import pandas as pd
import numpy as np

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    "Referer": "http://zjj.sz.gov.cn/projreg/public/jgys/jgysList.jsp"}

dfs = []
for page in range(0, 10):
    data = {"limit": 100, "offset": page * 100, "pageNumber": page + 1}
    json_arr = requests.post("http://zjj.sz.gov.cn/projreg/public/jgys/webService/getJgysLogList.json", headers=headers, data=data).text
    d = json.loads(json_arr)
    df = pd.read_json(json.dumps(d['rows']), orient='list')
    dfs.append(df)
    print(dfs)

dfs = pd.concat(dfs)
# https://stackoverflow.com/questions/57842073/pandas-how-to-drop-rows-when-all-float-columns-are-nan
dfs = dfs.loc[:, ~dfs.replace(0, np.nan).isna().all()]
dfs.to_excel('test.xlsx', index=False)
It generates 10 pages and 1,000 rows, but some column values are misplaced. Does anyone know where I went wrong? Thank you.
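One likely cause of the misplaced values (a guess, not verified against the live API) is assembling the rows positionally rather than by key: if the records in `d['rows']` do not all carry the same keys in the same order, values can land under the wrong header. Building the frame with `pd.DataFrame` on the list of dicts aligns every value by key name instead; missing keys simply become NaN. A minimal sketch with hypothetical field names:

```python
import pandas as pd

# Hypothetical sample standing in for d['rows']: records may omit keys
# or list them in a different order from row to row.
rows = [
    {"projectName": "A", "address": "X road", "builder": "B1"},
    {"address": "Y road", "projectName": "B"},  # "builder" missing here
]

# pd.DataFrame aligns each value to its column by key name, so the
# missing "builder" becomes NaN instead of shifting other values over.
df = pd.DataFrame(rows)
print(df.columns.tolist())  # columns are the union of all keys
```

Replacing the `pd.read_json(json.dumps(d['rows']), orient='list')` line with `pd.DataFrame(d['rows'])` would apply the same idea to the real response.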
So, using the JSON API found in the XHR requests, you can make a simple Python POST request via requests and you have your data. Among the params there are two you can change to get different volumes of data: limit is the number of objects you get per request, and pageNumber is the paginated page counter.
from requests import post
import json

url = 'http://zjj.sz.gov.cn/projreg/public/jgys/webService/getJgysLogList.json'
data = {'limit': '100', 'pageNumber': '1'}
response = post(url, data=data)
response.text
Further, you can use pandas to create a data frame or an Excel file as you want.
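That last step can be sketched like this. The sample JSON string is a hypothetical stand-in for `response.text`, assuming (as the code above suggests) that the endpoint returns an object whose `rows` field holds the records; the field names are illustrative, not the real schema:

```python
import json
import pandas as pd

# Hypothetical response body standing in for response.text.
sample = '{"total": 2, "rows": [{"projectName": "A", "passDate": "2020-01-01"}, {"projectName": "B", "passDate": "2020-01-02"}]}'

payload = json.loads(sample)
df = pd.DataFrame(payload["rows"])  # one DataFrame row per record

try:
    # Writing .xlsx needs an Excel engine such as openpyxl.
    df.to_excel("jgys_sample.xlsx", index=False)
except ModuleNotFoundError:
    # Fall back to CSV if no Excel engine is installed.
    df.to_csv("jgys_sample.csv", index=False)
```

For the live endpoint, `payload = response.json()` replaces the sample string.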