Loop pages and save detailed contents as dataframe in Python
Problem description
Say I need to crawl the detailed contents from this link:
The objective is to extract the contents of the elements from the link and append all the entries as a dataframe.
from bs4 import BeautifulSoup
import requests

url = 'http://www.jscq.com.cn/dsf/zc/cjgg/202101/t20210126_30144.html'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# collect every text node, then drop those whose parent tag is noise
text = soup.find_all(text=True)
output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script'
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)
Out:
南京市玄武区锁金村10-30号房屋公开招租成交公告-成交公告-江苏产权市场
body{font-size:100%!important;}
.main_body{position:relative;width:1000px;margin:0 auto;background-color:#fff;}
.main_content_p img{max-width:90%;display:block;margin:0 auto;}
.m_con_r_h{padding-left: 20px;width: 958px;height: 54px;line-height: 55px;font-size: 12px;color: #979797;}
.m_con_r_h a{color: #979797;}
.main_content_p{min-height:200px;width:90%;margin:0 auto;line-height: 30px;text-indent:0;}
.main_content_p table{margin:0 auto!important;width:900px!important;}
.main_content_h1{border:none;width:93%;margin:0 auto;}
.tit_h{font-size:22px;font-family:'微软雅黑';color:#000;line-height:30px;margin-bottom:10px;padding-bottom:20px;text-align:center;}
.doc_time{font-size:12px;color:#555050;height:28px;line-height:28px;text-align:center;background:#F2F7FD;border-top:1px solid #dadada;}
.doc_time span{padding:0 5px;}
.up_dw{width:100%;border-top:1px solid #ccc;padding-top:10px;padding-bottom:10px;margin-top:30px;clear:both;}
.pager{width:50%;float:left;padding-left:0;text-align:center;}
.bshare-custom{position:absolute;top:20px;right:40px;}
.pager{width:90%;padding-left: 50px;float:inherit;text-align: inherit;}
页头部分开始
页头部分结束
START body
南京市玄武区锁金村10-30号房屋公开招租成交公告
组织机构:江苏省产权交易所
发布时间:2021-01-26
项目编号
17FCZZ20200125
转让/出租标的名称
南京市玄武区锁金村10-30号房屋公开招租
转让方/出租方名称
南京邮电大学资产经营有限责任公司
转让标的评估价/年租金评估价(元)
64800.00
转让底价/年租金底价(元)
97200.00
受让方/承租方名称
马尕西木
成交价/成交年租金(元)
97200.00
成交日期
2021年01月15日
附件:
END body
页头部分开始
页头部分结束
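The CSS rules and the page title leak into the output above because text inside `<style>` and `<title>` tags is not covered by the blacklist. A minimal offline sketch (using a small inline HTML snippet rather than the live page) of extending the blacklist to suppress them:

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo</title>
<style>body{font-size:100%;}</style></head>
<body><p>项目编号</p><p>17FCZZ20200125</p>
<script>var x = 1;</script></body></html>
"""

# 'style' and 'title' added to the blacklist so CSS rules and the
# page title no longer appear in the extracted text
blacklist = ['[document]', 'noscript', 'header', 'html', 'meta',
             'head', 'input', 'script', 'style', 'title']

soup = BeautifulSoup(html, "html.parser")
text = soup.find_all(text=True)
output = ' '.join(t.strip() for t in text
                  if t.parent.name not in blacklist and t.strip())
print(output)  # 项目编号 17FCZZ20200125
```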
But how could I loop over all the pages, extract the contents, and append them to the following dataframe? Thanks.
Update for appending dfs as a dataframe:
updated_df = pd.DataFrame()

with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        # print(f"Fetching data for {key}...")
        dfs = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        # https://stackoverflow.com/questions/39710903/pd-read-html-imports-a-list-rather-than-a-dataframe
        for df in dfs:
            # transpose the two-column table and drop the header row
            df = df.T.iloc[1:, :].copy()
            updated_df = updated_df.append(df)

print(updated_df)

cols = ['项目编号', '转让/出租标的名称', '转让方/出租方名称', '转让标的评估价/年租金评估价(元)',
        '转让底价/年租金底价(元)', '受让方/承租方名称', '成交价/成交年租金(元)', '成交日期']
updated_df.columns = cols
updated_df.to_excel('./data.xlsx', index=False)
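Note that `DataFrame.append` was removed in pandas 2.0. A minimal sketch of the same accumulation using `pd.concat` instead; the two inline frames here stand in for the tables `pd.read_html` would return from the pages:

```python
import pandas as pd

# collect the per-page frames in a list and concatenate once at the end,
# instead of the removed DataFrame.append
frames = []
for raw in (pd.DataFrame({0: ['项目编号'], 1: ['17FCZZ20200125']}),
            pd.DataFrame({0: ['项目编号'], 1: ['18ABCD20210001']})):
    # same reshaping as above: transpose and drop the header row
    frames.append(raw.T.iloc[1:, :].copy())

updated_df = pd.concat(frames, ignore_index=True)
print(updated_df)
```

Concatenating once is also faster than appending row by row, since each `append` used to copy the whole frame.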
Solution
Here's how I would do this:
1. build all main urls
2. visit every main page
3. get the follow urls
4. visit each follow url
5. grab the table from the follow url
6. parse the table with pandas
7. add the table to a dictionary of pandas dataframes
8. process the tables (not included -> implement your logic)

Repeat steps 2 - 7 to continue scraping the data.
The code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.jscq.com.cn/dsf/zc/cjgg"


def get_main_urls() -> list:
    start_url = f"{BASE_URL}/index.html"
    return [start_url] + [f"{BASE_URL}/index_{i}.html" for i in range(1, 6)]


def get_follow_urls(urls: list, session: requests.Session):
    # generator that yields the detail-page urls found on each index page
    for url in urls[:1]:  # remove [:1] to scrape all the pages
        body = session.get(url).content
        s = BeautifulSoup(body, "lxml").find_all("td", {"width": "60%"})
        yield from [f"{BASE_URL}{a.find('a')['href'][1:]}" for a in s]


dataframe_collection = {}

with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        print(f"Fetching data for {key}...")
        df = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        dataframe_collection[key] = df

# process the dataframe_collection here

# print the dictionary of dataframes (optional and can be removed)
for key in dataframe_collection.keys():
    print("\n" + "=" * 40)
    print(key)
    print("-" * 40)
    print(dataframe_collection[key])
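As the Stack Overflow link in the question's update notes, `pd.read_html` always returns a *list* of DataFrames, one per `<table>` found in the markup, which is why each value stored in `dataframe_collection` is a list. A small offline sketch of that behavior:

```python
from io import StringIO
import pandas as pd

# a stand-in for one detail page: a single two-column table
html = """
<table>
  <tr><td>项目编号</td><td>17FCZZ20200125</td></tr>
  <tr><td>成交日期</td><td>2021年01月15日</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> in the input
dfs = pd.read_html(StringIO(html))
print(len(dfs))       # 1: a single table was found
print(dfs[0].shape)   # (2, 2): two rows, two columns
```

So when processing the collection, index into each list (e.g. `dfs[0]`) or loop over it before reshaping, as the update above does.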
Output:
Fetching data for t20210311_30347...
Fetching data for t20210311_30346...
Fetching data for t20210305_30338...
Fetching data for t20210305_30337...
Fetching data for t20210303_30323...
Fetching data for t20210225_30306...
Fetching data for t20210225_30305...
Fetching data for t20210225_30304...
Fetching data for t20210225_30303...
Fetching data for t20210209_30231...
and then ...