Loop pages and save detailed contents as dataframe in Python


Problem Description


Say I need to crawl the detailed contents from this link (the page fetched in the code below):

The objective is to extract the contents of the elements from the link and append all the entries into a dataframe.

from bs4 import BeautifulSoup
import requests

url = 'http://www.jscq.com.cn/dsf/zc/cjgg/202101/t20210126_30144.html'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
text = soup.find_all(text=True)  # every text node on the page

output = ''
# tags whose text should not end up in the dump
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script'
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
print(output)
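
A side note (not part of the original post): the CSS rules that appear in the output below come from text nodes inside <style> tags, which the blacklist above does not cover. A minimal tweak, assuming you also want those filtered out:

# assumption: adding 'style' to the blacklist keeps the inline CSS out of the dump
blacklist.append('style')
output = ' '.join(t for t in text if t.parent.name not in blacklist)
print(output)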

Out:

南京市玄武区锁金村10-30号房屋公开招租成交公告-成交公告-江苏产权市场 
body{font-size:100%!important;}
.main_body{position:relative;width:1000px;margin:0 auto;background-color:#fff;}
.main_content_p img{max-width:90%;display:block;margin:0 auto;}
.m_con_r_h{padding-left: 20px;width: 958px;height: 54px;line-height: 55px;font-size: 12px;color: #979797;}
.m_con_r_h a{color: #979797;}
.main_content_p{min-height:200px;width:90%;margin:0 auto;line-height: 30px;text-indent:0;}
.main_content_p table{margin:0 auto!important;width:900px!important;}
.main_content_h1{border:none;width:93%;margin:0 auto;}
.tit_h{font-size:22px;font-family:'微软雅黑';color:#000;line-height:30px;margin-bottom:10px;padding-bottom:20px;text-align:center;}
.doc_time{font-size:12px;color:#555050;height:28px;line-height:28px;text-align:center;background:#F2F7FD;border-top:1px solid #dadada;}
.doc_time span{padding:0 5px;}
.up_dw{width:100%;border-top:1px solid #ccc;padding-top:10px;padding-bottom:10px;margin-top:30px;clear:both;}
.pager{width:50%;float:left;padding-left:0;text-align:center;}

.bshare-custom{position:absolute;top:20px;right:40px;}
.pager{width:90%;padding-left: 50px;float:inherit;text-align: inherit;}
 页头部分开始 
 页头部分结束 
  START body  
 南京市玄武区锁金村10-30号房屋公开招租成交公告 
 组织机构:江苏省产权交易所 
 发布时间:2021-01-26  
 项目编号 
 17FCZZ20200125 
 转让/出租标的名称 
 南京市玄武区锁金村10-30号房屋公开招租 
 转让方/出租方名称 
 南京邮电大学资产经营有限责任公司 
 转让标的评估价/年租金评估价(元) 
 64800.00 
 转让底价/年租金底价(元) 
 97200.00 
 受让方/承租方名称 
 马尕西木 
 成交价/成交年租金(元) 
 97200.00 
 成交日期 
 2021年01月15日 
 附件: 
  END body  
 页头部分开始 
 页头部分结束 

But how could I loop over all the pages, extract the contents, and append them to a single dataframe? Thanks.

Update for appending the dfs into a single dataframe:

updated_df = pd.DataFrame()

with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        # print(f"Fetching data for {key}...")
        dfs = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        # https://stackoverflow.com/questions/39710903/pd-read-html-imports-a-list-rather-than-a-dataframe
        for df in dfs:
            # transpose so the labels become column headers, then drop the label row
            df = df.T.iloc[1:, :].copy()
            updated_df = updated_df.append(df)  # note: DataFrame.append was removed in pandas 2.0
            print(updated_df)

cols = ['项目编号', '转让/出租标的名称', '转让方/出租方名称', '转让标的评估价/年租金评估价(元)',
        '转让底价/年租金底价(元)', '受让方/承租方名称', '成交价/成交年租金(元)', '成交日期']
updated_df.columns = cols
updated_df.to_excel('./data.xlsx', index=False)
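
Side note (not part of the original post): DataFrame.append was removed in pandas 2.0, so the accumulation above fails on a current pandas. A minimal sketch of the same loop using pd.concat instead, assuming the get_main_urls / get_follow_urls helpers from the answer below and the cols list above:

import pandas as pd
import requests

frames = []  # one transposed table per detail page

with requests.Session() as connection_session:
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        dfs = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        for df in dfs:
            # transpose so the labels become column headers, then drop the label row
            frames.append(df.T.iloc[1:, :].copy())

updated_df = pd.concat(frames, ignore_index=True)
updated_df.columns = cols  # same column list as above
updated_df.to_excel('./data.xlsx', index=False)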

Solution

Here's how I would do this:

  1. build all main urls
  2. visit every main page
  3. get the follow urls
  4. visit each follow url
  5. grab the table from the follow url
  6. parse the table with pandas
  7. add the table to a dictionary of pandas dataframes
  8. process the tables (not included -> implement your logic)

Repeat steps 2 - 7 to continue scraping the data.

The code:

from typing import Iterator

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.jscq.com.cn/dsf/zc/cjgg"


def get_main_urls() -> list:
    start_url = f"{BASE_URL}/index.html"
    return [start_url] + [f"{BASE_URL}/index_{i}.html" for i in range(1, 6)]


def get_follow_urls(urls: list, session: requests.Session) -> Iterator[str]:
    for url in urls[:1]:  # remove [:1] to scrape all the pages
        body = session.get(url).content
        s = BeautifulSoup(body, "lxml").find_all("td", {"width": "60%"})
        yield from [f"{BASE_URL}{a.find('a')['href'][1:]}" for a in s]


dataframe_collection = {}

with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        print(f"Fetching data for {key}...")
        df = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        dataframe_collection[key] = df

    # process the dataframe_collection here

# print the dictionary of dataframes (optional and can be removed)
for key in dataframe_collection.keys():
    print("\n" + "=" * 40)
    print(key)
    print("-" * 40)
    print(dataframe_collection[key])

Output:

Fetching data for t20210311_30347...
Fetching data for t20210311_30346...
Fetching data for t20210305_30338...
Fetching data for t20210305_30337...
Fetching data for t20210303_30323...
Fetching data for t20210225_30306...
Fetching data for t20210225_30305...
Fetching data for t20210225_30304...
Fetching data for t20210225_30303...
Fetching data for t20210209_30231...

and then ...
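
To turn the collected tables into a single dataframe (step 8 above, which the answer leaves to the reader), one possible sketch, assuming each detail page yields exactly one table in the list returned by pd.read_html:

# sketch of step 8: flatten dataframe_collection into one dataframe
rows = []
for key, tables in dataframe_collection.items():
    table = tables[0]                 # assumption: one table per detail page
    rows.append(table.T.iloc[1:, :])  # transpose so the labels become column headers
combined = pd.concat(rows, ignore_index=True)
print(combined)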
