Loop pages and save contents in Excel file from website in Python


Problem description


I'm trying to loop through the pages from this link and extract the part I'm interested in.

Please see the contents in the red circle in the image below.

Here's what I've tried:

import requests
from bs4 import BeautifulSoup

url = 'http://so.eastmoney.com/Ann/s?keyword=购买物业&pageindex={}'
for page in range(10):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    print(soup)

XPath for each element (might be helpful for those who don't read Chinese):

/html/body/div[3]/div/div[2]/div[2]/div[3]/h3/span  --> 【润华物业】
/html/body/div[3]/div/div[2]/div[2]/div[3]/h3/a --> 润华物业:关于公司购买理财产品的公告
/html/body/div[3]/div/div[2]/div[2]/div[3]/p/label --> 2017-04-24
/html/body/div[3]/div/div[2]/div[2]/div[3]/p/span --> 公告编号:2017-019 证券代码:836007 证券简称:润华物业  主办券商:国联证券
/html/body/div[3]/div/div[2]/div[2]/div[3]/a --> http://data.eastmoney.com/notices/detail/836007/AN201704250530124271,JWU2JWI2JWE2JWU1JThkJThlJWU3JTg5JWE5JWU0JWI4JTlh.html

I need to save the output to an Excel file. How could I do that in Python? Many thanks.

Solution

BeautifulSoup won't see this stuff, as it's rendered dynamically by JS, but there's an API endpoint you can query to get what you're after.
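The endpoint itself can be found by watching the XHR requests in the browser's network tab while paging through the search results. As a minimal sketch (using only `urllib.parse` from the standard library), this is how the query URL is assembled; all parameter values except the page number are fixed for this particular search:

```python
from urllib.parse import urlencode

# Base of the search API used by the results page (observed in the
# browser's network tab; not an officially documented endpoint).
BASE = "http://searchapi.eastmoney.com/business/Web/GetSearchList"

def build_url(page_number: int) -> str:
    params = {
        "type": 401,              # announcement search
        "pageindex": page_number, # 1-based page number
        "pagesize": 10,
        "keyword": "购买物业",     # urlencode percent-encodes the UTF-8 bytes
        "name": "normal",
    }
    return f"{BASE}?{urlencode(params)}"

print(build_url(1))
# prints:
# http://searchapi.eastmoney.com/business/Web/GetSearchList?type=401&pageindex=1&pagesize=10&keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A&name=normal
```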

Here's how:

import requests
import pandas as pd


def clean_up(text: str) -> str:
    return text.replace('</em>', '').replace(':<em>', '').replace('<em>', '')


def get_data(page_number: int) -> dict:
    url = f"http://searchapi.eastmoney.com/business/Web/GetSearchList?type=401&pageindex={page_number}&pagesize=10&keyword=购买物业&name=normal"
    headers = {
        "Referer": f"http://so.eastmoney.com/Ann/s?keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A&pageindex={page_number}",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    }
    return requests.get(url, headers=headers).json()


def parse_response(response: dict):  # a generator: yields one row per announcement
    for item in response["Data"]:
        title = clean_up(item['NoticeTitle'])
        date = item['NoticeDate']
        url = item['Url']
        notice_content = clean_up(" ".join(item['NoticeContent'].split()))
        company_name = item['SecurityFullName']
        print(f"{company_name} - {title} - {date}")
        yield [title, url, date, company_name, notice_content]


def save_results(parsed_response: list):
    df = pd.DataFrame(
        parsed_response,
        columns=['title', 'url', 'date', 'company_name', 'content'],
    )
    df.to_excel("test_output.xlsx", index=False)


if __name__ == "__main__":
    output = []
    for page in range(1, 11):
        for parsed_row in parse_response(get_data(page)):
            output.append(parsed_row)

    save_results(output)

This outputs:

栖霞物业购买资产的公告 - 2019-09-03 16:00:00 - 871792
索克物业购买资产的公告 - 2020-08-17 00:00:00 - 832816
中都物业购买股权的公告 - 2019-12-09 16:00:00 - 872955
开元物业:开元物业购买银行理财产品的公告 - 2015-05-21 16:00:00 - 831971
开元物业:开元物业购买银行理财产品的公告 - 2015-04-12 16:00:00 - 831971
盛全物业:拟购买房产的公告 - 2017-10-30 16:00:00 - 834070
润华物业购买资产暨关联交易公告 - 2016-08-23 16:00:00 - 836007
润华物业购买资产暨关联交易公告 - 2017-08-14 16:00:00 - 836007
萃华珠宝:关于拟购买物业并签署购买意向协议的公告 - 2017-07-10 16:00:00 - 002731
赛意信息:关于购买办公物业的公告 - 2020-12-02 00:00:00 - 300687

And saves the results to test_output.xlsx, an Excel file that Excel opens natively (the console output above comes from the print statements).
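If pandas' Excel writer isn't available (it needs an engine such as openpyxl installed), the standard-library csv module is a workable fallback. Writing with the utf-8-sig encoding adds a BOM so Excel detects the encoding and displays the Chinese text correctly. This is a sketch with made-up sample data, not rows from the real API:

```python
import csv

def save_results_csv(rows: list, path: str = "test_output.csv") -> None:
    # Stdlib-only alternative to df.to_excel: same columns, CSV output.
    # utf-8-sig prepends a BOM so Excel auto-detects the UTF-8 encoding.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url", "date", "company_name", "content"])
        writer.writerows(rows)

# Hypothetical sample row, just to exercise the function.
save_results_csv([["某公告", "http://example.com", "2017-04-24", "润华物业", "内容"]])
```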

PS. I don't read Chinese, so you'd have to look into the response contents yourself and pick out anything else you need.
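As a self-contained illustration of what the clean_up helper in the code above removes: the search API wraps each keyword hit in <em> highlight tags, which clean_up strips out of titles and content.

```python
# Standalone copy of the clean_up helper from the answer above.
def clean_up(text: str) -> str:
    return text.replace('</em>', '').replace(':<em>', '').replace('<em>', '')

# A title as the API returns it, with the matched keyword highlighted.
raw = "润华物业:<em>购买物业</em>的公告"
print(clean_up(raw))  # 润华物业购买物业的公告
```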
