Extract content from web pages and save as dataframe in Python

Question

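I am trying to extract content from the web page whose URL appears in the code below (the part of interest was circled in blue in a screenshot attached to the original question).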

Code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.cspea.com.cn/list/c01/gr2020bj1005297-3'
res = requests.get(url, verify=False)  # certificate verification is disabled
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(string=True)  # every text node in the document

output = ''
# Text whose parent tag is listed here is skipped
blacklist = [
    '[document]',
    'a',
    'b',
    'body',
    'div',
    'em',
    'h1',
    'h2',
    'h3',
    'head',
    'html',
    'i',
    'meta',
    'p',
    'script',
    # 'span',
    # 'td',
    # 'th',
    # 'title'
    # there may be more elements you don't want, such as "style", etc.
]

# Concatenate the remaining text nodes into one space-separated string
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)

How can I extract the data and save the content as a dataframe?

Solution

You can use this example as a basis for scraping the page. As I don't know Chinese, I put every cell into the dataframe; you can remove the rows you don't need afterwards:

import pandas as pd
import requests
import urllib3
from bs4 import BeautifulSoup

# verify=False below raises an InsecureRequestWarning on each request; silence it
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = "https://www.cspea.com.cn/list/c01/gr2021bj1000186"

soup = BeautifulSoup(requests.get(url, verify=False).content, "html.parser")

# Each <th> holds a field name; the next <td> after it holds the value
index, data = [], []
for th in soup.select(".project-detail-left th"):
    h = th.get_text(strip=True)
    t = th.find_next("td").get_text(strip=True)
    index.append(h)
    data.append(t)

# One row per field, labelled by the header text
df = pd.DataFrame(data, index=index, columns=["value"])
print(df)

Prints:

                                                             value
项目名称                                     海南省三亚市吉阳区溪泽南路18号兰海水都花园29幢
项目编号                                               GR2021BJ1000186
受让方名称                                                           **
交易方式                                                          网络竞价

...etc.
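
The answer leaves "remove the rows you don't need" and the saving step to the reader, so here is a minimal offline sketch of both. It assumes the page keeps the th/td layout targeted by the selector above. SAMPLE_HTML, table_to_frame and the wanted list are hypothetical names introduced only for illustration, the sample values are placeholders, and the None check is an addition that the original answer does not have.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: a small table using the same
# th/td layout that the ".project-detail-left th" selector relies on.
SAMPLE_HTML = """
<div class="project-detail-left">
  <table>
    <tr><th>项目名称</th><td>示例项目</td></tr>
    <tr><th>项目编号</th><td>GR2021BJ1000186</td></tr>
    <tr><th>受让方名称</th><td>**</td></tr>
    <tr><th>交易方式</th><td>网络竞价</td></tr>
  </table>
</div>
"""


def table_to_frame(html: str) -> pd.DataFrame:
    """Pair every header cell with the data cell that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    index, data = [], []
    for th in soup.select(".project-detail-left th"):
        td = th.find_next("td")
        if td is None:  # header with no value cell after it: skip
            continue
        index.append(th.get_text(strip=True))
        data.append(td.get_text(strip=True))
    return pd.DataFrame(data, index=index, columns=["value"])


df = table_to_frame(SAMPLE_HTML)

# Keep only the rows of interest; which labels to keep is an arbitrary choice.
wanted = ["项目名称", "项目编号", "交易方式"]
subset = df[df.index.isin(wanted)]

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = subset.to_csv()
print(csv_text)
```

To write a file instead, pass a path, for example subset.to_csv("project.csv", encoding="utf-8-sig"); the file name is arbitrary, and the utf-8-sig encoding adds a byte-order mark that helps Excel recognise the Chinese text as UTF-8.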
