Extract content from web pages and save as dataframe in Python

Views: 84  Published: 2021/4/15 19:13:49  Tags: python-3.x pandas web-scraping beautifulsoup web-crawler


Problem description

I am trying to extract the content from this link that is marked with a blue circle in the image below:

Code:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.cspea.com.cn/list/c01/gr2020bj1005297-3'
    res = requests.get(url, verify = False)
    html_page = res.content
    soup = BeautifulSoup(html_page, 'html.parser')
    text = soup.find_all(text=True)

    output = ''
    blacklist = [
        '[document]',
        'a',
        'b',
        'body',
        'div',
        'em',
        'h1',
        'h2',
        'h3',
        'head',
        'html',
        'i',
        'meta',
        'p',
        'script',
        # 'span',
        # 'td',
        # 'th',
        # 'title'
        # there may be more elements you don't want, such as "style", etc.
    ]

    for t in text:
        if t.parent.name not in blacklist:
            output += '{} '.format(t)

    print(output)

How can I extract the data and save the content as a dataframe?

Solution

You can use this example as a basis for scraping the page (as I don't know Chinese, I put all cells into the dataframe; you can remove the rows you don't need afterwards):

    import urllib3
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    url = "https://www.cspea.com.cn/list/c01/gr2021bj1000186"
    soup = BeautifulSoup(requests.get(url, verify=False).content, "html.parser")

    index, data = [], []
    for th in soup.select(".project-detail-left th"):
        h = th.get_text(strip=True)
        t = th.find_next("td").get_text(strip=True)
        index.append(h)
        data.append(t)

    df = pd.DataFrame(data, index=index, columns=["value"])
    print(df)

Prints:

                value
    项目名称    海南省三亚市吉阳区溪泽南路18号兰海水都花园29幢
    项目编号    GR2021BJ1000186
    受让方名称  **
    交易方式    网络竞价
    ...etc.
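
To try the header/cell pairing without hitting the live site, and to show one way of trimming and saving the result, here is a minimal offline sketch. The HTML snippet is a made-up stand-in for the real page (only the `.project-detail-left th` selector comes from the answer), and the helper name `detail_table_to_frame`, the `field` column name, and the list of wanted labels are assumptions chosen for illustration.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's structure may differ from this stand-in.
SAMPLE_HTML = """
<div class="project-detail-left">
  <table>
    <tr><th>项目名称</th><td>海南省三亚市吉阳区溪泽南路18号兰海水都花园29幢</td></tr>
    <tr><th>项目编号</th><td>GR2021BJ1000186</td></tr>
    <tr><th>受让方名称</th><td>**</td></tr>
    <tr><th>交易方式</th><td>网络竞价</td></tr>
  </table>
</div>
"""


def detail_table_to_frame(html: str) -> pd.DataFrame:
    """Pair each <th> with the next <td> and return one row per pair."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for th in soup.select(".project-detail-left th"):
        td = th.find_next("td")
        # Skip a header with no following cell instead of raising AttributeError.
        if td is None:
            continue
        rows.append((th.get_text(strip=True), td.get_text(strip=True)))
    return pd.DataFrame(rows, columns=["field", "value"]).set_index("field")


df = detail_table_to_frame(SAMPLE_HTML)

# Keep only the rows of interest (labels picked here purely as an example).
wanted = ["项目名称", "项目编号"]
subset = df[df.index.isin(wanted)]

# Write CSV to an in-memory buffer; pass a file path instead to save to disk.
buffer = io.StringIO()
subset.to_csv(buffer)
csv_text = buffer.getvalue()

print(subset)
```

Collecting `(field, value)` tuples and setting the index afterwards keeps each label and its cell together in one place, so a header without a matching cell cannot shift the two lists out of step.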
