Store web scraping results in DataFrame or dictionary

Views: 37 | Published: 2021/9/24 19:06:03 | Tags: python, dictionary, dataframe, web-scraping, beautifulsoup


Problem description

I am taking an online course and am trying to automate capturing the course structure for my personal notes, which I keep locally in a Markdown file.

Here is an example of what the HTML for one chapter looks like:


    <div class="chapter__header">
      <div class="chapter__title-wrapper">
        <span class="chapter__number">
          <span class="chapter-number">1</span>
        </span>
        <h4 class="chapter__title"> Introduction to Experimental Design </h4>
        <span class="chapter__price"> Free </span>
      </div>
      <div class="dc-progress-bar dc-progress-bar--small chapter__progress">
        <span class="dc-progress-bar__text">0%</span>
        <div class="dc-progress-bar__bar chapter__progress-bar">
          <span class="dc-progress-bar__fill" style="width: 0%;"></span>
        </div>
      </div>
    </div>
    <p class="chapter__description"> An introduction to key parts of experimental design plus some power and sample size calculations. </p>
    <ul class="chapter__exercises hidden">
      <li class="chapter__exercise ">
        <a class="chapter__exercise-link" href="https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1">
          <span class="chapter__exercise-icon exercise-icon ">
            <img width="23" height="23" src="https://cdn.datacamp.com/main-app/assets/courses/icon_exercise_video-3b15ea50771db747f7add5f53e535066f57d9f94b4b0ebf1e4ddca0347191bb8.svg" alt="Icon exercise video" />
          </span>
          <h5 class="chapter__exercise-title" title='Intro to Experimental Design'>Intro to Experimental Design</h5>
          <span class="chapter__exercise-xp"> 50 xp </span>
        </a>
      </li>
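The markup above can be tried offline. The following is a minimal sketch, not the approach used in the post's own code: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages, and it parses a trimmed copy of the sample instead of fetching the live page. `OutlineParser` and `SAMPLE_HTML` are hypothetical names introduced here for illustration only.

```python
from html.parser import HTMLParser

# Assumption: a trimmed copy of the sample markup; unrelated elements are dropped.
SAMPLE_HTML = """
<h4 class="chapter__title"> Introduction to Experimental Design </h4>
<ul class="chapter__exercises hidden">
  <li class="chapter__exercise ">
    <a class="chapter__exercise-link" href="https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1">
      <h5 class="chapter__exercise-title">Intro to Experimental Design</h5>
    </a>
  </li>
</ul>
"""


class OutlineParser(HTMLParser):
    """Hypothetical helper: collects {chapter title: [(lesson name, lesson link), ...]}."""

    def __init__(self):
        super().__init__()
        self.chapters = {}     # plain dicts keep insertion order on Python 3.7+
        self._chapter = None   # title of the chapter currently being filled
        self._link = None      # href of the most recent exercise link
        self._capture = None   # "chapter" or "lesson" while inside an h4/h5
        self._text = []        # text fragments of the element being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h4" and "chapter__title" in classes:
            self._capture, self._text = "chapter", []
        elif tag == "a" and "chapter__exercise-link" in classes:
            self._link = attrs.get("href")
        elif tag == "h5" and "chapter__exercise-title" in classes:
            self._capture, self._text = "lesson", []

    def handle_data(self, data):
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h4" and self._capture == "chapter":
            self._chapter = "".join(self._text).strip()
            self.chapters[self._chapter] = []
            self._capture = None
        elif tag == "h5" and self._capture == "lesson":
            name = "".join(self._text).strip()
            # Lessons seen before any chapter title are skipped.
            if self._chapter is not None:
                self.chapters[self._chapter].append((name, self._link))
            self._capture = None


parser = OutlineParser()
parser.feed(SAMPLE_HTML)
print(parser.chapters)
```

Matching on the full class list (`"chapter__exercise-link" in classes`) rather than only the first class is a design choice of this sketch; it tolerates elements that carry several classes in a different order.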

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    url = 'https://www.datacamp.com/courses/experimental-design-in-r'
    html = urlopen(url)
    soup = BeautifulSoup(html, 'lxml')
    lesson_outline = soup.find_all(['h4', 'li'])
    outline_list = []
    for item in lesson_outline:
        attributes = item.attrs
        try:
            class_type = attributes['class'][0]
            if class_type == 'chapter__title':
                outline_list.append(item.text.strip())
            if class_type == 'chapter__exercise':
                lesson_name = item.find('h5').text
                lesson_link = item.find('a').attrs['href']
                outline_list.append(lesson_name)
                outline_list.append(lesson_link)
        except KeyError:
            pass

    # Introduction to Experimental Design
    * [Intro to Experimental Design](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1)
    * ['A basic experiment](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=2)
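One way the outline format above could be generated is sketched below. It assumes the scraped results have already been grouped into a mapping from chapter title to a list of `(lesson_name, lesson_link)` pairs. `render_outline` and the `example` data are illustrative names, not part of the original post.

```python
def render_outline(chapters):
    """Return a Markdown outline: a '# ' heading per chapter, a '* [name](link)' bullet per lesson."""
    lines = []
    for chapter, lessons in chapters.items():
        lines.append(f"# {chapter}")
        for lesson_name, lesson_link in lessons:
            lines.append(f"* [{lesson_name}]({lesson_link})")
        lines.append("")  # blank line between chapters
    return "\n".join(lines).rstrip() + "\n"


# Assumption: hand-written sample data in the shape described above.
example = {
    "Introduction to Experimental Design": [
        ("Intro to Experimental Design",
         "https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1"),
        ("A basic experiment",
         "https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=2"),
    ],
}

print(render_outline(example))
```

The returned string can be written to a notes file with, for example, `pathlib.Path("outline.md").write_text(...)`, where the file name is only a placeholder.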

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    from collections import OrderedDict

    url = 'https://www.datacamp.com/courses/experimental-design-in-r'
    html = urlopen(url)
    soup = BeautifulSoup(html, 'lxml')
    lesson_outline = soup.find_all(['h4', 'li'])
    # Using OrderedDict assures that the order of the result will be the same as in the source
    chapters = OrderedDict()  # {chapter: [(lesson_name, lesson_link), ...], ...}
    for item in lesson_outline:
        attributes = item.attrs
        try:
            class_type = attributes['class'][0]
            if class_type == 'chapter__title':
                chapter = item.text.strip()
                chapters[chapter] = []
            if class_type == 'chapter__exercise':
                lesson_name = item.find('h5').text
                lesson_link = item.find('a').attrs['href']
                chapters[chapter].append((lesson_name, lesson_link))
        except KeyError:
            pass
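For the DataFrame half of the question, the `chapters` mapping built above could be flattened into one record per lesson. This is a minimal sketch using only the standard library; `to_records`, the column names, and the sample data are assumptions made for illustration.

```python
import csv
import io


def to_records(chapters):
    """Flatten {chapter: [(lesson_name, lesson_link), ...]} into one dict per lesson."""
    return [
        {"chapter": chapter, "lesson_name": name, "lesson_link": link}
        for chapter, lessons in chapters.items()
        for name, link in lessons
    ]


# Assumption: sample data in the same shape as the `chapters` mapping above.
chapters = {
    "Introduction to Experimental Design": [
        ("Intro to Experimental Design",
         "https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1"),
        ("A basic experiment",
         "https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=2"),
    ],
}

records = to_records(chapters)

# Write the records as CSV into an in-memory buffer to show the tabular layout.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["chapter", "lesson_name", "lesson_link"])
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

If pandas is installed, `pandas.DataFrame(records)` should give a table with the same three columns. On Python 3.7 and later a plain `dict` also preserves insertion order, so `OrderedDict` is optional there.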
