Store web scraping results in a DataFrame or dictionary
Problem description
I'm taking an online course, and I'm trying to automate the process of capturing the course structure for my personal notes, which I keep locally in a Markdown file.
Here's an example chapter:
And here's a sample of how the HTML looks:
<!-- Header of the chapter -->
<div class="chapter__header">
  <div class="chapter__title-wrapper">
    <span class="chapter__number">
      <span class="chapter-number">1</span>
    </span>
    <h4 class="chapter__title">
      Introduction to Experimental Design
    </h4>
    <span class="chapter__price">
      Free
    </span>
  </div>
  <div class="dc-progress-bar dc-progress-bar--small chapter__progress">
    <span class="dc-progress-bar__text">0%</span>
    <div class="dc-progress-bar__bar chapter__progress-bar">
      <span class="dc-progress-bar__fill" style="width: 0%;"></span>
    </div>
  </div>
</div>
<p class="chapter__description">
  An introduction to key parts of experimental design plus some power and sample size calculations.
</p>
<!-- !Header of the chapter -->
<!-- Body of the chapter -->
<ul class="chapter__exercises hidden">
  <li class="chapter__exercise ">
    <a class="chapter__exercise-link" href="https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1">
      <span class="chapter__exercise-icon exercise-icon ">
        <img width="23" height="23" src="https://cdn.datacamp.com/main-app/assets/courses/icon_exercise_video-3b15ea50771db747f7add5f53e535066f57d9f94b4b0ebf1e4ddca0347191bb8.svg" alt="Icon exercise video" />
      </span>
      <h5 class="chapter__exercise-title" title='Intro to Experimental Design'>Intro to Experimental Design</h5>
      <span class="chapter__exercise-xp">
        50 xp
      </span>
    </a>
  </li>
So far, I've used BeautifulSoup to pull out all the relevant information:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
lesson_outline = soup.find_all(['h4', 'li'])

outline_list = []
for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            outline_list.append(item.text.strip())
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            outline_list.append(lesson_name)
            outline_list.append(lesson_link)
    except KeyError:
        pass
This gives me a list like this:
['Introduction to Experimental Design', 'Intro to Experimental Design', 'https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1',...]
My goal is to put this all into an .md file that would look something like this:
# Introduction to Experimental Design
* [Intro to Experimental Design](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1)
* [A basic experiment](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=2)
My question is: What's the best way to structure this data so that I can easily access it later on when I'm writing the text file? Would it be better to have a DataFrame with columns chapter, lesson, lesson_link? A DataFrame with a MultiIndex? A nested dictionary? If it were a dictionary, what should I name the keys? Or is there another option I'm missing? Some sort of database?
Any thoughts would be much appreciated!
Solution
If I read this correctly, you're currently appending every element to the list outline_list in order of its appearance. But you don't have one kind of data here; you have three distinct kinds:
chapter__title
chapter__exercise.name
chapter__exercise.link
Each title can have multiple exercises, and each exercise is always a pair of name and link. Since you also want to keep this structure in your text file, you can use any data structure that represents this hierarchy. An example:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from collections import OrderedDict

url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
lesson_outline = soup.find_all(['h4', 'li'])

# Using OrderedDict assures that the order of the result will be the same as in the source
chapters = OrderedDict()  # {chapter: [(lesson_name, lesson_link), ...], ...}
for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            chapter = item.text.strip()
            chapters[chapter] = []
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            chapters[chapter].append((lesson_name, lesson_link))
    except KeyError:
        pass
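For reference, here is a minimal sketch of the shape that loop produces, using hand-typed sample data rather than a live request. The two lessons are the ones shown in the question; the variable name base is only a convenience for this sketch. A plain dict is used on the assumption of Python 3.7 or later, where dicts keep insertion order, so OrderedDict is only needed on older versions.

```python
# Hand-typed sample of the structure built by the loop above (not scraped live).
base = ('https://campus.datacamp.com/courses/experimental-design-in-r/'
        'introduction-to-experimental-design')

chapters = {
    'Introduction to Experimental Design': [
        ('Intro to Experimental Design', base + '?ex=1'),
        ('A basic experiment', base + '?ex=2'),
    ],
}

# Look up the lessons of one chapter by its title.
lessons = chapters['Introduction to Experimental Design']

# Each lesson is a (name, link) tuple, so it unpacks directly.
first_name, first_link = lessons[0]
print(first_name)
print(first_link)
print(len(lessons))
```

Keying by chapter title keeps lookups simple, but it assumes titles are unique within a course; if they are not, a list of (title, lessons) pairs would be the safer choice.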
From there it should be easy to write your text file:
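# 'course_outline.md' is a placeholder file name
with open('course_outline.md', 'w', encoding='utf-8') as f:
    for chapter, lessons in chapters.items():
        # write chapter title
        f.write(f'# {chapter}\n')
        for lesson_name, lesson_link in lessons:
            # write lesson
            f.write(f'* [{lesson_name}]({lesson_link})\n')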
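If you would still prefer a table, as the question suggests, the same mapping can be flattened into one row per lesson. This is only a sketch and not part of the original answer: the helper name to_rows and the sample data are hypothetical, and the column names are the ones proposed in the question. A list of dicts like this can be passed straight to the pandas DataFrame constructor; pandas is left out here to keep the example free of dependencies.

```python
def to_rows(chapters):
    """Flatten {chapter: [(lesson_name, lesson_link), ...]} into row dicts."""
    return [
        {'chapter': chapter, 'lesson': name, 'lesson_link': link}
        for chapter, lessons in chapters.items()
        for name, link in lessons
    ]


# Hand-typed sample data in the same shape as `chapters` above.
base = ('https://campus.datacamp.com/courses/experimental-design-in-r/'
        'introduction-to-experimental-design')
sample = {
    'Introduction to Experimental Design': [
        ('Intro to Experimental Design', base + '?ex=1'),
        ('A basic experiment', base + '?ex=2'),
    ],
}

rows = to_rows(sample)
for row in rows:
    print(row['chapter'], '|', row['lesson'], '|', row['lesson_link'])
```

The nested mapping is the more natural fit for writing the Markdown file, while the flat rows are easier to filter or export. Since one converts to the other in a few lines, choosing one now does not lock you in.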