Store web scraping results in DataFrame or dictionary

Question

I'm taking an online course, and I'm trying to automate the process of capturing the course structure for my personal notes, which I keep locally in a Markdown file.

Here's a sample of how the HTML for one chapter looks:

  <!-- Header of the chapter -->
  <div class="chapter__header">
      <div class="chapter__title-wrapper">
        <span class="chapter__number">
          <span class="chapter-number">1</span>
        </span>
        <h4 class="chapter__title">
          Introduction to Experimental Design
        </h4>
          <span class="chapter__price">
            Free
          </span>
      </div>
      <div class="dc-progress-bar dc-progress-bar--small chapter__progress">
        <span class="dc-progress-bar__text">0%</span>
        <div class="dc-progress-bar__bar chapter__progress-bar">
          <span class="dc-progress-bar__fill" style="width: 0%;"></span>
        </div>
      </div>
  </div>
  <p class="chapter__description">
    An introduction to key parts of experimental design plus some power and sample size calculations.
  </p>
  <!-- !Header of the chapter -->

<!-- Body of the chapter -->
  <ul class="chapter__exercises hidden">
      <li class="chapter__exercise ">
        <a class="chapter__exercise-link" href="https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1">
          <span class="chapter__exercise-icon exercise-icon ">
            <img width="23" height="23" src="https://cdn.datacamp.com/main-app/assets/courses/icon_exercise_video-3b15ea50771db747f7add5f53e535066f57d9f94b4b0ebf1e4ddca0347191bb8.svg" alt="Icon exercise video" />
          </span>
          <h5 class="chapter__exercise-title" title='Intro to Experimental Design'>Intro to Experimental Design</h5>
          <span class="chapter__exercise-xp">
            50 xp
          </span>
        </a>
      </li>

So far, I've used BeautifulSoup to pull out all the relevant information:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

lesson_outline = soup.find_all(['h4', 'li'])

outline_list = []

for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            outline_list.append(item.text.strip())
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            outline_list.append(lesson_name)
            outline_list.append(lesson_link)
    except KeyError:
        pass

This gives me a list like this:

['Introduction to Experimental Design', 'Intro to Experimental Design', 'https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1',...]

My goal is to put this all into an .md file that would look something like this:

# Introduction to Experimental Design

* [Intro to Experimental Design](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1)
* [A basic experiment](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=2)

My question is: What's the best way to structure this data so that I can easily access it later on when I'm writing the text file? Would it be better to have a DataFrame with columns chapter, lesson, lesson_link? A DataFrame with a MultiIndex? A nested dictionary? If it were a dictionary, what should I name the keys? Or is there another option I'm missing? Some sort of database?

Any thoughts would be much appreciated!

Solution

If I understand correctly, you're currently appending every element to the list outline_list in order of its appearance. But you don't have one kind of data, you have three: chapter titles, lesson names, and lesson links.

Each title can have multiple exercises, each of which is always a pair of name and link. Since you also want to keep this structure in your text file, you can use any data structure that represents this hierarchy. An example:

from urllib.request import urlopen
from bs4 import BeautifulSoup
from collections import OrderedDict

url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

lesson_outline = soup.find_all(['h4', 'li'])

# An OrderedDict keeps the chapters in source order (plain dicts also preserve insertion order since Python 3.7)
chapters = OrderedDict()   # {chapter: [(lesson_name, lesson_link), ...], ...}

for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            chapter = item.text.strip()
            chapters[chapter] = []
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            chapters[chapter].append((lesson_name, lesson_link))
    except KeyError:
        pass
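
To make the shape of that structure concrete, here is a small offline sketch that needs no network access. The data is hand-typed from the examples in the question, and the names sample_chapters and BASE are hypothetical, used only for illustration:

```python
from collections import OrderedDict

# Hand-typed sample data copied from the question (not scraped);
# BASE and sample_chapters are illustrative names only.
BASE = ('https://campus.datacamp.com/courses/experimental-design-in-r/'
        'introduction-to-experimental-design')

sample_chapters = OrderedDict()
sample_chapters['Introduction to Experimental Design'] = [
    ('Intro to Experimental Design', BASE + '?ex=1'),
    ('A basic experiment', BASE + '?ex=2'),
]

# Each key is a chapter title; each value is a list of (name, link) pairs.
first_chapter = next(iter(sample_chapters))
lesson_names = [name for name, _link in sample_chapters[first_chapter]]
```

Looking up a chapter by title returns its lessons in page order, which is all the Markdown writer needs.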

From there it should be easy to write your text file:

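# 'course_outline.md' is a hypothetical output file name
with open('course_outline.md', 'w', encoding='utf-8') as md_file:
    for chapter, lessons in chapters.items():
        # write the chapter title as a Markdown heading
        md_file.write(f'# {chapter}\n\n')
        for lesson_name, lesson_link in lessons:
            # write each lesson as a bulleted Markdown link
            md_file.write(f'* [{lesson_name}]({lesson_link})\n')
        md_file.write('\n')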
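
The question also asks about a DataFrame with columns chapter, lesson, lesson_link. If a flat table is preferred, the nested structure can be flattened into one row per lesson. This is a minimal sketch, assuming the {chapter: [(lesson_name, lesson_link), ...]} shape used above; chapters_to_rows and the sample data are hypothetical and only for illustration:

```python
def chapters_to_rows(chapters):
    """Flatten {chapter: [(lesson_name, lesson_link), ...]} into one dict per lesson."""
    return [
        {'chapter': chapter, 'lesson': lesson_name, 'lesson_link': lesson_link}
        for chapter, lessons in chapters.items()
        for lesson_name, lesson_link in lessons
    ]


# Hand-typed sample in the same shape as the scraped result (illustrative only).
base = ('https://campus.datacamp.com/courses/experimental-design-in-r/'
        'introduction-to-experimental-design')
sample = {
    'Introduction to Experimental Design': [
        ('Intro to Experimental Design', base + '?ex=1'),
        ('A basic experiment', base + '?ex=2'),
    ],
}

rows = chapters_to_rows(sample)
```

Assuming pandas is installed, pandas.DataFrame(rows) should turn these rows into a DataFrame with those three columns. For writing the Markdown file, though, the nested form is likely more convenient, since each chapter heading is written only once.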
