Python Selenium scraping tbody

Question

Below is the HTML that I'm trying to scrape:

<div class="data-point-container section-break">
    <!-- some other div elements here which I don't need -->
    <table class data-bind="showHidden: isData">
          <!-- ko foreach : sections -->
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
          <!-- /ko -->
    </table>
</div>

How do I use pandas.read_html to scrape all of this information, with each thead as the headers and each tbody as the values?

This is the site that I'm trying to scrape; I want the data extracted into a pandas DataFrame. Link here

Recommended answer

Strictly speaking, a table should not have more than one thead element, according to the table element specification.

If you still have this structure of a thead followed by its corresponding tbody, I would parse it iteratively, putting every such pair into its own DataFrame.

Working example:

from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

data = """
<div class="data-point-container section-break">
    <table class data-bind="showHidden: isData">

        <thead>
            <tr><th>Customer</th><th>Order</th><th>Month</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 1</td><td>#1</td><td>January</td></tr>
            <tr><td>Customer 2</td><td>#2</td><td>April</td></tr>
            <tr><td>Customer 3</td><td>#3</td><td>March</td></tr>
        </tbody>

        <thead>
            <tr><th>Customer</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 4</td></tr>
            <tr><td>Customer 5</td></tr>
            <tr><td>Customer 6</td></tr>
        </tbody>

    </table>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# Pair each thead with the next tbody sibling that follows it.
for thead in soup.select(".data-point-container table thead"):
    tbody = thead.find_next_sibling("tbody")

    # Wrap the pair in its own <table> so read_html sees one well-formed table.
    table = "<table>%s</table>" % (str(thead) + str(tbody))

    # Recent pandas versions expect a file-like object rather than a raw HTML string.
    df = pd.read_html(StringIO(table))[0]
    print(df)
    print("-----")

This prints two DataFrames, one for every thead/tbody pair in the sample input HTML:

     Customer Order    Month
0  Customer 1    #1  January
1  Customer 2    #2    April
2  Customer 3    #3    March
-----
     Customer
0  Customer 4
1  Customer 5
2  Customer 6
-----

Note that I've intentionally made the number of header and data cells different in every block for demonstration purposes.
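
If you would rather keep the resulting DataFrames than print them, the same pairing idea can be wrapped in a small helper. The following is only a minimal sketch under some assumptions: the function name `sections_to_frames` and its `table_selector` parameter are made up for illustration, the header cells are assumed to be `th` elements and the data cells `td` elements, and every data row is assumed to have as many cells as its header row (otherwise the `pd.DataFrame` constructor raises a `ValueError`). It builds each DataFrame directly from the cell text instead of calling `pd.read_html`, so all values stay as strings.

```python
import pandas as pd
from bs4 import BeautifulSoup


def sections_to_frames(html, table_selector=".data-point-container table"):
    """Return one DataFrame per thead/tbody pair (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for thead in soup.select(f"{table_selector} thead"):
        # The next tbody sibling holds the values for this header block.
        tbody = thead.find_next_sibling("tbody")
        if tbody is None:
            continue  # a header with no body after it: nothing to collect
        headers = [th.get_text(strip=True) for th in thead.find_all("th")]
        rows = [
            [td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in tbody.find_all("tr")
        ]
        frames.append(pd.DataFrame(rows, columns=headers))
    return frames


# Small sample in the same shape as the HTML from the question.
sample = """
<div class="data-point-container section-break">
    <table>
        <thead><tr><th>Customer</th><th>Order</th><th>Month</th></tr></thead>
        <tbody>
            <tr><td>Customer 1</td><td>#1</td><td>January</td></tr>
            <tr><td>Customer 2</td><td>#2</td><td>April</td></tr>
            <tr><td>Customer 3</td><td>#3</td><td>March</td></tr>
        </tbody>
        <thead><tr><th>Customer</th></tr></thead>
        <tbody>
            <tr><td>Customer 4</td></tr>
            <tr><td>Customer 5</td></tr>
            <tr><td>Customer 6</td></tr>
        </tbody>
    </table>
</div>
"""

frames = sections_to_frames(sample)
for frame in frames:
    print(frame)
    print("-----")
```

Returning a list keeps each section separate, which fits the case above where the blocks have different numbers of columns.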
