Python Selenium scraping tbody
Question
Below is the HTML I'm trying to scrape:
<div class="data-point-container section-break">
# some other HTML div classes here which I don't need
<table class data-bind="showHidden: isData">
<!-- ko foreach : sections -->
<thead>...</thead>
<tbody>...</tbody>
<thead>...</thead>
<tbody>...</tbody>
<thead>...</thead>
<tbody>...</tbody>
<thead>...</thead>
<tbody>...</tbody>
<thead>...</thead>
<tbody>...</tbody>
<!-- /ko -->
</table>
</div>
How do I use pandas.read_html to scrape all of this information, with each thead as the headers and each tbody as the values?
This is the site I'm trying to scrape; I want the data extracted into a pandas DataFrame. Link here
Recommended answer
Strictly speaking, a table should not have more than one thead element, according to the table element specification.
If you still have this structure of a thead followed by its corresponding tbody, I would parse it iteratively, putting each such pair into its own DataFrame.
Working example:
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

data = """
<div class="data-point-container section-break">
    <table class data-bind="showHidden: isData">
        <thead>
            <tr><th>Customer</th><th>Order</th><th>Month</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 1</td><td>#1</td><td>January</td></tr>
            <tr><td>Customer 2</td><td>#2</td><td>April</td></tr>
            <tr><td>Customer 3</td><td>#3</td><td>March</td></tr>
        </tbody>
        <thead>
            <tr><th>Customer</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 4</td></tr>
            <tr><td>Customer 5</td></tr>
            <tr><td>Customer 6</td></tr>
        </tbody>
    </table>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# Pair every thead with the tbody that follows it and parse each pair on its own
for thead in soup.select(".data-point-container table thead"):
    tbody = thead.find_next_sibling("tbody")
    # Wrap the pair in its own table so read_html sees a single well-formed table
    table = "<table>%s</table>" % (str(thead) + str(tbody))
    # StringIO: recent pandas versions deprecate passing a literal HTML string
    df = pd.read_html(StringIO(table))[0]
    print(df)
    print("-----")
Prints 2 DataFrames, one for every thead and tbody pair in the sample input HTML:
     Customer Order    Month
0  Customer 1    #1  January
1  Customer 2    #2    April
2  Customer 3    #3    March
-----
     Customer
0  Customer 4
1  Customer 5
2  Customer 6
-----
Note that I've intentionally made the number of header and data cells differ between the blocks for demonstration purposes.
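If you want to keep the DataFrames rather than just print them, the same thead/tbody pairing can be wrapped in a small helper that returns a list. This is only a sketch under a few assumptions: the function name sections_to_frames and its table_selector parameter are made up for illustration, each thead is assumed to hold a single header row, and every body row is assumed to have as many cells as its header. It builds each DataFrame directly from the cell text instead of calling pandas.read_html, so all values stay as strings. Since the question mentions Selenium, the html argument could plausibly be the rendered page, for example driver.page_source, but that part is not tested here.

```python
import pandas as pd
from bs4 import BeautifulSoup


def sections_to_frames(html, table_selector=".data-point-container table"):
    """Return one DataFrame per thead/tbody pair found under table_selector.

    Hypothetical helper: assumes one header row per thead and body rows
    whose cell count matches the header.
    """
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for thead in soup.select(f"{table_selector} thead"):
        tbody = thead.find_next_sibling("tbody")
        if tbody is None:
            continue  # a header with no body after it: nothing to tabulate
        headers = [th.get_text(strip=True) for th in thead.find_all("th")]
        rows = [
            [td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in tbody.find_all("tr")
        ]
        frames.append(pd.DataFrame(rows, columns=headers))
    return frames


# Small made-up sample in the same shape as the HTML in the question
sample = """
<div class="data-point-container">
  <table>
    <thead><tr><th>Customer</th><th>Order</th></tr></thead>
    <tbody>
      <tr><td>Customer 1</td><td>#1</td></tr>
      <tr><td>Customer 2</td><td>#2</td></tr>
    </tbody>
    <thead><tr><th>Customer</th></tr></thead>
    <tbody><tr><td>Customer 3</td></tr></tbody>
  </table>
</div>
"""

frames = sections_to_frames(sample)
for df in frames:
    print(df)
    print("-----")
```

Returning a list keeps each section separate, which matters here because the blocks do not share the same columns; concatenating them would only make sense if the headers matched.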