pd.read_html: Read only the data, not the shape of the table [A rows x B columns] (Python)

Problem description

I'm scraping data from a website with pd.read_html in a loop, using Python 3.7, and I'm struggling to export it.

Relevant part of the HTML string:

html_source =

<div class="reiterZwischenzeile">
    &nbsp;
</div>
<table class="tabelleOhneWidth" width="100%" cellspacing="0px">
    <colgroup>
        <col class="left" width="300px" valign="middle">
        <col class="left" width="80px" valign="middle">
        <col class="left" width="80px" valign="middle">
        <col class="left" width="80px" valign="middle">
        <col class="left" width="80px" valign="middle">
        <col class="left" width="20px" valign="middle">
        <col class="left" width="80px" valign="middle">
        <col class="left" width="80px" valign="middle">
        <col class="left" width="20px" valign="middle">
        <col class="left" width="20px" valign="middle">
    </colgroup>
    <tbody><tr>
        <td class="tabelleKopfUo left" colspan="2" rowspan="2">
            Teilarbeit
        </td>
        <td class="tabelleKopfUo center" rowspan="2">
            Arbeitszeit-<br>bedarf
        </td>
        <td class="tabelleKopfUo center" rowspan="2">
            Flächen-<br>leistung
        </td>
        <td class="tabelleKopfUo center" colspan="5">
            Maschinenkosten
        </td>
        <td class="tabelleKopfUo center" rowspan="2">
            Diesel-<br>bedarf
        </td>
    </tr>
    <tr>
        <td class="tabelleKopfOoUo center">
            Abschreibung
        </td>
        <td class="tabelleKopfOoUo center">
            Zinskosten
        </td>
        <td class="tabelleKopfOoUo center">
            Sonstiges&nbsp;<img src="images/info_white_10.png" border="none"> 
        </td>
        <td class="tabelleKopfOoUo center">
            Reparaturen
        </td>
        <td class="tabelleKopfOoUo center">
            Betriebsstoffe
        </td>
    </tr>
    <tr>
        <td class="tabelleKopfOo center" colspan="2"></td>
        <td class="tabelleKopfOo center">
            Akh/ha
        </td>
        <td class="tabelleKopfOo center">
            ha/h
        </td>
        <td class="tabelleKopfOo center" colspan="5">
            €/ha
        </td>
        <td class="tabelleKopfOo center" colspan="5">
            l/ha
        </td>
    </tr>

        <tr>
            <td class="tabelleEbene2  left">
                2.000 l, Aufbaupflanzenschutzspritze; 138 kW
            </td>
            <td class="tabelleEbene2  right">
                Feldarbeit
            </td>
            <td class="tabelleEbene2  right">
                0.11
            </td>
            <td class="tabelleEbene2  right">
                9.09
            </td>
            <td class="tabelleEbene2  right">
                3.72
            </td>
            <td class="tabelleEbene2  right">
                0.91
            </td>
            <td class="tabelleEbene2  right">
                0.24
            </td>
            <td class="tabelleEbene2  right">
                1.59
            </td>
            <td class="tabelleEbene2  right">
                0.68
            </td>
            <td class="tabelleEbene2  right">
                0.90
            </td>
        </tr>



</tbody></table>




Then I read the HTML tables in every iteration like this:


df_list = pd.read_html(html_source, skiprows=[0, 1, 2])


Printing df_list gives me this (indexing with df_list[0] doesn't help either):

print(df_list)

[                                             0           1     2   ...  11  12  13
0  2.000 l, Aufbaupflanzenschutzspritze; 138 kW  Feldarbeit  0.11  ...            

[1 rows x 14 columns]]

I tried the same with a simple HTML file like this:

<html>

<body>


<table><tr></tr></table>
<table><tr></tr></table>


blablabal
blabalalb
slkjflsjbs
sjflsbsb


Table1
<table border=1>
<tr>
<td>Test1</td><td>3</td><td>6</td><td>8.8</td><td>Test</td>
</tr>
<tr>
</tr>
<td>4</td><td>7</td><td>8</td><td>88</td><td>Test</td>
<td>74</td><td>77</td><td>78</td><td>88</td><td>Test</td><td>74</td><td>77</td><td>78</td><td>88</td><td>Test</td>
</table>


</body>

</html>


import pandas as pd

htmlname = r"example.html"
# Read the whole file and close it again when done.
with open(htmlname, 'r') as html:
    source_code = html.read()
#print(source_code)
tables = pd.read_html(source_code, skiprows=[1])

print(tables)

[       0  1  2    3     4
0  Test1  3  6  8.8  Test]
>>>

Why do I get this shape description when I read the table from the website, and how can I get rid of it?

Recommended answer

Try this option:

import pandas as pd

# Never print the "[A rows x B columns]" footer.
pd.options.display.show_dimensions = False
df_list = pd.read_html(html_source, skiprows=3)
print(df_list)
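
The option above changes the display for the whole session. As a minimal sketch, the same effect can be limited to a block with `pd.option_context`. The 1 x 14 frame below is a made-up stand-in for the scraped table, and `display.max_columns` is set to 6 only to force truncation regardless of console width; both are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: 1 row, 14 columns.
df = pd.DataFrame([list(range(14))])

# Force horizontal truncation so the result does not depend on console width.
with pd.option_context("display.max_columns", 6,
                       "display.show_dimensions", "truncate"):
    with_footer = repr(df)       # truncated, so the shape footer is appended

with pd.option_context("display.max_columns", 6,
                       "display.show_dimensions", False):
    without_footer = repr(df)    # same truncation, footer suppressed

print(with_footer)
print(without_footer)
```

Outside the `with` blocks the display options return to their previous values, so other printing in the scraping loop is unaffected.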

As for why the dimensions are shown for the first HTML source: in newer pandas versions the dimensions are not printed for small DataFrames that fit in the console. They are printed only when the output is too large to display in full and gets truncated. For example, in your case:

df = pd.concat(df_list)
df1 = df[df.columns[range(4)]]
df1

If you select only 4 columns from df_list, the dimensions are not shown, because 4 columns fit in the console while all 14 do not.
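
Since the question mentions trouble exporting, note that the `[A rows x B columns]` line belongs only to the printed representation and not to the data itself. The sketch below, again using a hypothetical 1 x 14 frame in place of `df_list[0]`, suggests that `to_csv` writes every column and no shape line:

```python
import io

import pandas as pd

# Hypothetical stand-in for df_list[0]: 1 row, 14 columns.
df = pd.DataFrame([[f"v{i}" for i in range(14)]])

# Write to an in-memory buffer; a file path could be passed to to_csv instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# The export holds one header line and one data line, nothing else.
header, row = csv_text.splitlines()
print(len(header.split(",")))   # number of exported columns
print(len(row.split(",")))      # number of exported values
```

The printed frame may hide columns behind `...`, but the exported text still contains all of them.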
