pandas read_html 在读取之前或之后进行清理 [英] pandas read_html clean up before or after read

查看:44
本文介绍了pandas read_html 在读取之前或之后进行清理的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将此 html 中的最后一个表放入数据表中.

I'm trying to get the last table in this html into a data table.

代码如下:

import pandas as pd
a=pd.read_html('https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm')
print (a[23])

正如你所看到的,它读入了它,但需要清理.我的问题是针对有使用此功能经验的人.读入它然后尝试在之后或之前清理它会更好吗?如果有人知道怎么做,请发布一些代码.谢谢.

As you can see it reads it in, but needs to be cleaned up. My question is for someone who has experience with using this function. Is it better to read it in and then try to clean it up afterwards or before? And if anybody knows how to do it, please post some code. Thanks.

推荐答案

清理原始数据总是更好,因为任何处理都可能引入伪影.您的 HTML 表是使用 span 功能创建的,这就是为什么如果在 HTML 解析后清理 DataFrame 就不可能以通用方式提取数据的原因.所以我建议你安装一个专门用于此的小模块:从 HTML 中提取数据表.在命令行中运行

It is always better to clean original data, because any processing might introduce artifacts. Your HTML table is created using span feature, and this is why it impossible to extract the data in generic way if you clean the DataFrame after HTML parsing. So I suggest you install a small module which is intended exactly to this: extracting data out of HTML tables. Run in your command line

pip install html-table-extractor 

在获得页面的原始 HTML 之后(您还需要 requests),处理表格并清除重复条目:

After this get the raw HTML of the page (you will need requests also), process the table and clean duplicate entries:

import requests
import pandas as pd
from collections import OrderedDict
from html_table_extractor.extractor import Extractor

pd.set_option('display.width', 400)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)

# get raw html
resp = requests.get('https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm')

# find last table
beg = resp.text.rfind('<table')
end = resp.text.rfind('</table')
html = resp.text[beg:end+8]

# process table
ex = Extractor(html)
ex.parse()
list_of_lines = ex.return_list()

# now you have some columns with recurrent values
df_dirty = pd.DataFrame(list_of_lines)
# print(df_dirty)

## we need to consolidate some columns

# find column names
names_line = 2
col_names = OrderedDict()
# for each column find repetitions
for el in list_of_lines[names_line]:
    col_names[el] = [i for i, x in enumerate(list_of_lines[names_line]) if x == el]

# now consolidate repetitive values
storage = OrderedDict() # this will contain columns
for k in col_names:
    res = []
    for line in list_of_lines[names_line+1:]:  # first 2 lines are empty, third is column names
        joined = [] # <- this list will accumulate *unique* values to become a single cell
        for idx in col_names[k]:
            el = line[idx]
            if joined and joined[-1]==el:   # if value already exist, skip
                continue
            joined.append(el)   # add unique value to cell
        res.append(''.join(joined))   # add cell to column
    storage[k] = res   # add column to storage
df = pd.DataFrame(storage)
print(df)

这将产生以下结果,非常接近原始结果:

This will produce the following result, which is very close to original:

                                                                                                        Q1`17                   Q2`17                   Q3`17                   Q4`17                 FY 2017                   Q1`18
0                                                                                      (Dollars in thousands)  (Dollars in thousands)  (Dollars in thousands)  (Dollars in thousands)  (Dollars in thousands)  (Dollars in thousands)
1                                                                                                 (Unaudited)             (Unaudited)             (Unaudited)             (Unaudited)             (Unaudited)             (Unaudited)
2                                                                    Customer metrics                                                                                                                                                
3                                                               Customer accounts (1)                 57,000+                 61,000+                 65,000+                 70,000+                 70,000+                 74,000+
4                                               Customer accounts added in period (1)                  3,300+                  4,000+                  4,100+                  4,700+                 16,100+                  3,900+
5                                                     Deals greater than $100,000 (2)                     294                     372                     337                     590                   1,593                     301
6   Customer accounts that purchased greater than $1 million during the quarter (1,2)                      10                      15                      13                      27                                              13
7                                                                                                                                                                                                                                    
8                                                    Annual recurring revenue metrics                                                                                                                                                
9                                                  Total annual recurring revenue (3)                $439,001                $483,578                $526,211                $596,244                $596,244                $641,946
10                                          Subscription annual recurring revenue (4)                 $71,950                $103,538                $139,210                $195,488                $195,488                $237,533
11                                                                                                                                                                                                                                   
12                                               Geographic revenue metrics - ASC 606                                                                                                                                                
13                                                           United States and Canada                       —                       —                       —                       —                       —                $167,799
14                                                                      International                       —                       —                       —                       —                       —                 $78,408
..                                                                                ...                     ...                     ...                     ...                     ...                     ...                     ...
23                                                                                                                                                                                                                                   
24                                               Additional revenue metrics - ASC 606                                                                                                                                                
25                                              Remaining performance obligations (5)                       —                       —                       —                       —                 $99,580                $114,523
26                                                                                                                                                                                                                                   
27                                               Additional revenue metrics - ASC 605                                                                                                                                                
28                                          Ratable revenue as % of total revenue (6)                     54%                     56%                     63%                     60%                     59%                     72%
29                          Ratable license revenue as % of total license revenue (7)                     19%                     23%                     34%                     34%                     28%                     54%
30                   Services revenues as a % of maintenance and services revenue (8)                     12%                     13%                     12%                     13%                     13%                     11%
31                                                                                                                                                                                                                                   
32                                                         Bookings metrics - ASC 605                                                                                                                                                
33                                        Ratable bookings as % of total bookings (2)                     55%                     61%                     65%                     70%                     64%                     72%
34                        Ratable license bookings as % of total license bookings (2)                     26%                     37%                     45%                     51%                     41%                     59%
35                                                                                                                                                                                                                                   
36                                                                      Other metrics                                                                                                                                                
37                                                                Worldwide employees                   3,193                   3,305                   3,418                   3,489                   3,489                   3,663

这篇关于pandas read_html 在读取之前或之后进行清理的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆