用 pandas 解析漂亮打印的表格数据 [英] Parse prettyprinted tabular data with pandas

查看:129
本文介绍了用 pandas 解析漂亮打印的表格数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

复制包含不同分隔符,列名称中的空格等的表的最佳方法是什么.函数pd.read_clipboard()无法独自管理此任务.

What is the best way to copy a table that contains different delimeters, spaces in column names etc. The function pd.read_clipboard() cannot manage this task on its own.

示例1:

| Age Category | A | B  | C  | D |
|--------------|---|----|----|---|
| 21-26        | 2 | 2  | 4  | 1 |
| 26-31        | 7 | 11 | 12 | 5 |
| 31-36        | 3 | 5  | 5  | 2 |
| 36-41        | 2 | 4  | 1  | 7 |
| 41-46        | 0 | 1  | 3  | 2 |
| 46-51        | 0 | 0  | 2  | 3 |

预期结果:

 Age Category  A  B   C   D    
 21-26         2  2   4   1 
 26-31         7  11  12  5 
 31-36         3  5   5   2 
 36-41         2  4   1   7 
 41-46         0  1   3   2 
 46-51         0  0   2   3

示例2:

+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+

预期结果:

   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson

我正在寻找一种可以应用于最常见的表类型的通用方法.

I look for a universal approach that can be applied to the most common table types.

推荐答案

一种选择是硬着头皮对数据进行预处理.这还不算太糟糕,pd.read_csv只能在其参数中处理很多情况,如果您想穷尽所有要处理的情况,最终将不得不转向正则表达式.

One option is to bite the bullet and just preprocess your data. This isn't all that bad, there's only so many cases pd.read_csv can handle in its arguments, and if you want to be exhaustive with the cases you handle you'll eventually end up turning to regex.

要处理大多数漂亮表的常见情况,我只需编写一个循环来过滤/替换行中的字符,然后使用相对简单的read_csv调用读入输出.

To handle most of the common cases of prettyprinted tables, I'd just write a loop to filter out/replace characters in lines, then read in the output using a relatively simpler read_csv call.

import os 

def load(filename):
    with open(filename) as fin, open('temp.txt', 'w') as fout:
        for line in fin:
            if not line.strip()[:2] in {'|-', '+-'}: # filter step
                fout.write(line.strip().strip('|').replace('|', ',')+'\n')

    df = pd.read_csv('temp.txt', sep=r'\s*,\s*', engine='python')
    os.unlink('temp.txt') # cleanup

    return df

df1 = load('data1.txt')
df2 = load('data2.txt')

df1

  Age Category  A   B   C
0        21-26  2   2   4
1        26-31  7  11  12
2        31-36  3   5   5
3        36-41  2   4   1
4        41-46  0   1   3
5        46-51  0   0   2

df2

   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson

这篇关于用 pandas 解析漂亮打印的表格数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆