用 pandas 解析漂亮打印的表格数据 [英] Parse prettyprinted tabular data with pandas
问题描述
复制包含不同分隔符,列名称中的空格等的表的最佳方法是什么.函数pd.read_clipboard()
无法独自管理此任务.
What is the best way to copy a table that contains different delimeters, spaces in column names etc. The function pd.read_clipboard()
cannot manage this task on its own.
示例1:
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
预期结果:
Age Category A B C D
21-26 2 2 4 1
26-31 7 11 12 5
31-36 3 5 5 2
36-41 2 4 1 7
41-46 0 1 3 2
46-51 0 0 2 3
示例2:
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
| 1| Mark| Brown|
| 2| Tom|Anderson|
| 3| Joshua|Peterson|
+---+---------+--------+
预期结果:
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
我正在寻找一种可以应用于最常见的表类型的通用方法.
I look for a universal approach that can be applied to the most common table types.
推荐答案
一种选择是硬着头皮对数据进行预处理.这还不算太糟糕,pd.read_csv
只能在其参数中处理很多情况,如果您想穷尽所有要处理的情况,最终将不得不转向正则表达式.
One option is to bite the bullet and just preprocess your data. This isn't all that bad, there's only so many cases pd.read_csv
can handle in its arguments, and if you want to be exhaustive with the cases you handle you'll eventually end up turning to regex.
要处理大多数漂亮表的常见情况,我只需编写一个循环来过滤/替换行中的字符,然后使用相对简单的read_csv
调用读入输出.
To handle most of the common cases of prettyprinted tables, I'd just write a loop to filter out/replace characters in lines, then read in the output using a relatively simpler read_csv
call.
import os
def load(filename):
with open(filename) as fin, open('temp.txt', 'w') as fout:
for line in fin:
if not line.strip()[:2] in {'|-', '+-'}: # filter step
fout.write(line.strip().strip('|').replace('|', ',')+'\n')
df = pd.read_csv('temp.txt', sep=r'\s*,\s*', engine='python')
os.unlink('temp.txt') # cleanup
return df
df1 = load('data1.txt')
df2 = load('data2.txt')
df1
Age Category A B C
0 21-26 2 2 4
1 26-31 7 11 12
2 31-36 3 5 5
3 36-41 2 4 1
4 41-46 0 1 3
5 46-51 0 0 2
df2
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
这篇关于用 pandas 解析漂亮打印的表格数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!