从混乱的.csv文件中解析/提取表? [英] Parse / Extract table from a messed .csv file?

查看:157
本文介绍了从混乱的.csv文件中解析/提取表?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用Amazon Textract解析图像(png)并提取表. 这是当我使用open(file_name, "r")打开并读取其行时的此类csv的示例:

I am parsing an image (png) with Amazon Textract and extracting the tables. Here is an example of such csv when I open it with open(file_name, "r") and reading it's lines:

['Table: Table_1\n',
 '\n',
 'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
 'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
 'ALBUMIN ,4.3 ,,3.6-5.1 g/dL ,EN ,\n',
 'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
 'ALBUMIN/GLOBULIN RATIO ,2.4 ,,1.0-2.5 (calc) ,EN ,\n',
 'BILIRUBIN, TOTAL ,0.6 ,,0.2-1.2 mg/dL ,EN ,\n',
 'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
 'BILIRUBIN, INDIRECT ,0.4 ,,0.2-1.2 mg/dL (calc) ,EN ,\n',
 'ALKALINE PHOSPHATASE ,61 ,,40-115 U/L ,EN ,\n',
 'AST ,27 ,,10-35 U/L ,EN ,\n',
 'ALT ,19 ,,9-46 U/L ,EN ,\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n']

我可以用pandas read_csv读取它,但出现错误(它总是以不同的格式出现-或多或少的空格,标题前的第一行不同). 请告知如何从此类csv中提取表格?

I can read it with pandas read_csv but I am getting errors (it's always come as different format - more or less spaces, different first lines before the titles). Please advise how to extract the table from such csv's?

推荐答案

我建议整理您的数据,将整理后的数据作为列表列表插入到Pandas中.我在您的样本中发现的问题是,在第一个字段中,它包含逗号,它们也干扰了CSV解析,并且还通过逗号分隔符起作用.因此,需要对数据进行管理. 请在下面找到我的Python 3源代码:

I would suggest to curate your data, inserting curated data onto Pandas as list of list. The problem I've found with your sample is that, in the first field, it contains comas, which interfere with CSV parsing, working by coma separator as well. Thus, a curations of the data is required. Please, find my source code for Python 3 below:

data = ['Table: Table_1\n',
        '\n',
        'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
        'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
        'ALBUMIN ,4.3 ,,3.6-5.1 g/dL ,EN ,\n',
        'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
        'ALBUMIN/GLOBULIN RATIO ,2.4 ,,1.0-2.5 (calc) ,EN ,\n',
        'BILIRUBIN, TOTAL ,0.6 ,,0.2-1.2 mg/dL ,EN ,\n',
        'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
        'BILIRUBIN, INDIRECT ,0.4 ,,0.2-1.2 mg/dL (calc) ,EN ,\n',
        'ALKALINE PHOSPHATASE ,61 ,,40-115 U/L ,EN ,\n',
        'AST ,27 ,,10-35 U/L ,EN ,\n',
        'ALT ,19 ,,9-46 U/L ,EN ,\n',
        '\n',
        '\n',
        '\n',
        '\n',
        '\n']



lines  = [x.replace('\n','') for x in data]

import re
p = re.compile('^[/A-Z ]+[,]*[/A-Z ]*,')
curated_lines = []
for l in lines:
    m = p.search(l)
    if m != None:
        s   = m.group(0)
        cs  = s.replace(',','')
        cl  = l.replace(s,cs+',')
        curated_lines.append(cl)

frame_list_of_list = [l.split(',')[:-1] for l in curated_lines]

import pandas as pd
df = pd.DataFrame(frame_list_of_list,columns=['Test Name','Result','Flag','Reference Range','Lab'])
print(df)

将产生以下结果:

                           Test Name Result  Flag        Reference Range  Lab
0  HEPATIC FUNCTION PANEL PROTEIN TOTAL    6.1                 6.1-8.1 g/dL   EN 
1                               ALBUMIN    4.3                 3.6-5.1 g/dL   EN 
2                              GLOBULIN    1.8   LOW    1.9-3.7 g/dL (calc)   EN 
3                ALBUMIN/GLOBULIN RATIO    2.4               1.0-2.5 (calc)   EN 
4                       BILIRUBIN TOTAL    0.6                0.2-1.2 mg/dL   EN 
5                      BILIRUBIN DIRECT    0.2             < OR = 0.2 mg/dL   EN 
6                    BILIRUBIN INDIRECT    0.4         0.2-1.2 mg/dL (calc)   EN 
7                  ALKALINE PHOSPHATASE     61                   40-115 U/L   EN 
8                                   AST     27                    10-35 U/L   EN 
9                                   ALT     19                     9-46 U/L   EN 

这篇关于从混乱的.csv文件中解析/提取表?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆