Parse / extract a table from a messy .csv file?
Problem description
I am parsing an image (PNG) with Amazon Textract and extracting the tables.
Here is an example of such a CSV when I open it with open(file_name, "r")
and read its lines:
['Table: Table_1\n',
'\n',
'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
'ALBUMIN ,4.3 ,,3.6-5.1 g/dL ,EN ,\n',
'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
'ALBUMIN/GLOBULIN RATIO ,2.4 ,,1.0-2.5 (calc) ,EN ,\n',
'BILIRUBIN, TOTAL ,0.6 ,,0.2-1.2 mg/dL ,EN ,\n',
'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
'BILIRUBIN, INDIRECT ,0.4 ,,0.2-1.2 mg/dL (calc) ,EN ,\n',
'ALKALINE PHOSPHATASE ,61 ,,40-115 U/L ,EN ,\n',
'AST ,27 ,,10-35 U/L ,EN ,\n',
'ALT ,19 ,,9-46 U/L ,EN ,\n',
'\n',
'\n',
'\n',
'\n',
'\n']
I can read it with pandas read_csv, but I get errors (the file always arrives in a different format: more or fewer spaces, different first lines before the header row).
Please advise how to extract the table from such CSVs.
Recommended answer
I would suggest curating your data first and then loading the curated rows into pandas as a list of lists. The problem I found with your sample is that the first field contains commas, which interfere with CSV parsing because the comma is also the field separator. The data therefore needs to be curated before it is split. My Python 3 source code is below:
data = ['Table: Table_1\n',
'\n',
'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
'ALBUMIN ,4.3 ,,3.6-5.1 g/dL ,EN ,\n',
'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
'ALBUMIN/GLOBULIN RATIO ,2.4 ,,1.0-2.5 (calc) ,EN ,\n',
'BILIRUBIN, TOTAL ,0.6 ,,0.2-1.2 mg/dL ,EN ,\n',
'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
'BILIRUBIN, INDIRECT ,0.4 ,,0.2-1.2 mg/dL (calc) ,EN ,\n',
'ALKALINE PHOSPHATASE ,61 ,,40-115 U/L ,EN ,\n',
'AST ,27 ,,10-35 U/L ,EN ,\n',
'ALT ,19 ,,9-46 U/L ,EN ,\n',
'\n',
'\n',
'\n',
'\n',
'\n']
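import re
import pandas as pd

# Drop the trailing newline from every raw line.
lines = [x.replace('\n', '') for x in data]

# Match the first field of a data row: upper-case words (and '/'), optionally
# followed by commas and more upper-case words, up to the separator comma.
# The title line, the header row (lower-case letters) and blank lines
# do not match, so they are skipped.
p = re.compile(r'^[/A-Z ]+[,]*[/A-Z ]*,')

curated_lines = []
for l in lines:
    m = p.search(l)
    if m is not None:
        s = m.group(0)
        cs = s.replace(',', '')          # remove every comma in the matched prefix
        cl = l.replace(s, cs + ',', 1)   # put back the single separator comma
        curated_lines.append(cl)

# Split on commas, drop the empty element after the trailing comma,
# and strip the padding spaces around each value.
frame_list_of_list = [[f.strip() for f in l.split(',')[:-1]] for l in curated_lines]

df = pd.DataFrame(frame_list_of_list,
                  columns=['Test Name', 'Result', 'Flag', 'Reference Range', 'Lab'])
print(df)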
This produces the following result:
   Test Name                             Result  Flag  Reference Range       Lab
0  HEPATIC FUNCTION PANEL PROTEIN TOTAL  6.1           6.1-8.1 g/dL          EN
1  ALBUMIN                               4.3           3.6-5.1 g/dL          EN
2  GLOBULIN                              1.8     LOW   1.9-3.7 g/dL (calc)   EN
3  ALBUMIN/GLOBULIN RATIO                2.4           1.0-2.5 (calc)        EN
4  BILIRUBIN TOTAL                       0.6           0.2-1.2 mg/dL         EN
5  BILIRUBIN DIRECT                      0.2           < OR = 0.2 mg/dL      EN
6  BILIRUBIN INDIRECT                    0.4           0.2-1.2 mg/dL (calc)  EN
7  ALKALINE PHOSPHATASE                  61            40-115 U/L            EN
8  AST                                   27            10-35 U/L             EN
9  ALT                                   19            9-46 U/L              EN
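Since the question mentions that the lines before the header row and the amount of padding vary between files, a variant that does not depend on a regular expression for the test names may be worth considering. The sketch below is only one possible approach: it assumes that the header row can be recognised by its first column name, that every row ends with a trailing comma, and that only the first column ever contains embedded commas. The function name `parse_textract_table` and its `first_header` parameter are hypothetical names chosen for illustration; they are not part of pandas or Textract.

```python
import pandas as pd


def parse_textract_table(lines, first_header="Test Name"):
    """Hypothetical helper: build a DataFrame from raw Textract CSV lines."""
    # Remove surrounding whitespace/newlines and drop blank lines.
    rows = [ln.strip() for ln in lines if ln.strip()]

    # Locate the header row instead of assuming a fixed position.
    start = next((i for i, ln in enumerate(rows)
                  if ln.startswith(first_header)), None)
    if start is None:
        raise ValueError("header row starting with %r not found" % first_header)

    def drop_trailing_comma(ln):
        # Assumption: each row ends with exactly one extra comma.
        return ln[:-1] if ln.endswith(",") else ln

    header = [h.strip() for h in drop_trailing_comma(rows[start]).split(",")]
    n_right = len(header) - 1  # number of columns after the first one

    records = []
    for ln in rows[start + 1:]:
        # Split from the right so commas inside the first field are preserved.
        parts = drop_trailing_comma(ln).rsplit(",", n_right)
        if len(parts) != len(header):
            continue  # skip rows that do not fit the header
        records.append([p.strip() for p in parts])

    return pd.DataFrame(records, columns=header)


sample = [
    'Table: Table_1\n',
    '\n',
    'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
    'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
    'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
    '\n',
]
df = parse_textract_table(sample)
print(df)
```

Unlike the regex version, this keeps the comma inside names such as "HEPATIC FUNCTION PANEL PROTEIN, TOTAL" and takes the column names from the file itself. It would mis-split a row if a column other than the first contained a comma (for example a reference range written as "1,000-2,000"), so that assumption should be checked against real files.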