How to parse unstructured table-like data?

Question

I have a text file that holds some result of an operation. The data is displayed in a human-readable format (like a table). How do I parse this data so that I can form a data structure such as dictionaries with this data?

An example of the unstructured data is shown below.

===============================================================
Title
===============================================================
Header     Header Header Header  Header       Header
1          2      3      4       5            6                   
---------------------------------------------------------------
1          Yes    No     6       0001 0002    True    
2          No     Yes    7       0003 0004    False    
3          Yes    No     6       0001 0001    True    
4          Yes    No     6       0001 0004    False    
4          No     No     4       0004 0004    True    
5          Yes    No     2       0001 0001    True    
6          Yes    No     1       0001 0001    False    
7          No     No     2       0004 0004    True

The data in the example above is neither tab-separated nor comma-separated. It always has a header, and each row may or may not have a value under a given column.

I have tried basic parsing techniques such as regex and conditional checks, but I need a more robust way to parse this data, because the example above is not the only format in which the data gets rendered.

Update 1: There are many cases apart from the example shown, such as additional columns, or a single cell holding more than one value (displayed on the next line, although it belongs to the previous row).

Is there a Python library that solves this problem?

Can machine learning techniques help with this problem without parsing? If so, what type of problem would it be (classification, regression, clustering)?

===============================================================
Title
===============================================================
Header     Key_1   Header Header  Header       Header
1          Key_2   3      4       5            6                   
---------------------------------------------------------------
1          Value1  No     6       0001 0002    True
           Value2    
2          Value1  Yes    7       0003 0004    False    
           Value2
3          Value1  No     6       0001 0001    True    
           Value2
4          Value1  No     6       0001 0004    False    
           Value2  
5          Value1  No     4       0004 0004    True    
           Value2  
6          Value1  No     2       0001 0001    True    
           Value2  
7          Value1  No     1       0001 0001    False    
           Value2  
8          Value1  No     2       0004 0004    True
           Value2  

Update 2: The second table above is another example of what the data might look like. It involves a single cell holding more than one value (displayed on the next line, although it belongs to the previous row).

Recommended answer

Say your example is 'sample.txt'.

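import pandas as pd

# Skip the title block (rows 0-2), the first header row (3) and the dashed rule (5),
# so that row 4 ("1  2  3 ...") becomes the column header.
# Columns are split on runs of two or more whitespace characters; a regex
# separator needs the Python parsing engine, and a raw string avoids escape warnings.
df = pd.read_table('sample.txt', skiprows=[0, 1, 2, 3, 5], delimiter=r'\s\s+', engine='python')

print(df)
print(df.shape)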

   1    2    3  4          5      6
0  1  Yes   No  6  0001 0002   True
1  2   No  Yes  7  0003 0004  False
2  3  Yes   No  6  0001 0001   True
3  4  Yes   No  6  0001 0004  False
4  4   No   No  4  0004 0004   True
5  5  Yes   No  2  0001 0001   True
6  6  Yes   No  1  0001 0001  False
7  7   No   No  2  0004 0004   True
(8, 6)
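
Since the question asks for dictionaries, here is a minimal sketch of one way to get them from the parsed frame. It is not part of the original answer. It assumes the first table layout from the question, inlines a shortened copy of it through io.StringIO so the snippet runs without a file, and uses DataFrame.to_dict(orient="records") to produce one dictionary per row. The names SAMPLE and records are illustrative.

```python
import io

import pandas as pd

# Shortened copy of the question's first table (illustrative data only).
SAMPLE = """\
===============================================================
Title
===============================================================
Header     Header Header Header  Header       Header
1          2      3      4       5            6
---------------------------------------------------------------
1          Yes    No     6       0001 0002    True
2          No     Yes    7       0003 0004    False
3          Yes    No     6       0001 0001    True
"""

# Same call as in the answer, reading from an in-memory buffer instead of a file.
df = pd.read_table(
    io.StringIO(SAMPLE),
    skiprows=[0, 1, 2, 3, 5],
    sep=r"\s\s+",
    engine="python",
)

# One dictionary per table row, keyed by the column labels "1" .. "6".
records = df.to_dict(orient="records")
print(records[0])
```

The column labels come from the header row, so they are strings ("1", "2", ...) even though they look like numbers.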

You can change the data types, of course. Check the many parameters of pd.read_table(). There are also readers for xlsx, csv, html, sql, json, hdf, and even the clipboard.
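
One more reader that may help with the layout in Update 2 is pd.read_fwf, which slices each line at fixed character positions instead of splitting on whitespace. The following is a rough sketch, not part of the original answer. It assumes the columns always start at the same offsets as in the question's second table; the colspecs values, the column names (id, key, c3 ... c6) and the ", " join are all illustrative choices. A line whose first cell is empty is treated as a continuation of the row above it.

```python
import io

import pandas as pd

# Shortened copy of the question's second table (illustrative data only).
SAMPLE = """\
Header     Key_1   Header Header  Header       Header
1          Key_2   3      4       5            6
---------------------------------------------------------------
1          Value1  No     6       0001 0002    True
           Value2
2          Value1  Yes    7       0003 0004    False
           Value2
"""

# Half-open [start, end) character ranges, measured from the sample above (assumed fixed).
COLSPECS = [(0, 11), (11, 19), (19, 26), (26, 34), (34, 47), (47, 63)]
NAMES = ["id", "key", "c3", "c4", "c5", "c6"]

df = pd.read_fwf(
    io.StringIO(SAMPLE),
    colspecs=COLSPECS,
    names=NAMES,
    header=None,
    skiprows=3,   # two header lines plus the dashed rule
    dtype=str,    # keep every cell as text; empty cells become NaN
)

# A new logical row starts wherever the first column is filled in;
# the running count gives each continuation line the number of its parent row.
row_number = df["id"].notna().cumsum().rename("row")

# Within each logical row, join the non-empty cells of every column.
merged = (
    df.groupby(row_number)
    .agg(lambda cells: ", ".join(cells.dropna()))
    .reset_index(drop=True)
)

records = merged.to_dict(orient="records")
print(records[0])
```

Grouping on a running count rather than on the id value itself keeps rows apart even when an id repeats, as the value 4 does in the question's first table. If the column offsets vary between files, the fixed colspecs would have to be derived from the header or rule line first.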

Welcome to pandas ...
