列标题中的多个定界符还分隔行值 [英] Multiple delimiters in column headers also separates the row values

查看:110
本文介绍了列标题中的多个定界符还分隔行值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在读取文件时定义多个分隔符存在一些问题.最初在我以前的帖子中解决了该问题

I had a some issue about defining multiple seperator when reading a file. It is originally solved in my previous post reading-files-with-multiple-delimiter-in-column-headers-and-skipping-some-rows thanks to @piRsquared

当我详细查看我的真实数据时,我 意识到有些列具有.cd或.dvd扩展名,当我应用上面的解决方案时,它们也被分隔为新列,并且上面的解决方案开始无法正常工作!

When I looked in detail to my real data I realized that some columns have .cd or .dvd extensions and when I applied the solution above they are also separated as a new column and the solution above started not to work!

b.txt

skip1
 A1| A2 |A3 |A4# A5# A6 A7| A8 , A9
1,2,3,4,5.cd,6,7,8.dvd,9
1,2,3,4,5.cd,6,7,8.dvd,9
1,2,3,4,5.cd,6,7,8.dvd,9

END123
Some other data starts from here

并使用 A5列应具有行

5.cd
5.cd
5.cd

与A9列相同

8.dvd
8.dvd
8.dvd

我们应该有A9列,但似乎由于这种冲突而消失了.

and we should have A9 column but seems that it disappears due to this conflict.

我将真实身份与我的真实数据相类似

I put almost similar identity to my real data

 skip rows
 A1| A2| A3|A4# A5#  |  A6 | A7  , A8,  A9  | A10 |
 1 | 2 | 3 |4 # 5 #  | 6.cd|7.dvd,   ,      | 10  | 
 1 | 2 | 3 |4 # 5 #  | 6.cd|     ,   ,   9  | 10  |
 1 | 2 | 3 |4 # 5 #  |     |7.dvd,   ,      | 10  |

END123
Some other data starts from here

并尝试

txt = open('real_dat.txt').read().split('\nEND')[0]
_, h, txt = txt.split('\n', 2)
pat = r'[\|, ,#,\,]+'
names = re.split(pat, h.strip())

df=pd.read_csv(
    pd.io.common.StringIO(txt),
    names=names,skiprows=1,index_col=False,
    engine='python')

并获得此输出!

推荐答案

更新后的答案
摆脱空间更容易...让我知道这是否可行

Updated Answer
It was just easier to get rid of spaces... Let me know if this works

txt = open('b.txt').read().split('\nEND')[0] \
    .replace(' ', '').replace('|\n', '\n').split('\n', 1)[1]

pd.read_csv(
    pd.io.common.StringIO(txt),
    sep=r'#\||\||#|,',
    engine='python')

   A1  A2  A3  A4  A5    A6     A7  A8   A9  A10
0   1   2   3   4   5  6.cd  7.dvd NaN  NaN   10
1   1   2   3   4   5  6.cd    NaN NaN  9.0   10
2   1   2   3   4   5   NaN  7.dvd NaN  NaN   10

旧答案

我使用\W+作为解析您显示内容的快速简便的方法.下面,我使用了一些特定于您所需的实际定界符的东西.

I used \W+ as a fast and easy way to parse what you showed. Below I used something more specific to the actual delimiters you need.

txt = open('b.txt').read().split('\nEND')[0]
pd.read_csv(
    pd.io.common.StringIO(txt),
    sep=r'[\|, ,#,\,]+',
    skiprows=1,index_col=False, engine='python')

   A1  A2  A3  A4    A5  A6  A7     A8  A9
0   1   2   3   4  5.cd   6   7  8.dvd   9
1   1   2   3   4  5.cd   6   7  8.dvd   9
2   1   2   3   4  5.cd   6   7  8.dvd   9


但是,我仍然认为这是一种更清洁的方法.在这里,我将头文件的解析与其余数据的解析分开.这样,我假设数据应仅使用,作为分隔符.


However, I still think this is a cleaner way to do it. Here, I separate the parsing of the header from the parsing of the rest of the data. That way, I assume the data should only be using , as a separator.

txt = open('b.txt').read().split('END')[0]
_, h, txt = txt.split('\n', 2)
pat = r'[\|, ,#,\,]+'
names = re.split(pat, h.strip())

pd.read_csv(
    pd.io.common.StringIO(txt),
    names=names, header=None,
    engine='python')

   A1  A2  A3  A4    A5  A6  A7     A8  A9
0   1   2   3   4  5.cd   6   7  8.dvd   9
1   1   2   3   4  5.cd   6   7  8.dvd   9
2   1   2   3   4  5.cd   6   7  8.dvd   9

这篇关于列标题中的多个定界符还分隔行值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆