Python - How to split a cell in a column into new rows based on a delimiter


Problem description

I'm relatively new to Python and am trying to split some data from a CSV file. I want to parse this data and split a cell into new rows whenever a specific delimiter appears. The delimiters are '.', ';' and '#'. There are no spaces in COL_C. The order of the delimiters doesn't matter: whenever one of them is found, a new row should be created.

Here is the example data:

COL_A | COL_B | COL_C
----------------------
Hello | World | Hi.Can;You#Help


The output I'm trying to get would be:

COL_A | COL_B | COL_C
----------------------
Hello | World | Hi
Hello | World | Can
Hello | World | You
Hello | World | Help


Example 2:

COL_A | COL_B | COL_C
----------------------
Hello | World | Hi#123;move
New   | line  | Can.I#parse;this.data




The output I'm trying to get would be:

COL_A | COL_B | COL_C
----------------------
Hello | World | Hi
Hello | World | 123
Hello | World | move
New   | line  | Can
New   | line  | I
New   | line  | parse
New   | line  | this
New   | line  | data



If this data set had another row whose first two columns were not Hello World but, say, world hello, I would like that row displayed as well, with its third column's data split out into new rows in the same way.

Thanks!

Recommended answer

A combination of speed and elegance

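import numpy as np
import pandas as pd

def pir(df, c):
    # Split each cell of column c on '.', ';' or '#', giving a Series of lists
    colc = df[c].str.split(r'\.|;|#')
    clst = colc.values.tolist()
    lens = [len(l) for l in clst]

    # One row per piece, labelled with the original row's index
    cdf = pd.DataFrame({c: np.concatenate(clst)}, index=df.index.repeat(lens))
    # Joining on the index repeats the other columns for every piece
    return df.drop(columns=c).join(cdf).reset_index(drop=True)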
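
On newer pandas versions (0.25 or later, where DataFrame.explode exists), the same reshaping can probably be written more briefly. The following is a minimal sketch of that alternative, not the answer's original method; the function name split_explode and the inline sample frame are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample frame mirroring the question's second example
df = pd.DataFrame({
    'COL_A': ['Hello', 'New'],
    'COL_B': ['World', 'line'],
    'COL_C': ['Hi#123;move', 'Can.I#parse;this.data'],
})

def split_explode(df, c, pattern=r'[.;#]'):
    """Split column c on any delimiter in pattern, one row per piece."""
    # A multi-character pattern is treated as a regular expression by default
    pieces = df[c].str.split(pattern)
    # explode() turns each list element into its own row and repeats
    # the values of the other columns
    return df.assign(**{c: pieces}).explode(c).reset_index(drop=True)

result = split_explode(df, 'COL_C')
print(result)
```

This version stays entirely within pandas, so it is likely closer to pir than to pir2 in speed; it was not part of the timings reported in this answer.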

Forget elegance, give me speed!

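def pir2(df, c):
    colc = df[c].str.split(r'\.|;|#')
    clst = colc.values.tolist()
    lens = [len(l) for l in clst]
    j = df.columns.get_loc(c)  # position of the column being split
    v = df.values
    n, m = v.shape
    # Row positions, each repeated once per piece in that row
    r = np.arange(n).repeat(lens)
    # Rebuild the frame in NumPy: columns before c, the pieces, columns after c
    return pd.DataFrame(
        np.column_stack([v[r, 0:j], np.concatenate(clst), v[r, j+1:]]),
        columns=df.columns
    )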
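
One thing to watch with pir2: because it rebuilds the frame from df.values, a frame with mixed column types comes back with every column as object dtype. The sketch below illustrates this on a hypothetical frame with a numeric ID column (not part of the question's data) and shows one possible way to cast the untouched columns back; pir2 is repeated only so the block runs on its own.

```python
import numpy as np
import pandas as pd

def pir2(df, c):
    # Same logic as the answer's pir2, repeated so this block is self-contained
    clst = df[c].str.split(r'\.|;|#').values.tolist()
    lens = [len(l) for l in clst]
    j = df.columns.get_loc(c)
    v = df.values
    r = np.arange(len(df)).repeat(lens)
    return pd.DataFrame(
        np.column_stack([v[r, 0:j], np.concatenate(clst), v[r, j+1:]]),
        columns=df.columns
    )

# Hypothetical frame with a numeric column, used only for illustration
demo = pd.DataFrame({'ID': [1, 2], 'COL_C': ['Hi#123', 'Can.I']})
out = pir2(demo, 'COL_C')
print(out.dtypes)  # ID is no longer an integer column here

# Cast every column except the split one back to its original dtype
original = {k: t for k, t in demo.dtypes.items() if k != 'COL_C'}
restored = out.astype(original)
print(restored.dtypes)
```

Whether the extra cast is worth it depends on the data; for an all-string frame like the one in the question it is not needed.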


pir(df, 'COL_C')
# pir2(df, 'COL_C')

   COL_A  COL_B  COL_C
0  Hello  World     Hi
1  Hello  World    123
2  Hello  World   move
3    New   line    Can
4    New   line      I
5    New   line  parse
6    New   line   this
7    New   line   data


Timing

%timeit pir(df, 'COL_C')
1000 loops, best of 3: 1.42 ms per loop

%timeit pir2(df, 'COL_C')
1000 loops, best of 3: 278 µs per loop

%timeit split_list_in_cols_to_rows(df.assign(COL_C=df.COL_C.str.split(r'[.,;#]')), lst_cols='COL_C')
100 loops, best of 3: 4.16 ms per loop

%%timeit
COL_C2 = df.COL_C.str.split(r'\.|;|#').apply(pd.Series).stack()
df.drop(columns='COL_C').join(pd.Series(index=COL_C2.index.droplevel(1), data=COL_C2.values, name='COL_C')).reset_index(drop=True)
100 loops, best of 3: 2.81 ms per loop

Setup

from io import StringIO
import pandas as pd

txt = """COL_A | COL_B | COL_C
Hello | World | Hi#123;move
New   | line  | Can.I#parse;this.data"""

df = pd.read_csv(StringIO(txt), sep=r'\s*\|\s*', engine='python')
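
Since the question starts from a CSV file, the same expansion can also be sketched with only the standard library, in case pandas is not an option. This is an illustrative alternative rather than part of the answer: the helper name split_rows, the comma-separated layout, and the in-memory sample text are all assumptions, and a real script would read from an opened file instead.

```python
import csv
import io
import re

# Hypothetical CSV content; in practice this would come from
# open('your_file.csv', newline='')
raw = (
    "COL_A,COL_B,COL_C\n"
    "Hello,World,Hi#123;move\n"
    "New,line,Can.I#parse;this.data\n"
)

def split_rows(rows, column, pattern=r'[.;#]'):
    """Yield one copy of each row per delimiter-separated piece of column."""
    for row in rows:
        for piece in re.split(pattern, row[column]):
            # Copy the row and overwrite only the split column
            yield {**row, column: piece}

reader = csv.DictReader(io.StringIO(raw))
expanded = list(split_rows(reader, 'COL_C'))

# Write the expanded rows back out as CSV text
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
writer.writeheader()
writer.writerows(expanded)
print(out.getvalue())
```

Because split_rows is a generator, it processes one input row at a time, which may help with files too large to load at once.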
