Python - How to split a cell in a column into new rows based on a delimiter
Problem description
I'm relatively new to Python and am trying to split some data from a CSV file. I want to parse the data and split a cell into new rows wherever a specific delimiter appears. The delimiters are '.', ';' and '#'. There are no spaces in COL_C. The order of the delimiters doesn't matter: whenever one of them is found, a new row should be created.
Here is the example data:
COL_A | COL_B | COL_C
----------------------
Hello | World | Hi.Can;You#Help
The output I'm trying to get would be:
COL_A | COL_B | COL_C
----------------------
Hello | World | Hi
Hello | World | Can
Hello | World | You
Hello | World | Help
Example 2:
COL_A | COL_B | COL_C
----------------------
Hello | World | Hi#123;move
New | line | Can.I#parse;this.data
The output I'm trying to get would be:
COL_A | COL_B | COL_C
----------------------
Hello | World | Hi
Hello | World | 123
Hello | World | move
New | line | Can
New | line | I
New | line | parse
New | line | this
New | line | data
If the data set had another row with different values in the first two columns (say, world hello instead of Hello World), I would like that row handled the same way, with its third column's data split out into new rows.
Thanks!
Recommended answer
A combination of speed and elegance
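def pir(df, c):
    # Split each cell of column c on '.', ';' or '#'
    colc = df[c].str.split(r'\.|;|#')
    clst = colc.values.tolist()
    # Number of pieces produced by each original row
    lens = [len(l) for l in clst]
    # One row per piece, labelled with the original row's index
    cdf = pd.DataFrame({c: np.concatenate(clst)}, df.index.repeat(lens))
    # drop(columns=...) replaces the positional axis argument removed in pandas 2.0
    return df.drop(columns=c).join(cdf).reset_index(drop=True)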
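To see why this works, it may help to look at the intermediate values on a tiny frame. The following is only an illustrative sketch: the two-row frame `demo` is made up for the example and is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, used only for illustration.
demo = pd.DataFrame({"COL_A": ["Hello", "New"],
                     "COL_C": ["Hi#123", "Can.I;parse"]})

# Step 1: split each cell into a list of pieces.
pieces = demo["COL_C"].str.split(r"\.|;|#").tolist()
print(pieces)          # [['Hi', '123'], ['Can', 'I', 'parse']]

# Step 2: count the pieces per row.
lens = [len(p) for p in pieces]
print(lens)            # [2, 3]

# Step 3: repeat each original index label once per piece.
repeated_index = demo.index.repeat(lens).tolist()
print(repeated_index)  # [0, 0, 1, 1, 1]

# Step 4: flatten the pieces into one array that lines up with that index.
flat = np.concatenate(pieces).tolist()
print(flat)            # ['Hi', '123', 'Can', 'I', 'parse']
```

Because the flattened pieces carry the repeated index labels, joining them back onto the remaining columns duplicates each original row once per piece.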
Forget elegance, give me speed!
def pir2(df, c):
    colc = df[c].str.split(r'\.|;|#')
    clst = colc.values.tolist()
    lens = [len(l) for l in clst]
    j = df.columns.get_loc(c)
    v = df.values
    n = v.shape[0]
    # Row positions, each repeated once per piece
    r = np.arange(n).repeat(lens)
    # Rebuild the frame: columns before c, the flattened pieces, columns after c
    return pd.DataFrame(
        np.column_stack([v[r, 0:j], np.concatenate(clst), v[r, j+1:]]),
        columns=df.columns
    )
pir(df, 'COL_C')
# pir2(df, 'COL_C')
   COL_A  COL_B  COL_C
0  Hello  World     Hi
1  Hello  World    123
2  Hello  World   move
3    New   line    Can
4    New   line      I
5    New   line  parse
6    New   line   this
7    New   line   data
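On pandas 0.25 or later, the same reshaping can probably be written more compactly with `Series.str.split` followed by `DataFrame.explode`. This is a different technique from the two functions above, not the answerer's method, and it is not covered by the timings in this answer. The helper name `split_rows` and its `pattern` parameter are hypothetical.

```python
import pandas as pd

def split_rows(df, col, pattern=r"[.;#]"):
    """Split `col` on any delimiter matched by `pattern`, one row per piece."""
    return (
        df.assign(**{col: df[col].str.split(pattern)})  # cells become lists
          .explode(col)                                 # one row per list item
          .reset_index(drop=True)
    )

# Same data as the question's second example.
df = pd.DataFrame({
    "COL_A": ["Hello", "New"],
    "COL_B": ["World", "line"],
    "COL_C": ["Hi#123;move", "Can.I#parse;this.data"],
})

result = split_rows(df, "COL_C")
print(result)
```

Unlike `pir2`, this keeps the original dtypes of the untouched columns, at the cost of going through pandas' generic `explode` machinery.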
Timing
%timeit pir(df, 'COL_C')
1000 loops, best of 3: 1.42 ms per loop
%timeit pir2(df, 'COL_C')
1000 loops, best of 3: 278 µs per loop
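# split_list_in_cols_to_rows is not defined in this answer; it has to be defined separately before this line will run
%timeit split_list_in_cols_to_rows(df.assign(COL_C=df.COL_C.str.split(r'[.,;#]')), lst_cols='COL_C')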
100 loops, best of 3: 4.16 ms per loop
%%timeit
COL_C2 = df.COL_C.str.split(r'\.|;|#').apply(pd.Series).stack()
df.drop(columns='COL_C').join(pd.Series(index=COL_C2.index.droplevel(1), data=COL_C2.values, name='COL_C')).reset_index(drop=True)
100 loops, best of 3: 2.81 ms per loop
Setup
from io import StringIO
import numpy as np
import pandas as pd

txt = """COL_A | COL_B | COL_C
Hello | World | Hi#123;move
New | line | Can.I#parse;this.data"""

# Columns are separated by '|' with optional surrounding whitespace
df = pd.read_csv(StringIO(txt), sep=r'\s*\|\s*', engine='python')