How to split a dataframe column into multiple columns
Problem description
After much prodding, I am starting to migrate my R scripts to Python. Most of my work in R involved data frames, and I am using the DataFrame object from the pandas package. In my script I need to read in a CSV file and import the data into a DataFrame object. Next I need to convert the hex values in the column labelled DATA into bitwise data, and then create 16 new columns, one for each bit.
My example input data in file test.txt looks as follows:
PREFIX,TEST,ZONE,ROW,COL,DATA
6_6,READ,0, 0, 0,BFED
6_6,READ,0, 1, 0,BB7D
6_6,READ,0, 2, 0,FFF7
6_6,READ,0, 3, 0,E7FF
6_6,READ,0, 4, 0,FBF8
6_6,READ,0, 5, 0,DE75
6_6,READ,0, 6, 0,DFFE
My Python script test.py is as follows:
import glob
import pandas as pd
import numpy as np

fname = 'test.txt'
df = pd.read_csv(fname, comment="#")
dfs = df[df.TEST == 'READ']

# function to convert the hexstring into a binary string
def hex2bin(hstr):
    return bin(int(hstr, 16))[2:]

# convert the hexstring in column DATA to the binary string column BINDATA
dfs['BINDATA'] = dfs['DATA'].apply(hex2bin)

# get rid of the column DATA
del dfs['DATA']
When I run this script and inspect the object dfs, I get the following:
PREFIX TEST ZONE ROW COL BINDATA
0 6_6 READ 0 0 0 1011111111101101
1 6_6 READ 0 1 0 1011101101111101
2 6_6 READ 0 2 0 1111111111110111
3 6_6 READ 0 3 0 1110011111111111
4 6_6 READ 0 4 0 1111101111111000
5 6_6 READ 0 5 0 1101111001110101
6 6_6 READ 0 6 0 1101111111111110
So now I am not sure how to split the column named BINDATA into 16 new columns (could be named B0, B1, B2, ..., B15). Any help will be appreciated.
Thanks & Regards,
Derric.
Answer
I don't know if it can be done more simply (without the for loop), but this does the trick:
for i in range(16):
    dfs['B'+str(i)] = dfs['BINDATA'].str[i]
The str attribute of the Series gives access to some vectorized string methods which act upon each element (see docs: http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods). In this case we just index the string to access the different characters.
This gives me:
In [20]: dfs
Out[20]:
BINDATA B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
0 1011111111101101 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1
1 1011101101111101 1 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1
2 1111111111110111 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
3 1110011111111111 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1
4 1111101111111000 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0
5 1101111001110101 1 1 0 1 1 1 1 0 0 1 1 1 0 1 0 1
6 1101111111111110 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0
If you want them as ints instead of strings, you can append .astype(int) to the right-hand side of the assignment in the for loop.
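For reference, the loop with the integer conversion could look roughly like the sketch below. The small frame and its values are made up for illustration (they are not read from test.txt), and the zero-padding via format(..., "016b") is an addition of this sketch: bin(...)[2:] drops leading zeros, so a value below 0x8000 would produce a string shorter than 16 characters and .str[i] would return missing values for the last positions.

```python
import pandas as pd

# Hypothetical sample frame mirroring the DATA column from the question.
df = pd.DataFrame({"DATA": ["BFED", "BB7D", "0001"]})

# Zero-pad to 16 bits so every binary string has the same length
# (an assumption of this sketch; the original script uses bin(...)[2:]).
df["BINDATA"] = df["DATA"].apply(lambda h: format(int(h, 16), "016b"))

# One integer column per character position.
# B0 is the leftmost character, i.e. the most significant bit.
for i in range(16):
    df["B" + str(i)] = df["BINDATA"].str[i].astype(int)

print(df)
```

Note that with this naming B0 is the most significant bit, which matches the output shown above; reverse the index if B0 should be the least significant bit.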
EDIT: Another way to do it (a one-liner, but you have to change the column names in a second step):
In [34]: splitted = dfs['BINDATA'].apply(lambda x: pd.Series(list(x)))
In [35]: splitted.columns = ['B'+str(x) for x in splitted.columns]
In [36]: dfs.join(splitted)
Out[36]:
BINDATA B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
0 1011111111101101 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1
1 1011101101111101 1 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1
2 1111111111110111 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
3 1110011111111111 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1
4 1111101111111000 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0
5 1101111001110101 1 1 0 1 1 1 1 0 0 1 1 1 0 1 0 1
6 1101111111111110 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0
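If the intermediate binary string is not needed at all, the bits can also be pulled straight out of the integer values with NumPy shifts and masks. This is a different technique from the two string-based approaches above, shown only as a sketch: the sample frame and the names values, shifts, bits and result are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame mirroring the DATA column from the question.
df = pd.DataFrame({"DATA": ["BFED", "BB7D", "FFF7"]})

# Parse each hex string into an unsigned 16-bit integer.
values = df["DATA"].apply(lambda h: int(h, 16)).to_numpy(dtype=np.uint16)

# Shift amounts 15..0 so that column B0 holds the most significant bit,
# matching the left-to-right character order of the string approach.
shifts = np.arange(15, -1, -1, dtype=np.uint16)

# Broadcasting: (n_rows, 1) >> (16,) gives an (n_rows, 16) array of 0/1 values.
bits = (values[:, None] >> shifts) & 1

bit_cols = pd.DataFrame(
    bits,
    columns=["B" + str(i) for i in range(16)],
    index=df.index,  # keep the original index so join lines the rows up
)

# join returns a new DataFrame, so the result has to be assigned.
result = df.join(bit_cols)
print(result)
```

The same point about join applies to the one-liner above: dfs.join(splitted) does not modify dfs in place, so the result should be assigned to a name if it is needed later.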