Read a tab separated file with first column as key and the rest as values

Views: 218 | Published: 2017/2/24 22:30:25 | Tags: python, csv, numpy, dictionary, pandas


Problem description

I have a tab-separated file with 1 billion lines like these (imagine 200 columns instead of 3):
abc -0.123  0.6524  0.325
foo -0.9808 0.874   -0.2341 
bar 0.23123 -0.123124   -0.1232
I want to create a dictionary where the string in the first column is the key and the rest are the values. I've been doing it like this but it's computationally expensive:
import io

dictionary = {}

with io.open('bigfile', 'r') as fin:
    for line in fin:
        kv = line.strip().split()
        k, v = kv[0], kv[1:]
        dictionary[k] = list(map(float, v))
How else can I get the desired dictionary? Actually, a NumPy array would be more appropriate than a list of floats for the value.
Solution
You can use pandas to load the df, then construct a new df as desired and then call to_dict:
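In [99]:

import io
import pandas as pd

t="""abc -0.123  0.6524  0.325
foo -0.9808 0.874   -0.2341 
bar 0.23123 -0.123124   -0.1232"""
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None)
# df.ix was removed in pandas 1.0; iloc selects the value columns by position
df = pd.DataFrame(columns = df[0], data = df.iloc[:,1:].values)
df.to_dict()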
df.to_dict()
Out[99]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}
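In the recorded output above, each key is paired with a column of the sample rather than with its own row (for example, 'abc' holds the first numeric value of every line). If the goal is the row-wise mapping the question asks for, one possible variant is to read the first column as the index and transpose before calling to_dict. This is only a sketch: the names `sample` and `row_dict` are made up for illustration, and it assumes the file really is tab-delimited.

```python
import io

import pandas as pd

# Tab-separated sample taken from the question (names here are illustrative).
sample = (
    "abc\t-0.123\t0.6524\t0.325\n"
    "foo\t-0.9808\t0.874\t-0.2341\n"
    "bar\t0.23123\t-0.123124\t-0.1232\n"
)

# index_col=0 turns the first column into the row labels.
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, index_col=0)

# After transposing, every key is a column, so orient="list"
# produces {key: [values from that key's original row]}.
row_dict = df.T.to_dict("list")

print(row_dict["abc"])
```

Because the shape comes from the file itself, this variant does not require the number of rows to equal the number of value columns.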
EDIT

A more dynamic method, which avoids constructing a temporary df:
In [121]:

t="""abc -0.123  0.6524  0.325
foo -0.9808 0.874   -0.2341 
bar 0.23123 -0.123124   -0.1232"""
# determine the number of cols, we'll use this in usecols
col_len = pd.read_csv(io.StringIO(t), sep=r'\s+', nrows=1).shape[1]
col_len
# read the first col, we'll use this in names
cols = pd.read_csv(io.StringIO(t), sep=r'\s+', usecols=[0], header=None)[0].values
# now read and construct the df using the determined usecols and names from above
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, usecols = list(range(1, col_len)), names = cols)
df.to_dict()
Out[121]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}
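Since the question mentions 1 billion lines and prefers NumPy arrays as values, reading the file in chunks may be worth trying so that no single huge DataFrame is ever built. The following is a minimal sketch, not a benchmarked solution: the function name `read_key_value_file`, the default `chunksize`, and the tab delimiter are all assumptions made for illustration.

```python
import io

import numpy as np
import pandas as pd


def read_key_value_file(source, chunksize=100_000):
    """Build {key: 1-D float array of the remaining columns}, chunk by chunk."""
    dictionary = {}
    reader = pd.read_csv(
        source, sep="\t", header=None, index_col=0, chunksize=chunksize
    )
    for chunk in reader:
        values = chunk.to_numpy(dtype=np.float64)
        # Iterating a 2-D array yields its rows, so each key gets its own row.
        dictionary.update(zip(chunk.index, values))
    return dictionary


# Small in-memory stand-in for the real file, using the question's sample.
sample = (
    "abc\t-0.123\t0.6524\t0.325\n"
    "foo\t-0.9808\t0.874\t-0.2341\n"
    "bar\t0.23123\t-0.123124\t-0.1232\n"
)
result = read_key_value_file(io.StringIO(sample), chunksize=2)
print(result["foo"])
```

For the real data, the file path (for example 'bigfile' from the question) would be passed in place of the StringIO object. Whether this is actually faster than the plain loop depends on the data and machine, so it is best measured on a slice of the file first.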
Further update

Actually you don't need the first read: the number of value columns can be derived from the length of the first column. This relies on the data being square (as many rows as value columns), which happens to hold for this sample:
In [128]:

t="""abc -0.123  0.6524  0.325
foo -0.9808 0.874   -0.2341 
bar 0.23123 -0.123124   -0.1232"""
cols = pd.read_csv(io.StringIO(t), sep=r'\s+', usecols=[0], header=None)[0].values
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, usecols = list(range(1, len(cols)+1)), names = cols)
df.to_dict()
Out[128]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}
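For data that is not square (such as the 200-column file described in the question), one alternative is to keep all values in a single 2-D NumPy array and let the dictionary map each key to its row number. The sketch below is offered as an assumption-laden illustration rather than as the answer's method: `matrix`, `row_of`, and `lookup` are hypothetical names, and the sample deliberately has two rows and three value columns.

```python
import io

import numpy as np
import pandas as pd

# Non-square, tab-separated sample: 2 keys, 3 values each.
sample = (
    "abc\t-0.123\t0.6524\t0.325\n"
    "foo\t-0.9808\t0.874\t-0.2341\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, index_col=0)

# One contiguous float array holding every value column.
matrix = df.to_numpy(dtype=np.float64)

# The dictionary stores only the row position of each key.
row_of = {key: i for i, key in enumerate(df.index)}


def lookup(key):
    """Return the row of values for `key` as a 1-D NumPy array."""
    return matrix[row_of[key]]


print(lookup("foo"))
```

The design choice here is to keep a single array instead of one small array per key, which may reduce per-object overhead when there are very many keys.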


                        