Tab files into pandas dataframe according to columns with missing headers


Problem description

How can I turn a tab-delimited file with empty column headers into a dataframe? More specifically, how can I fill this dataframe only with the values whose adjacent unlabeled column contains a given letter, in this case 'P'?

This is a representation of the tab file I'm using. Note that the columns holding A or P have no headers.

gene   cell_1      cell_2  
MYC    5.0     P   4.0     A
AKT    3.0     A   1.0     P

The desired dataframe would look like this:

gene   cell_1   cell_2  
MYC    5.0      NaN
AKT    NaN      1.0

What is the best way to tackle this problem using pandas?

Recommended answer

I implemented this in a way that shows a few different fancy indexing and masking methods. Let me know if you have any questions.
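As a quick illustration of the two ideas mentioned above, here is a minimal sketch on toy arrays that mirror the example table. The names `values`, `flags`, `first_col`, and `kept` are made up for this illustration and do not appear in the answer's code.

```python
import numpy as np

# Toy arrays mirroring the example table: measurements and their P/A flags.
values = np.array([[5.0, 4.0],
                   [3.0, 1.0]])
flags = np.array([["P", "A"],
                  ["A", "P"]])

# Fancy indexing: select columns by a list of integer positions.
first_col = values[:, [0]]  # keeps two dimensions, shape (2, 1)

# Boolean masking: keep the values flagged "P" and put NaN everywhere else.
mask = flags == "P"
kept = np.where(mask, values, np.nan)
print(kept)
```

The loop-based code below reaches the same result cell by cell; the mask simply does the comparison for the whole array at once.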

import numpy as np
import pandas as pd

# Load data
string_data = "gene cell_1  cell_2 \nMYC 5.0 P 4.0 A\nAKT 3.0 A 1.0 P"
A_pre = np.array([row.split(" ") for row in string_data.split("\n")])
DF_data = pd.DataFrame(A_pre[1:,1:],
                       index=pd.Series(A_pre[1:,0],name=A_pre[0,0]),
                       columns=A_pre[0,1:])

A_data = DF_data.to_numpy() # as_matrix() was removed in pandas 1.0; a plain array is quicker to slice than a DataFrame
rowLabels, colLabels = DF_data.index, DF_data.columns

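# Positions of the labeled (non-blank) columns; the remaining columns are the blank flag columns
gene_idx = np.where(np.array(colLabels) != "")[0] # Used later to pick the output columns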
numColBlank = len(colLabels) - len(gene_idx)

# #Placeholder to fill
DF_placeholder = pd.DataFrame(np.zeros((DF_data.shape[0],DF_data.shape[1] - numColBlank)),
                              index = DF_data.index,
                              columns = DF_data.columns[gene_idx]
                              )
DF_data

#Populate matrix
query = "P"
for i in range(DF_data.shape[0]):
    for j in range(DF_data.shape[1]):
        if colLabels[j] == "":
            if A_data[i,j] == query:
                cell = colLabels[j-1]
                gene = rowLabels[i]
                metric = float(A_data[i,j-1]) # values were parsed as strings, so convert before storing
                DF_placeholder.loc[gene,cell] = metric

# Boolean mask: cells still at 0.0 were never filled, so set them to NaN
# (caveat: a genuine 0.0 measurement would be blanked out as well)
mask = DF_placeholder == 0.0
DF_placeholder[mask] = np.nan
DF_processed = DF_placeholder
DF_processed
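For comparison, the same result can probably be reached without explicit loops. The following is only a sketch under stated assumptions, not the answer's method: the string `raw` is a hypothetical stand-in for the real file, the file is assumed to be tab-separated, and the columns after `gene` are assumed to alternate strictly between a value column and its unlabeled flag column.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real file (assumed tab-separated, flag columns unlabeled).
raw = "gene\tcell_1\t\tcell_2\t\nMYC\t5.0\tP\t4.0\tA\nAKT\t3.0\tA\t1.0\tP\n"

# Read without a header row so the blank header cells are kept as plain empty strings.
table = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                    dtype=str, keep_default_na=False)

header = table.iloc[0]               # first row holds the (partly blank) column names
body = table.iloc[1:].set_index(0)   # remaining rows, indexed by gene
body.index.name = header.loc[0]      # "gene"

# Assumed layout: value, flag, value, flag, ... after the gene column.
value_cols = body.columns[0::2]
flag_cols = body.columns[1::2]

values = body[value_cols].astype(float)
values.columns = header.loc[value_cols].tolist()   # cell_1, cell_2
flags = body[flag_cols]

# Keep a value only where its neighbouring flag equals the query letter.
query = "P"
result = values.where(flags.to_numpy() == query)
print(result)
```

For a file on disk, the `io.StringIO(raw)` argument would be replaced by the file path. Because unmatched cells start out as NaN through `where`, a genuine 0.0 measurement is preserved here.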
