Tab files into pandas dataframe according to columns with missing headers

Views: 113 | Published: 2017/3/26 3:18:30 | Tags: python pandas dataframe


Problem description

How can I turn a tab-delimited file with empty column headers into a DataFrame? More specifically, how can I fill this DataFrame only with the values whose adjacent unlabeled column contains a given letter, in this case 'P'?

This is a representation of the tab-delimited file I'm using. Note that the columns holding A or P have no headers.

gene   cell_1      cell_2  
MYC    5.0     P   4.0     A
AKT    3.0     A   1.0     P

The desired DataFrame would look like this:

gene   cell_1   cell_2  
MYC    5.0      NaN
AKT    NaN      1.0

What is the best way to tackle this problem using pandas?

Recommended answer

I tried to implement a few different approaches that show fancy indexing and masking methods. Let me know if you have any questions.

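# Imports needed by the code below
import numpy as np
import pandas as pd

# Load data (the doubled and trailing spaces in the header row stand in for the blank column headers)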
string_data = "gene cell_1  cell_2 \nMYC 5.0 P 4.0 A\nAKT 3.0 A 1.0 P"
A_pre = np.array([row.split(" ") for row in string_data.split("\n")])
DF_data = pd.DataFrame(A_pre[1:,1:],
                       index=pd.Series(A_pre[1:,0],name=A_pre[0,0]),
                       columns=A_pre[0,1:])

A_data = DF_data.to_numpy() # as_matrix() was removed in pandas 1.0; a NumPy array is quicker to slice than a DataFrame
rowLabels, colLabels = DF_data.index, DF_data.columns

# Find the positions of the labeled columns and count the blank ones
gene_idx = np.where(np.array(colLabels) != "")[0] # Positions of the labeled (cell) columns; used later
numColBlank = len(colLabels) - len(gene_idx)

# Placeholder to fill: one zero-initialized column per labeled header
DF_placeholder = pd.DataFrame(np.zeros((DF_data.shape[0],DF_data.shape[1] - numColBlank)),
                              index = DF_data.index,
                              columns = DF_data.columns[gene_idx]
                              )
DF_data

#Populate matrix
query = "P"
for i in range(DF_data.shape[0]):
    for j in range(DF_data.shape[1]):
        if colLabels[j] == "":
            if A_data[i,j] == query:
                cell = colLabels[j-1]
                gene = rowLabels[i]
                metric = float(A_data[i,j-1]) # values were parsed as strings, so convert before storing
                DF_placeholder.loc[gene,cell] = metric

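# Masks are useful here: select the cells that were never filled (still 0.0) and set them to NaN.
# Caveat: a genuine 0.0 measurement would be blanked as well.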
mask = DF_placeholder == 0.0
DF_placeholder[mask] = np.nan
DF_processed = DF_placeholder
DF_processed
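
As a possible alternative to the nested loop above, the same masking idea can be sketched in vectorized form with `DataFrame.where`, reading tab-delimited text directly. This is only a sketch under stated assumptions: `tab_text` is a hypothetical stand-in for the real file, every labeled column is assumed to be followed immediately by its unlabeled A/P column, and the names `value_pos`, `flag_pos`, `values`, `flags` and `result` are illustrative.

```python
import io

import pandas as pd

# Hypothetical stand-in for the tab-delimited file from the question.
tab_text = "gene\tcell_1\t\tcell_2\t\nMYC\t5.0\tP\t4.0\tA\nAKT\t3.0\tA\t1.0\tP\n"

# Read everything as strings with no header row, so the blank headers
# stay as empty strings instead of being renamed or turned into NaN.
raw = pd.read_csv(io.StringIO(tab_text), sep="\t", header=None,
                  dtype=str, keep_default_na=False)

header = raw.iloc[0]                 # first row holds the (partly blank) headers
body = raw.iloc[1:].set_index(0)     # remaining rows, indexed by gene name
body.index.name = header[0]          # "gene"

# Assumption: each labeled column is immediately followed by its flag column.
value_pos = [i for i in range(1, raw.shape[1]) if header[i] != ""]
flag_pos = [i + 1 for i in value_pos]

values = body[value_pos].astype(float)
values.columns = [header[i] for i in value_pos]
flags = body[flag_pos]
flags.columns = values.columns

# Keep a value only where its adjacent flag equals the query letter; else NaN.
query = "P"
result = values.where(flags == query)
print(result)
```

For a real file, the path would replace `io.StringIO(tab_text)` in the `pd.read_csv` call. Because unmatched cells come straight out of `where` as NaN, this variant does not need a zero-filled placeholder, so genuine 0.0 measurements are left intact.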
