Tab files into pandas dataframe according to columns with missing headers
Problem description
How can I turn a tab-delimited file with empty column headers into a DataFrame? More specifically, how can I fill this DataFrame only with the values whose adjacent unlabeled column contains a given letter, in this case 'P'?
This is a representation of the tab file I'm using. Note that the columns holding A or P have no headers.
gene cell_1 cell_2
MYC 5.0 P 4.0 A
AKT 3.0 A 1.0 P
The desired DataFrame would look like this:
gene cell_1 cell_2
MYC 5.0 NaN
AKT NaN 1.0
What is the best way to tackle this problem using pandas?
Recommended answer
I tried to implement a few different approaches that show fancy indexing and masking methods. Let me know if you have any questions.
import numpy as np
import pandas as pd

# Load data (tab-separated; the A/P flag columns have empty headers)
string_data = "gene\tcell_1\t\tcell_2\t\nMYC\t5.0\tP\t4.0\tA\nAKT\t3.0\tA\t1.0\tP"
A_pre = np.array([row.split("\t") for row in string_data.split("\n")])
DF_data = pd.DataFrame(A_pre[1:, 1:],
                       index=pd.Series(A_pre[1:, 0], name=A_pre[0, 0]),
                       columns=A_pre[0, 1:])
# as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
# The array is quicker to slice than the DataFrame.
A_data = DF_data.to_numpy()
rowLabels, colLabels = DF_data.index, DF_data.columns

# Get blank columns
gene_idx = np.where(np.array(colLabels) != "")[0]  # positions of the labeled columns, used later
numColBlank = len(colLabels) - len(gene_idx)

# Placeholder to fill
DF_placeholder = pd.DataFrame(np.zeros((DF_data.shape[0], DF_data.shape[1] - numColBlank)),
                              index=DF_data.index,
                              columns=DF_data.columns[gene_idx])

# Populate matrix
query = "P"
for i in range(DF_data.shape[0]):
    for j in range(DF_data.shape[1]):
        # A blank header marks a flag column; its value belongs to the column on its left
        if colLabels[j] == "" and A_data[i, j] == query:
            cell = colLabels[j - 1]
            gene = rowLabels[i]
            metric = float(A_data[i, j - 1])  # values were read in as strings
            DF_placeholder.loc[gene, cell] = metric

# I just found out about masks; they are useful.
# Note: a genuine 0.0 value flagged "P" would also be turned into NaN here.
mask = DF_placeholder == 0.0
DF_placeholder[mask] = np.nan
DF_processed = DF_placeholder
DF_processed
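As a possible shorter alternative to the loop above, here is a minimal sketch that lets pandas parse the file and then masks the values in one step. It assumes the file really is tab-separated and that every measurement column is immediately followed by its unlabeled A/P column; the names `values`, `flags`, `keep` and `result` are only illustrative.

```python
import io

import numpy as np
import pandas as pd

# Same sample data as in the answer, assumed to be tab-separated.
string_data = "gene\tcell_1\t\tcell_2\t\nMYC\t5.0\tP\t4.0\tA\nAKT\t3.0\tA\t1.0\tP"

# read_csv gives header-less columns placeholder names such as "Unnamed: 2".
df = pd.read_csv(io.StringIO(string_data), sep="\t", index_col=0)

values = df.iloc[:, 0::2]  # measurement columns: cell_1, cell_2
flags = df.iloc[:, 1::2]   # the unlabeled A/P columns that follow them

# Keep a value only where its adjacent flag is "P"; everything else becomes NaN.
keep = (flags == "P").to_numpy()
result = values.where(keep)
print(result)
```

Selecting the columns by position avoids depending on the placeholder names, and because nothing is compared against 0.0, a real zero flagged 'P' is kept. For a file on disk, the `io.StringIO(...)` argument would be replaced by the file path.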