Can pandas handle variable-length whitespace as column delimiters

Views: 105 | Published: 2020/5/23 22:17:55 | Tags: python, pandas

Question

I have a text file whose columns are separated by variable amounts of whitespace. Is it possible to load this file directly as a pandas DataFrame without pre-processing it? The delimiter section of the pandas documentation says I can use a 's*' construct, but I couldn't get this to work.

## sample data
head sample.txt

#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
ABC_membrane         PF00664.18   275 AAF67494.2_AF170880  -            615     8e-29  100.7  11.4   1   1     3e-32     1e-28  100.4   7.9     3   273    42   313    40   315 0.95 ABC transporter transmembrane region
ABC_tran             PF00005.22   118 AAF67494.2_AF170880  -            615   2.6e-20   72.8   0.0   1   1   1.9e-23   6.4e-20   71.5   0.0     1   118   402   527   402   527 0.93 ABC transporter
SMC_N                PF02463.14   220 AAF67494.2_AF170880  -            615   3.8e-08   32.7   0.2   1   2    0.0036        12    4.9   0.0    27    40   391   404   383   408 0.86 RecF/RecN/SMC N terminal domain
SMC_N                PF02463.14   220 AAF67494.2_AF170880  -            615   3.8e-08   32.7   0.2   2   2   1.8e-09   6.1e-06   25.4   0.0   116   210   461   568   428   575 0.85 RecF/RecN/SMC N terminal domain
AAA_16               PF13191.1    166 AAF67494.2_AF170880  -            615   3.1e-06   27.5   0.3   1   1     2e-09     7e-06   26.4   0.2    20   158   386   544   376   556 0.72 AAA ATPase domain
YceG                 PF02618.11   297 AAF67495.1_AF170880  -            284   3.4e-64  216.6   0.0   1   1   2.9e-68     4e-64  216.3   0.0    68   296    53   274    29   275 0.85 YceG-like family
Pyr_redox_3          PF13738.1    203 AAF67496.2_AF170880  -            352   2.9e-28   99.1   0.0   1   2   2.8e-30   4.8e-27   95.2   0.0     1   201     4   198     4   200 0.85 Pyridine nucleotide-disulphide oxidoreductase

#load data
from pandas import *
data = read_table('sample.txt', skiprows=3, header=None, sep=" ")

ValueError: Expecting 83 columns, got 91 in row 4

#load data part 2
data = read_table('sample.txt', skiprows=3, header=None, sep="'s*' ")
#this mushes some of the columns into the first column and drops the rest.
    X.1
1    ABC_tran PF00005.22 118 AAF67494.2_
2    SMC_N PF02463.14 220 AAF67494.2_
3    SMC_N PF02463.14 220 AAF67494.2_
4    AAA_16 PF13191.1 166 AAF67494.2_
5    YceG PF02618.11 297 AAF67495.1_
6    Pyr_redox_3 PF13738.1 203 AAF67496.2_
7    Pyr_redox_3 PF13738.1 203 AAF67496.2_
8    FMO-like PF00743.14 532 AAF67496.2_
9    FMO-like PF00743.14 532 AAF67496.2_

While I can preprocess the files to change the whitespace to commas or tabs, it would be nice to load them directly.

(FYI, this is the *.hmmdomtblout output from the hmmscan program.)

Recommended answer

I think the docs are just missing a \ (maybe because it was interpreted as an escape character at some point?). It's a regexp, after all:

In [68]: data = read_table('sample.txt', skiprows=3, header=None, sep=r"\s*")

In [69]: data
Out[69]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns:
X.1     7  non-null values
X.2     7  non-null values
X.3     7  non-null values
X.4     7  non-null values
X.5     7  non-null values
X.6     7  non-null values
[...]
X.23    7  non-null values
X.24    7  non-null values
X.25    5  non-null values
X.26    3  non-null values
dtypes: float64(8), int64(10), object(8)
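
The session above comes from an old pandas release. As a rough sketch of the same idea on a current install, the snippet below uses `read_csv` with `sep=r"\s+"` rather than `\s*`, since `\s*` can also match the empty string and `\s+` is the safer pattern. The inline `sample` text is an abbreviated, illustrative cut of the question's data (first six columns only), and `comment="#"` is used here in place of `skiprows=3` on the assumption that every header line starts with `#`.

```python
import io

import pandas as pd

# Abbreviated stand-in for sample.txt: only the first six columns are kept,
# so no field contains embedded spaces (an assumption of this sketch).
sample = """\
# target name        accession   tlen query name           accession   qlen
#------------------- ---------- ----- -------------------- ---------- -----
ABC_membrane         PF00664.18   275 AAF67494.2_AF170880  -            615
ABC_tran             PF00005.22   118 AAF67494.2_AF170880  -            615
SMC_N                PF02463.14   220 AAF67494.2_AF170880  -            615
"""

# sep=r"\s+" treats any run of whitespace as one delimiter;
# comment="#" skips the header lines that begin with '#'.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", comment="#", header=None)

print(df)
```

With a real file, the `io.StringIO(sample)` argument would simply be replaced by the path, for example `'sample.txt'`.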

Because of the delimiter problem noted by @MRAB, it has some trouble with the last few columns:

In [73]: data.ix[:,20:]
Out[73]: 
   X.21  X.22           X.23                   X.24            X.25    X.26
0   315  0.95            ABC            transporter   transmembrane  region
1   527  0.93            ABC            transporter            None    None
2   408  0.86  RecF/RecN/SMC                      N        terminal  domain
3   575  0.85  RecF/RecN/SMC                      N        terminal  domain
4   556  0.72            AAA                 ATPase          domain    None
5   275  0.85      YceG-like                 family            None    None
6   200  0.85       Pyridine  nucleotide-disulphide  oxidoreductase    None

But that can be patched up at the end.
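
One possible way to do that patching, sketched here rather than taken from the answer, is to avoid the problem up front: split each line on whitespace at most 22 times so that the free-text description stays in a single final field. This assumes the first 22 fields never contain whitespace, which matches the column header shown in the question; `read_domtbl` and `N_FIXED` are hypothetical names introduced for illustration. The last line also uses `.iloc`, because the `.ix` indexer shown above has been removed from recent pandas versions.

```python
import io

import pandas as pd

N_FIXED = 22  # assumed number of whitespace-free fields before the description

# Two data rows copied from the question, plus one comment line.
sample = """\
# target name        accession   tlen query name ...
ABC_membrane         PF00664.18   275 AAF67494.2_AF170880  -            615     8e-29  100.7  11.4   1   1     3e-32     1e-28  100.4   7.9     3   273    42   313    40   315 0.95 ABC transporter transmembrane region
ABC_tran             PF00005.22   118 AAF67494.2_AF170880  -            615   2.6e-20   72.8   0.0   1   1   1.9e-23   6.4e-20   71.5   0.0     1   118   402   527   402   527 0.93 ABC transporter
"""


def read_domtbl(handle, n_fixed=N_FIXED):
    """Parse whitespace-delimited rows, keeping the trailing description whole."""
    rows = []
    for line in handle:
        # Skip comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        # Split on runs of whitespace at most n_fixed times; whatever is left
        # (the description, spaces included) becomes the last field.
        rows.append(line.strip().split(maxsplit=n_fixed))
    # All values are strings at this point; numeric columns can be converted
    # afterwards, for example with pd.to_numeric on selected columns.
    return pd.DataFrame(rows)


df = read_domtbl(io.StringIO(sample))

# Position-based equivalent of the answer's data.ix[:, 20:]
tail = df.iloc[:, 20:]
print(tail)
```

For a file on disk, `read_domtbl(open('sample.txt'))` would be the analogous call, ideally inside a `with` block so the file is closed afterwards.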
