Importing a file containing text and numerical data using Python

Question

I have a .txt file that contains both text and numerical data. The first two rows of the file hold essential information as text, and the first column (by "first column" I mean the zeroth column) also holds essential data as text. Everywhere else in the file, the data is numerical. I want to analyze the numerical data using Python libraries, preferably numpy or pandas or a combination of both (analysis such as regression, correlation, scikit-learn, etc.). I reiterate that all of the data in the file is essential for my analysis. The following snapshot (taken from Excel) shows a truncated version of the format my data is in:

The data shown in this snapshot can be found here.

In particular, I want to import all the numerical data from this file using Python (numpy or pandas) and be able to refer to specific entries in it using the text in the first two rows (Type, Tag) and the first column (object number). My actual data file has hundreds of thousands of rows (object types) and scores of columns.

I have already tried numpy.loadtxt(...) and pandas.read_csv(...) to open this file, but I have either run into errors or ended up with the data in clumsy formats. I would be very grateful for some direction on how to import the file in Python so that I get the functionality I want.

Recommended answer

If I were you, I would use pandas and import the file with something like this:

import pandas as pd
df = pd.read_csv('dum.txt', sep='\t', header=[0, 1], index_col=0)

This gives you the following DataFrame:

>>> df
Type      T1   T2   T3   T4   T5
Tag     Good Good Good Good Good
object1  1.1  2.1  3.1  4.1  5.1
object2  1.2  2.2  3.2  4.2  5.2
object3  1.3  2.3  3.3  4.3  5.3
object4  1.4  2.4  3.4  4.4  5.4
object5  1.5  2.5  3.5  4.5  5.5
object6  1.6  2.6  3.6  4.6  5.6
object7  1.7  2.7  3.7  4.7  5.7
object8  1.8  2.8  3.8  4.8  5.8
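For readers who do not have the original file, here is a minimal sketch that builds a small tab-separated sample in memory and loads it the same way. The sample text, the `Bad` tag, and the name `sample` are assumptions made for illustration; only the `read_csv` arguments come from the answer.

```python
import io

import pandas as pd

# Hypothetical stand-in for the tab-separated file: two header rows
# (Type, Tag), then one row per object. Values are illustrative only.
sample = (
    "Type\tT1\tT2\tT3\n"
    "Tag\tGood\tBad\tGood\n"
    "object1\t1.1\t2.1\t3.1\n"
    "object2\t1.2\t2.2\t3.2\n"
    "object3\t1.3\t2.3\t3.3\n"
)

# header=[0, 1] turns the first two rows into a two-level column index;
# index_col=0 turns the first column into the row index.
df = pd.read_csv(io.StringIO(sample), sep="\t", header=[0, 1], index_col=0)

print(df.shape)                # rows x numeric columns
print(list(df.columns.names))  # names of the two column levels
print(df.index.tolist())       # object labels taken from the first column
```

Replacing `io.StringIO(sample)` with a file path should behave the same way, provided the real file uses tabs as separators.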

All of your columns are floats:

>>> df.dtypes
Type  Tag 
T1    Good    float64
T2    Good    float64
T3    Good    float64
T4    Good    float64
T5    Good    float64
dtype: object

It has a multi-indexed column header:

>>> df.columns
MultiIndex(levels=[['T1', 'T2', 'T3', 'T4', 'T5'], ['Good']],
           labels=[[0, 1, 2, 3, 4], [0, 0, 0, 0, 0]],
           names=['Type', 'Tag'])

And a regular index containing the object names from the first column:

>>> df.index
Index(['object1', 'object2', 'object3', 'object4', 'object5', 'object6',
       'object7', 'object8'],
      dtype='object')
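With both indexes in place, the data can be addressed by Type, Tag, and object name, which is what the question asks for. The following is a sketch under assumptions: the in-memory sample and the `Bad` tag are invented so that filtering by Tag has a visible effect.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as the file described above.
sample = (
    "Type\tT1\tT2\tT3\n"
    "Tag\tGood\tBad\tGood\n"
    "object1\t1.1\t2.1\t3.1\n"
    "object2\t1.2\t2.2\t3.2\n"
    "object3\t1.3\t2.3\t3.3\n"
)
df = pd.read_csv(io.StringIO(sample), sep="\t", header=[0, 1], index_col=0)

# One cell: row label plus a (Type, Tag) tuple for the column.
cell = df.loc["object2", ("T1", "Good")]

# All columns of one Type; the result keeps the Tag level as its columns.
t1 = df["T1"]

# All columns carrying a given Tag, selected on the second header level.
good = df.xs("Good", axis=1, level="Tag")

# One object's full row, as a Series indexed by (Type, Tag).
row = df.loc["object3"]

print(cell)
print(list(good.columns))
```

`df.xs` drops the selected level by default, so `good` is left with plain Type labels as its column names.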

Furthermore, you can convert your values to a numpy array of floats simply by using:

>>> df.values
array([[1.1, 2.1, 3.1, 4.1, 5.1],
       [1.2, 2.2, 3.2, 4.2, 5.2],
       [1.3, 2.3, 3.3, 4.3, 5.3],
       [1.4, 2.4, 3.4, 4.4, 5.4],
       [1.5, 2.5, 3.5, 4.5, 5.5],
       [1.6, 2.6, 3.6, 4.6, 5.6],
       [1.7, 2.7, 3.7, 4.7, 5.7],
       [1.8, 2.8, 3.8, 4.8, 5.8]])
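As a rough illustration of the kind of analysis mentioned in the question, the sketch below computes a correlation matrix and a simple linear fit on a small sample. The sample data and the choice of columns are assumptions; `DataFrame.to_numpy()` is used as the currently recommended equivalent of `.values`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample in the same layout as the file described above.
sample = (
    "Type\tT1\tT2\tT3\n"
    "Tag\tGood\tBad\tGood\n"
    "object1\t1.1\t2.1\t3.1\n"
    "object2\t1.2\t2.2\t3.2\n"
    "object3\t1.3\t2.3\t3.3\n"
)
df = pd.read_csv(io.StringIO(sample), sep="\t", header=[0, 1], index_col=0)

# Plain 2-D float array, with the labels stripped off.
values = df.to_numpy()

# Pairwise Pearson correlation between columns; labels are preserved.
corr = df.corr()

# Least-squares straight line predicting one column from another.
x = df[("T1", "Good")].to_numpy()
y = df[("T2", "Bad")].to_numpy()
slope, intercept = np.polyfit(x, y, deg=1)

print(values.shape)
print(corr.loc[("T1", "Good"), ("T2", "Bad")])
print(slope, intercept)
```

The `values` array can also be passed to scikit-learn estimators, while `df` keeps the labels needed to map results back to Type, Tag, and object.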
