How can I read a .dat file containing multiple tables into a pandas data frame?

Question

I have a measuring device that records data in .dat files like raw_data.dat in this folder. The files all have the same structure, and I want to extract the last table in a file into a pandas data frame.

The file contains several tables. I am not sure whether the tab-delimited layout here is standard for .dat files, but when I paste the text into Excel it recognises the tables as separate, so there is probably a fairly standard way to read the structure correctly into Python. I couldn't find one, so I tried a really convoluted approach: reading the .dat file line by line, manually chopping off the top part of the file, and saving the rest as a new .dat file. My hope is to then save the result as a .csv or .xls, but I can't find how to do that either. Furthermore, after importing, the tabs show up as literal \t sequences, and they do not turn back into tabs in the saved file. My code is below:


mylines = []                             
with open ('raw_file.dat', 'rt') as myfile:
    for myline in myfile:
        mylines.append(myline)

string = (mylines[8:])

with open("updated.dat", "w") as output:
    output.write(str(string))

I must admit I am fairly new to Python, and I am not certain I am using these functions correctly. Still, I hope there is a more straightforward way to go about it than the workaround I am attempting.

Recommended answer

If you can be sure that the third table you want starts at the 8th line, there is no reason to make it more complicated than slicing the file's lines from that point onward (the [8:] slice below skips the first eight lines). From there, you can use string manipulation and a list comprehension to clean your data:

import pandas as pd

# Read the data.
with open('raw_data.dat', 'r') as fh:
    lines = fh.readlines()[8:]

# Strip the trailing newline, drop tab characters, and split each line on whitespace.
clean = [line.strip().replace('\t', '').split() for line in lines]

# Feed the data into a DataFrame.
data = pd.DataFrame(clean[1:], columns=clean[0])

This outputs:

               Time         Variab1e1  ...               v18               v19
0  +0.00000000e+000  +3.04142181e-002  ...  +0.00000000e+000  +0.00000000e+000
1  +1.00000000e+000  +1.96144191e-001  ...  +1.00000000e+000  +0.00000000e+000
2  +2.00000000e+000  +3.75599731e-001  ...  +2.00000000e+000  +0.00000000e+000
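
For comparison, pandas can do the skipping and splitting itself. The sketch below is an alternative to the approach above, not part of the original answer. It assumes the table's header row comes right after the first eight lines and that columns are separated by whitespace (spaces, tabs, or both). The sample text and its column names are made up so the snippet runs on its own; with a real file you would pass its path, for example 'raw_data.dat', in place of the StringIO object.

```python
import io

import pandas as pd

# Hypothetical stand-in for the file: eight lines of preamble
# (earlier tables, metadata), followed by a whitespace-separated table.
preamble = "".join(f"preamble line {i}\n" for i in range(8))
table = (
    "Time\tVariable1\tVariable2\n"
    "+0.00000000e+000\t+3.04142181e-002\t+0.00000000e+000\n"
    "+1.00000000e+000\t+1.96144191e-001\t+1.00000000e+000\n"
)
sample = io.StringIO(preamble + table)

# skiprows=8 drops the preamble; sep=r"\s+" splits on any run of
# spaces or tabs. The first remaining line is used as the header.
data = pd.read_csv(sample, skiprows=8, sep=r"\s+")
print(data)
```

If the number of lines before the table differs between files, the hard-coded 8 would need to change accordingly.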

If you want to convert the values into floats, you can do this before converting the data into a DataFrame:

headers = clean[0]
rows = [[float(value) for value in row] for row in clean[1:]]

data = pd.DataFrame(rows, columns=headers)

This gives you a much cleaner frame:

   Time  Variab1e1  Variable2  Variable3  Variable4  ...  v15  v16   v17  v18  v19
0   0.0   0.030414        0.0   1.383808        0.0  ...  0.0  0.0  15.0  0.0  0.0
1   1.0   0.196144        1.0   7.660262        1.0  ...  0.0  1.0  15.0  1.0  0.0
2   2.0   0.375600        2.0  15.356726        2.0  ...  0.0  2.0  15.0  2.0  0.0
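
The question also asks how to save the result as a .csv. A minimal sketch, assuming a frame built like the one above: the small `clean` list, the column names, and the output file name last_table.csv are placeholders for illustration, and the file is written to the system temporary directory only to keep the example tidy.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned token lists built earlier.
clean = [
    ["Time", "Variable1"],
    ["+0.00000000e+000", "+3.04142181e-002"],
    ["+1.00000000e+000", "+1.96144191e-001"],
]

# Build the frame and convert every column from strings to floats.
data = pd.DataFrame(clean[1:], columns=clean[0]).astype(float)

# index=False leaves out the 0, 1, 2, ... row labels.
out_path = os.path.join(tempfile.gettempdir(), "last_table.csv")
data.to_csv(out_path, index=False)

# Reading the file back returns a numeric frame with the same columns.
restored = pd.read_csv(out_path)
print(restored)
```

For an Excel file, DataFrame.to_excel works in a similar way, but it needs an Excel writer package such as openpyxl to be installed.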
