Pandas read text file slicing columns according to header


Problem description

Imagine a text file that looks like this:

Places       Person  Number         Comments
   bar    anastasia      75        very lazy
  home        jimmy           nothing to say
 beach                    2                 

Consider the first line as the header containing the column names I want for my pandas data frame. You can see that there are empty cells, and there is a column that contains strings with spaces. There is a readable pattern in this file: columns are separated by at least 2 spaces, and the contents of each column can be read from the end point of the previous column name to the end point of that column's own name. There is no ambiguity in this.

If I do

df = pd.read_csv('text_file.txt')

I will get a 3 x 1 data frame where the only column is called "Places Person Number Comments". So it fails to understand the table format.

If I do

df = pd.read_csv('text_file.txt', delim_whitespace = True)

It will create the right number of columns, but it won't understand the spaces inside the Comments values: it splits each comment and sends the pieces to different cells, like so:

          Places   Person Number Comments
bar    anastasia       75   very     lazy
home       jimmy  nothing     to      say
beach          2      NaN    NaN      NaN


If I do

df = pd.read_csv('text_file.txt', sep = r'\s{2,}', engine = 'python')

It will understand that values belong to different columns only when they are separated by two or more spaces. So that's correct. But it won't be able to understand that there are empty cells, and it will wrongly shift cells from one column to another.

  Places     Person          Number   Comments
0    bar  anastasia              75  very lazy
1   home      jimmy  nothing to say       None
2  beach          2            None       None
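
To see why the cells shift, here is a minimal sketch that splits each data line on runs of two or more spaces with Python's `re` module. It only imitates the separator logic and is not the pandas parser itself; stripping each line first is an assumption made to keep the output readable. An empty cell is just more whitespace, so it merges into the neighbouring gap and the row ends up with fewer fields than the header has columns.

```python
import re

# Data lines from the sample file (header omitted).
lines = [
    "   bar    anastasia      75        very lazy",
    "  home        jimmy           nothing to say",
    " beach                    2",
]

# Split every line on runs of two or more whitespace characters.
rows = [re.split(r"\s{2,}", line.strip()) for line in lines]

for fields in rows:
    # Rows with empty cells yield fewer than four fields.
    print(len(fields), fields)
```

Only the first row yields four fields; the other two come up short, so their values get assigned to the leftmost columns.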

At this point I don't know what to do. Is there an elegant way to do this in Pandas?

Recommended answer

You can use pd.read_fwf() to read your file, which is a file of fixed-width formatted lines, into a DataFrame.

df = pd.read_fwf('text_file.txt')

Demo

I use StringIO for the demo. You can pass your actual file name as the argument to the function call instead.

text = """
Places       Person  Number         Comments
   bar    anastasia      75        very lazy
  home        jimmy           nothing to say
 beach                    2                 
"""

from io import StringIO
import pandas as pd

df = pd.read_fwf(StringIO(text))

print(df)

  Places     Person  Number        Comments
0    bar  anastasia    75.0       very lazy
1   home      jimmy     NaN  nothing to say
2  beach        NaN     2.0             NaN
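
In the demo above, read_fwf() infers the column boundaries from the data. If that inference ever goes wrong on a similar file, one option is to pass the boundaries explicitly through the `colspecs` parameter. The following is a sketch under the assumption stated in the question, namely that each column runs from the end of the previous column name to the end of its own name; the variable names `header`, `ends`, `starts` and `colspecs` are only illustrative.

```python
import re
from io import StringIO

import pandas as pd

text = """\
Places       Person  Number         Comments
   bar    anastasia      75        very lazy
  home        jimmy           nothing to say
 beach                    2
"""

# The header line defines the right edge of every column.
header = text.splitlines()[0]
ends = [m.end() for m in re.finditer(r"\S+", header)]

# Each column starts where the previous one ended (the first starts at 0).
starts = [0] + ends[:-1]
colspecs = list(zip(starts, ends))  # half-open intervals [start, end)

# Pass the explicit boundaries instead of relying on inference.
df = pd.read_fwf(StringIO(text), colspecs=colspecs)

print(colspecs)
print(df)
```

This relies on the values being right-aligned under their column names, as they are in the sample; a file with left-aligned columns would need the boundaries computed from the start of each name instead.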
