Pandas read text file slicing columns according to header
Question
Imagine a text file that looks like this:
Places     Person  Number        Comments
   bar  anastasia      75       very lazy
  home      jimmy          nothing to say
 beach                  2
Consider the first line as the header, containing the column names I want for my pandas data frame. Note that there are empty cells, and that one column holds strings containing spaces. The file has a readable pattern: columns are separated by at least 2 spaces, and the contents of each column can basically be read from the end point of the previous column name to the end point of its own column name. There is no ambiguity in this.
If I do
df = pd.read_csv('text_file.txt')
I get a 3 x 1 data frame whose only column is called "Places Person Number Comments". So it fails to understand the table format.
If I do
df = pd.read_csv('text_file.txt', delim_whitespace = True)
It creates the right number of columns, but it cannot handle the spaces inside the values of Comments: it splits each comment and spreads the pieces over different cells, like so:
Places Person Number Comments
bar anastasia 75 very lazy
home jimmy nothing to say
beach 2 NaN NaN NaN
If I do
df = pd.read_csv('text_file.txt', sep = r'\s{2,}', engine = 'python')
It understands that only a run of two or more spaces separates one column from the next, so that part is correct. But it cannot tell that some cells are empty, and it wrongly shifts values from one column into another:
  Places     Person          Number   Comments
0    bar  anastasia              75  very lazy
1   home      jimmy  nothing to say       None
2  beach          2            None       None
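The displacement can be reproduced without pandas. The following is a minimal sketch, assuming sample lines laid out like the file in the question (the exact spacing is a reconstruction, not the original file): splitting on runs of two or more spaces discards the information about which column a value sat in, so a row with an empty cell simply yields fewer fields.

```python
import re

# Assumed sample lines; the exact spacing is illustrative.
lines = [
    "Places     Person  Number        Comments",
    "   bar  anastasia      75       very lazy",
    "  home      jimmy          nothing to say",
    " beach                  2",
]

# Split every line on runs of 2+ whitespace characters, as sep=r'\s{2,}' does.
fields = [re.split(r"\s{2,}", line.strip()) for line in lines]

for row in fields:
    # Rows with empty cells produce fewer fields, and the position is lost.
    print(len(row), row)
```

A parser that only sees these field lists has no choice but to fill the columns from left to right, which is why "nothing to say" lands under Number.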
At this point I don't know what to do. Is there an elegant way to do this in Pandas?
Recommended answer
You can use pd.read_fwf() to read your file, which is a file of fixed-width formatted lines, into a DataFrame.
df = pd.read_fwf('text_file.txt')
Demo
I use StringIO for the demo. You can pass your actual file name to the function call instead.
import pandas as pd
from io import StringIO

text = """
Places     Person  Number        Comments
   bar  anastasia      75       very lazy
  home      jimmy          nothing to say
 beach                  2
"""

# Column boundaries are inferred from the whitespace layout of the lines.
df = pd.read_fwf(StringIO(text))
print(df)
  Places     Person  Number        Comments
0    bar  anastasia    75.0       very lazy
1   home      jimmy     NaN  nothing to say
2  beach        NaN     2.0             NaN
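If the automatic inference ever guesses the boundaries wrongly, the rule stated in the question (each column runs from the end of the previous column name to the end of its own name) can be passed explicitly through the colspecs parameter of pd.read_fwf(). This is only a sketch under assumptions: the sample text below is a right-aligned reconstruction of the file, and deriving the boundaries from the header with a regular expression is one possible approach, not part of the original answer.

```python
import re
from io import StringIO

import pandas as pd

# Assumed sample: values are right-aligned under their column names.
text = (
    "Places     Person  Number        Comments\n"
    "   bar  anastasia      75       very lazy\n"
    "  home      jimmy          nothing to say\n"
    " beach                  2\n"
)

header = text.splitlines()[0]

# End offset of every column name in the header line.
ends = [m.end() for m in re.finditer(r"\S+", header)]

# Each column spans from the end of the previous name to the end of its own.
colspecs = list(zip([0] + ends[:-1], ends))

df = pd.read_fwf(StringIO(text), colspecs=colspecs)
print(df)
```

This only works when no column name contains a space and every value ends at or before the end of its column name; for a left-aligned file the start offsets of the names would be the natural boundaries instead.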