Pandas Read CSV with string delimiters via regex
Problem description
I am trying to import a weirdly formatted text file into a pandas DataFrame. Two example lines are below:
LOADED LANE 1 MAT. TYPE= 2 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.474 LOADEFFECT 5075. LMAX= 3643. COV= .13
LOADED LANE 1 MAT. TYPE= 3 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.515 LOADEFFECT10009. LMAX= 9732. COV= .08
First I tried the following:
df = pd.read_csv('beta.txt', header=None, delim_whitespace=True, usecols=[2,5,7,9,11,13,15,17,19])
This seemed to work fine, but it broke on lines like the second example above, where there is no whitespace after the LOADEFFECT string (you may need to scroll a bit to the right to see it in the example). I got a result like:
632 1 2 1 200 10 3.474 5075. 3643. 0.13
633 1 3 1 200 10 3.515 LMAX= COV= NaN
Then I decided to use a regular expression to define my delimiters. After many trial and error runs (I am no expert in regex), I managed to get close with the following line:
df = pd.read_csv('beta.txt', header=None, sep='/s +|LOADED LANE|MAT. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV=', engine='python')
This almost works, but for some reason it creates a NaN column at the very beginning:
632 NaN 1 2 1 200 10 3.474 5075 3643 0.13
633 NaN 1 3 1 200 10 3.515 10009 9732 0.08
At this point I think I can just delete that first column and get away with it. However, I wonder what the correct way would be to set up the regex so that it parses this text file in one shot. Any ideas? Beyond that, I am sure there is a smarter way to parse this file, and I would be glad to hear your recommendations.
Thanks!
Recommended answer
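import re
import pandas as pd

# Matches decimals such as 3.474 or .13, and plain integers such as 200
number = re.compile(r'\d*\.\d+|\d+')

rows = []
with open("parsing.txt") as f:  # open the text file
    for line in f:
        # Keep only the numeric tokens; the text labels are discarded.
        # A trailing dot ("200.") is dropped, so "200." is read as "200".
        rows.append(number.findall(line))

# findall returns strings, so convert every column to float
table = pd.DataFrame(rows).astype(float)
table  # DataFrame with one column per numeric value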
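As for the NaN column in the question: a likely explanation is that every line starts with a separator (LOADED LANE), so splitting on it leaves an empty field before the first value. The sketch below reproduces that with plain re.split on the two sample lines from the question. The names labels and split_line are made up for illustration, and the separator pattern is the one from the question with the dots escaped and the whitespace alternative left out (float() ignores surrounding spaces), so treat it as an assumption rather than the asker's exact setup.

```python
import re

# The two sample lines from the question.
lines = [
    "LOADED LANE 1 MAT. TYPE= 2 LEFFECT= 1 SPAN= 200. SPACE= 10. "
    "BETA= 3.474 LOADEFFECT 5075. LMAX= 3643. COV= .13",
    "LOADED LANE 1 MAT. TYPE= 3 LEFFECT= 1 SPAN= 200. SPACE= 10. "
    "BETA= 3.515 LOADEFFECT10009. LMAX= 9732. COV= .08",
]

# Label-only separator (hypothetical variant of the question's pattern,
# with the literal dots escaped).
labels = re.compile(
    r"LOADED LANE|MAT\. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV="
)


def split_line(line):
    """Split one line on the labels and return its numeric values."""
    fields = labels.split(line)
    # fields[0] is the empty string in front of "LOADED LANE"; skip it.
    # float() tolerates the spaces left around each value.
    return [float(field) for field in fields[1:]]


rows = [split_line(line) for line in lines]

print(repr(labels.split(lines[0])[0]))  # the empty leading field
print(rows[1])
```

Each entry in rows holds the nine values of one line, so passing rows to pd.DataFrame should give the table without the extra column. With the read_csv call from the question, dropping the first column after reading is a reasonable fix for the same reason.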