Python / Pandas CSV Parsing
Problem description
I used the JotForm Configurable List widget to collect data, but I am having trouble parsing the resulting data correctly. When I use
testdf = pd.read_csv("TestLoad.csv")
The data is read in as two records and the details are stored in the "Information" column. I understand why it is parsed the way it is, but I would like to break out the details into multiple records as noted below.
Any help would be appreciated.
Sample CSV
"Date","Information","Type"
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"
"2015-12-06","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","New"
Current Result
Date        Information                                                                      Type
2015-12-06  First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;  New
2015-12-06  First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;   New
Desired Result
Date        First  Last   School  Type
2015-12-06  Tom    Smith  MCAA    New
2015-12-06  Tammy  Smith  MCAA    New
2015-12-06  Jim    Jones  MCAA    New
2015-12-06  Jane   Jones  MCAA    New
This is useless text that is required to keep an answer from being downvoted by the moderators. Here is the data I used:
"Date","Information","Type"
"2015-12-07","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","Old"
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"
import pandas as pd
import numpy as np
import csv
import re
import itertools as it
import pprint
import datetime as dt

records = []  # Construct a complete record for each person

colon_pairs = r"""
    (\w+)   # Match a 'word' character, one or more times, captured in group 1, followed by...
    :       # A colon, followed by...
    \s*     # Whitespace, 0 or more times, followed by...
    (\w+)   # A 'word' character, one or more times, captured in group 2.
"""
colon_pairs_per_person = 3

with open("csv1.csv", encoding='utf-8') as f:
    next(f)  # skip header line
    record = {}
    for date, info, the_type in csv.reader(f):
        info_parser = re.finditer(colon_pairs, info, flags=re.X)
        for i, match_obj in enumerate(info_parser):
            key, val = match_obj.groups()
            record[key] = val
            if (i+1) % colon_pairs_per_person == 0:  # then done with info for a person
                record['Date'] = dt.datetime.strptime(date, '%Y-%m-%d')  # So that you can sort the DataFrame rows by date.
                record['Type'] = the_type
                records.append(record)
                record = {}

pprint.pprint(records)

df = pd.DataFrame(
    sorted(records, key=lambda record: record['Date'])
)
print(df)

df.set_index('Date', inplace=True)
print(df)
--output:--
[{'Date': datetime.datetime(2015, 12, 7, 0, 0),
  'First': 'Jim',
  'Last': 'Jones',
  'School': 'MCAA',
  'Type': 'Old'},
 {'Date': datetime.datetime(2015, 12, 7, 0, 0),
  'First': 'Jane',
  'Last': 'Jones',
  'School': 'MCAA',
  'Type': 'Old'},
 {'Date': datetime.datetime(2015, 12, 6, 0, 0),
  'First': 'Tom',
  'Last': 'Smith',
  'School': 'MCAA',
  'Type': 'New'},
 {'Date': datetime.datetime(2015, 12, 6, 0, 0),
  'First': 'Tammy',
  'Last': 'Smith',
  'School': 'MCAA',
  'Type': 'New'}]
        Date  First   Last School Type
0 2015-12-06    Tom  Smith   MCAA  New
1 2015-12-06  Tammy  Smith   MCAA  New
2 2015-12-07    Jim  Jones   MCAA  Old
3 2015-12-07   Jane  Jones   MCAA  Old

            First   Last School Type
Date
2015-12-06    Tom  Smith   MCAA  New
2015-12-06  Tammy  Smith   MCAA  New
2015-12-07    Jim  Jones   MCAA  Old
2015-12-07   Jane  Jones   MCAA  Old
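
As an alternative sketch (not part of the answer above), the same reshaping can probably be done inside pandas after read_csv, using str.split, explode and str.extract. It assumes every person's entry holds exactly the fields First, Last and School in that order, and that each value is a single word. The question's sample data is inlined through io.StringIO in place of TestLoad.csv so the snippet runs on its own.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch runs without a file.
csv_text = '''"Date","Information","Type"
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"
"2015-12-06","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","New"
'''

df = pd.read_csv(io.StringIO(csv_text))

# Strip the trailing ';' so the split produces no empty entry,
# then turn each cell into a list with one element per person.
df["Information"] = df["Information"].str.rstrip("; ").str.split(";")

# One row per person; Date and Type are repeated for each of them.
people = df.explode("Information").reset_index(drop=True)

# Assumed field layout: First, Last, School, always in this order.
fields = people["Information"].str.extract(
    r"First:\s*(?P<First>\w+),\s*Last:\s*(?P<Last>\w+),\s*School:\s*(?P<School>\w+)"
)

result = pd.concat([people[["Date"]], fields, people[["Type"]]], axis=1)
print(result)
```

Unlike the csv/re loop, this version hard-codes the field names in the pattern, so it would need adjusting if the widget is configured with different columns. Date stays a plain string here; pd.to_datetime(result["Date"]) would convert it if sorting by date is needed.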