How to parse complex text files using Python?
Question
I'm looking for a simple way of parsing complex text files into a pandas DataFrame. Below is a sample file, what I want the result to look like after parsing, and my current method.
Is there any way to make it more concise/faster/more pythonic/more readable?
I've also put this question on Code Review.
I ended up writing a blog article to explain this to beginners.
Here is a sample file:
Sample text
A selection of students from Riverdale High and Hogwarts took part in a quiz. This is a record of their scores.
School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7

Grade = 2
Student number, Name
0, Angela
1, Tristan
2, Aurora

Student number, Score
0, 6
1, 3
2, 9

School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna

Student number, Score
0, 8
1, 7

Grade = 2
Student number, Name
0, Harry
1, Hermione

Student number, Score
0, 5
1, 10

Grade = 3
Student number, Name
0, Fred
1, George

Student number, Score
0, 0
1, 0
Here is what I want the result to look like after parsing:
                                         Name  Score
School         Grade Student number
Hogwarts       1     0                  Ginny      8
                     1                   Luna      7
               2     0                  Harry      5
                     1               Hermione     10
               3     0                   Fred      0
                     1                 George      0
Riverdale High 1     0                 Phoebe      3
                     1                 Rachel      7
               2     0                 Angela      6
                     1                Tristan      3
                     2                 Aurora      9
Here is how I currently parse it:
import re
import pandas as pd


def parse(filepath):
    """
    Parse text at given filepath

    Parameters
    ----------
    filepath : str
        Filepath for file to be parsed

    Returns
    -------
    data : pd.DataFrame
        Parsed data
    """
    data = []
    with open(filepath, 'r') as file:
        line = file.readline()
        while line:
            reg_match = _RegExLib(line)

            if reg_match.school:
                school = reg_match.school.group(1)

            if reg_match.grade:
                grade = reg_match.grade.group(1)
                grade = int(grade)

            if reg_match.name_score:
                value_type = reg_match.name_score.group(1)
                line = file.readline()
                # read table rows until a blank line (or end of file)
                while line.strip():
                    number, value = line.strip().split(',')
                    value = value.strip()
                    dict_of_data = {
                        'School': school,
                        'Grade': grade,
                        'Student number': number,
                        value_type: value
                    }
                    data.append(dict_of_data)
                    line = file.readline()

            line = file.readline()

        data = pd.DataFrame(data)
        data.set_index(['School', 'Grade', 'Student number'], inplace=True)
        # consolidate df to remove nans
        data = data.groupby(level=data.index.names).first()
        # upgrade Score from float to integer
        data = data.apply(pd.to_numeric, errors='ignore')
    return data


class _RegExLib:
    """Set up regular expressions"""
    # use https://regexper.com to visualise these if required
    _reg_school = re.compile('School = (.*)\n')
    _reg_grade = re.compile('Grade = (.*)\n')
    _reg_name_score = re.compile('(Name|Score)')

    def __init__(self, line):
        # try each of the regular expressions against the line
        self.school = self._reg_school.match(line)
        self.grade = self._reg_grade.match(line)
        self.name_score = self._reg_name_score.search(line)


if __name__ == '__main__':
    filepath = 'sample.txt'
    data = parse(filepath)
    print(data)
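The "consolidate df to remove nans" step is the least obvious part of the code above. The parse loop emits two records per student, one holding only a Name and one holding only a Score, so the raw frame is half NaN. A minimal sketch of how `groupby(...).first()` folds them together, using a few made-up records in the same shape (the `records`, `raw`, and `merged` names are only for illustration):

```python
import pandas as pd

# Two records per student, as the parse loop produces them:
# one carrying only Name, one carrying only Score.
records = [
    {'School': 'Hogwarts', 'Grade': 1, 'Student number': '0', 'Name': 'Ginny'},
    {'School': 'Hogwarts', 'Grade': 1, 'Student number': '1', 'Name': 'Luna'},
    {'School': 'Hogwarts', 'Grade': 1, 'Student number': '0', 'Score': '8'},
    {'School': 'Hogwarts', 'Grade': 1, 'Student number': '1', 'Score': '7'},
]

keys = ['School', 'Grade', 'Student number']
raw = pd.DataFrame(records).set_index(keys)
# raw has four rows; each row has a NaN in either Name or Score.

# first() keeps the first non-null value per column within each group,
# so the Name row and the Score row of a student collapse into one.
merged = raw.groupby(level=keys).first()
print(merged)
```

After this step there is one row per (School, Grade, Student number) with both columns filled.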
Recommended answer
Update 2019 (PEG parser):
This answer has received quite some attention, so I felt I should add another possibility, namely a parsing option. Here we could use a PEG parser instead (e.g. parsimonious) in combination with a NodeVisitor class:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import pandas as pd

grammar = Grammar(
    r"""
    schools         = (school_block / ws)+

    school_block    = school_header ws grade_block+
    grade_block     = grade_header ws name_header ws (number_name)+ ws score_header ws (number_score)+ ws?

    school_header   = ~"^School = (.*)"m
    grade_header    = ~"^Grade = (\d+)"m

    name_header     = "Student number, Name"
    score_header    = "Student number, Score"

    number_name     = index comma name ws
    number_score    = index comma score ws

    comma           = ws? "," ws?

    index           = number+
    score           = number+

    number          = ~"\d+"
    name            = ~"[A-Z]\w+"
    ws              = ~"\s*"
    """
)

# `data` is not defined in this snippet; it is expected to hold the text to parse as one string
tree = grammar.parse(data)


class SchoolVisitor(NodeVisitor):
    output, names = ([], [])
    current_school, current_grade = None, None

    def _getName(self, idx):
        # look up the name recorded for this student number in the current grade
        for index, name in self.names:
            if index == idx:
                return name

    def generic_visit(self, node, visited_children):
        return node.text or visited_children

    def visit_school_header(self, node, children):
        self.current_school = node.match.group(1)

    def visit_grade_header(self, node, children):
        self.current_grade = node.match.group(1)
        self.names = []

    def visit_number_name(self, node, children):
        index, name = None, None
        for child in node.children:
            if child.expr.name == 'name':
                name = child.text
            elif child.expr.name == 'index':
                index = child.text

        self.names.append((index, name))

    def visit_number_score(self, node, children):
        index, score = None, None
        for child in node.children:
            if child.expr.name == 'index':
                index = child.text
            elif child.expr.name == 'score':
                score = child.text

        name = self._getName(index)

        # build the entire entry
        entry = (self.current_school, self.current_grade, index, name, score)
        self.output.append(entry)


sv = SchoolVisitor()
sv.visit(tree)

df = pd.DataFrame.from_records(sv.output, columns = ['School', 'Grade', 'Student number', 'Name', 'Score'])
print(df)
Regex option (original answer)
Well then, watching Lord of the Rings for the xth time, I had to bridge some time until the very finale:
Broken down, the idea is to split the problem up into several smaller problems:
- Separate each school
- ... each grade
- ... students and scores
- ... bind them together in a dataframe afterwards
School part (see a demo on regex101.com)
^
School\s*=\s*(?P<school_name>.+)
(?P<school_content>[\s\S]+?)
(?=^School|\Z)
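To see what the school pattern captures, here is a small sketch that runs it on a shortened, made-up input. The pattern is written on one line so that re.VERBOSE is not needed; the `text` value and the `schools` dict are illustrative and not part of the original answer.

```python
import re

# The school pattern, written without re.VERBOSE.
rx_school = re.compile(
    r'^School\s*=\s*(?P<school_name>.+)'
    r'(?P<school_content>[\s\S]+?)'
    r'(?=^School|\Z)',
    re.MULTILINE,
)

# A shortened, made-up input in the same layout as the sample file.
text = (
    "School = Riverdale High\n"
    "Grade = 1\n"
    "Student number, Name\n"
    "0, Phoebe\n"
    "School = Hogwarts\n"
    "Grade = 1\n"
    "Student number, Name\n"
    "0, Ginny\n"
)

# Map each school name to the raw text that belongs to it.
schools = {m.group('school_name'): m.group('school_content')
           for m in rx_school.finditer(text)}
for name, content in schools.items():
    print(repr(name), repr(content))
```

The lazy `[\s\S]+?` keeps growing until the lookahead sees either the next line starting with "School" or the end of the string, which is what cuts the text into one chunk per school.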
Grade part (another demo on regex101.com)
^
Grade\s*=\s*(?P<grade>.+)
(?P<students>[\s\S]+?)
(?=^Grade|\Z)
Student/score part (the last demo on regex101.com):
^
Student\ number,\ Name[\n\r]
(?P<student_names>(?:^\d+.+[\n\r])+)
\s*
^
Student\ number,\ Score[\n\r]
(?P<student_scores>(?:^\d+.+[\n\r])+)
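The student/score pattern captures the two tables of one grade as two multi-line strings, which then have to be paired up row by row. A minimal sketch of that pairing on a made-up grade block; it uses `splitlines()` and a plain loop instead of the nested generator in the answer, purely for readability, and `grade_block` and `rows` are hypothetical names:

```python
import re

# The student/score pattern, written without re.VERBOSE.
rx_student_score = re.compile(
    r'^Student number, Name[\n\r]'
    r'(?P<student_names>(?:^\d+.+[\n\r])+)'
    r'\s*'
    r'^Student number, Score[\n\r]'
    r'(?P<student_scores>(?:^\d+.+[\n\r])+)',
    re.MULTILINE,
)

# Made-up content of a single grade.
grade_block = (
    "Student number, Name\n"
    "0, Angela\n"
    "1, Tristan\n"
    "\n"
    "Student number, Score\n"
    "0, 6\n"
    "1, 3\n"
)

m = rx_student_score.search(grade_block)
names = m.group('student_names').splitlines()
scores = m.group('student_scores').splitlines()

# Pair the n-th name line with the n-th score line.
rows = []
for name_line, score_line in zip(names, scores):
    number, name = name_line.split(", ")
    _, score = score_line.split(", ")
    rows.append((number, name, score))
print(rows)
```

Note that this pairs rows by position, so it assumes both tables list the students in the same order.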
The rest is a generator expression which is then fed into the DataFrame constructor (along with the column names).
The code:
import pandas as pd, re

rx_school = re.compile(r'''
    ^
    School\s*=\s*(?P<school_name>.+)
    (?P<school_content>[\s\S]+?)
    (?=^School|\Z)
''', re.MULTILINE | re.VERBOSE)

rx_grade = re.compile(r'''
    ^
    Grade\s*=\s*(?P<grade>.+)
    (?P<students>[\s\S]+?)
    (?=^Grade|\Z)
''', re.MULTILINE | re.VERBOSE)

rx_student_score = re.compile(r'''
    ^
    Student\ number,\ Name[\n\r]
    (?P<student_names>(?:^\d+.+[\n\r])+)
    \s*
    ^
    Student\ number,\ Score[\n\r]
    (?P<student_scores>(?:^\d+.+[\n\r])+)
''', re.MULTILINE | re.VERBOSE)

# `string` is not defined in this snippet; it is expected to hold the file contents as one string
result = ((school.group('school_name'), grade.group('grade'), student_number, name, score)
    for school in rx_school.finditer(string)
    for grade in rx_grade.finditer(school.group('school_content'))
    for student_score in rx_student_score.finditer(grade.group('students'))
    for student in zip(student_score.group('student_names')[:-1].split("\n"), student_score.group('student_scores')[:-1].split("\n"))
    for student_number in [student[0].split(", ")[0]]
    for name in [student[0].split(", ")[1]]
    for score in [student[1].split(", ")[1]]
)

df = pd.DataFrame(result, columns = ['School', 'Grade', 'Student number', 'Name', 'Score'])
print(df)
Condensed:
rx_school = re.compile(r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)', re.MULTILINE)
rx_grade = re.compile(r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)', re.MULTILINE)
rx_student_score = re.compile(r'^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)', re.MULTILINE)
This yields
            School  Grade  Student number      Name  Score
0   Riverdale High      1               0    Phoebe      3
1   Riverdale High      1               1    Rachel      7
2   Riverdale High      2               0    Angela      6
3   Riverdale High      2               1   Tristan      3
4   Riverdale High      2               2    Aurora      9
5         Hogwarts      1               0     Ginny      8
6         Hogwarts      1               1      Luna      7
7         Hogwarts      2               0     Harry      5
8         Hogwarts      2               1  Hermione     10
9         Hogwarts      3               0      Fred      0
10        Hogwarts      3               1    George      0
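The flat frame above holds every value as a string (regex groups are text) and uses a plain integer index, whereas the question asked for a frame indexed by School, Grade, and Student number. One possible way to get from one to the other is sketched below on a few hand-copied rows; the `flat` and `indexed` names are only for illustration, and converting the student number to an integer is a choice made here, not something the answer does.

```python
import pandas as pd

# A few rows in the flat, all-string form shown above.
flat = pd.DataFrame(
    [
        ('Riverdale High', '1', '0', 'Phoebe', '3'),
        ('Riverdale High', '1', '1', 'Rachel', '7'),
        ('Hogwarts', '2', '0', 'Harry', '5'),
        ('Hogwarts', '2', '1', 'Hermione', '10'),
    ],
    columns=['School', 'Grade', 'Student number', 'Name', 'Score'],
)

# Convert the numeric columns from strings to integers ...
flat = flat.astype({'Grade': int, 'Student number': int, 'Score': int})

# ... then index by School / Grade / Student number and sort the index.
indexed = flat.set_index(['School', 'Grade', 'Student number']).sort_index()
print(indexed)
```

Sorting the index puts Hogwarts before Riverdale High, matching the ordering in the desired output of the question.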
As for timing, this is the result of running it ten thousand times:
import timeit
print(timeit.timeit(makedf, number=10**4))
# 11.918397722000009 s
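The timing call refers to `makedf`, which is not defined anywhere in the answer. A plausible reading is that it wraps the regex code in a function that builds the DataFrame. The sketch below shows one way such a wrapper could look, assuming a shortened stand-in for the sample file; the body of `makedf`, the `string` value, and the run count of 100 are all assumptions, so the measured time is not comparable to the figure quoted above.

```python
import re
import timeit

import pandas as pd

rx_school = re.compile(
    r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)',
    re.MULTILINE,
)
rx_grade = re.compile(
    r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)',
    re.MULTILINE,
)
rx_student_score = re.compile(
    r'^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*'
    r'^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)',
    re.MULTILINE,
)

# Shortened stand-in for the sample file (an assumption for this sketch).
string = (
    "School = Hogwarts\n"
    "Grade = 3\n"
    "Student number, Name\n"
    "0, Fred\n"
    "1, George\n"
    "\n"
    "Student number, Score\n"
    "0, 0\n"
    "1, 0\n"
)

COLUMNS = ['School', 'Grade', 'Student number', 'Name', 'Score']


def makedf():
    """Hypothetical wrapper: run the three regexes and build the DataFrame."""
    rows = []
    for school in rx_school.finditer(string):
        for grade in rx_grade.finditer(school.group('school_content')):
            for block in rx_student_score.finditer(grade.group('students')):
                names = block.group('student_names').splitlines()
                scores = block.group('student_scores').splitlines()
                for name_line, score_line in zip(names, scores):
                    number, name = name_line.split(", ")
                    score = score_line.split(", ")[1]
                    rows.append((school.group('school_name'), grade.group('grade'),
                                 number, name, score))
    return pd.DataFrame(rows, columns=COLUMNS)


df = makedf()
# timeit calls makedf() `number` times and returns the total time in seconds.
elapsed = timeit.timeit(makedf, number=100)
print(df)
print(f"{elapsed:.3f} s for 100 runs")
```

Compiling the patterns once at module level keeps the compilation cost out of the timed function.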