Tips for reading in a complex file - Python


Question

I have complex, variable text files that I want to read into Python, but I'm not sure what the best strategy would be. I'm not looking for anyone to write code for me, just some tips about which modules or techniques would best suit my needs.

The files look like this:

Program
Username: X    Laser: X     Em: X

exp 1
    sample 1
        Time: X    Notes: X
        Read 1 X data
        Read 2 X data
        # unknown number of reads
    sample 2
        Time: X    Notes: X
        Read 1 X data
        ...
    # Unknown number of samples

exp 2
    sample 1
    ...
# Unknown number of experiments, samples and reads
# The 4 spaces between certain words represent tabs

To analyse this data I need to get the data for each read and know which sample and experiment it came from. I can also change the output file format, but I think the way I have written it here is the easiest to read.

The best way I can think of to read this file into Python is to go through it line by line and search for keywords with regular expressions. For example, search the line for the "exp" keyword and record the number after it, then search for "sample" in the next line, and so on. Of course, this would not work if a keyword were used in the "Notes" section.

So I'm kind of stumped as to what would best suit my needs (it's hard to use something if you don't know it exists!).

Thanks for your time.

Recommended answer

This is a typical task for a syntactic analyzer (a parser). In this case, since

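  • the lexical constructs do not cross line boundaries and there is a single construct ("statement") per line; in other words, each line is one statement
  • the full syntax of a single line can be covered by a set of regexes
  • the structure of compound constructs (= entities that join multiple "statements" into something bigger) is simple and straightforward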

a (relatively) simple scannerless parser based on lines, a DFA and the aforementioned set of regexes can be applied:

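  • set the initial parser state (= the current position relative to the various entities being tracked) and the parse tree (= a data structure that represents the information from the file in a convenient way)
  • for each line
    • classify it, e.g. by matching it against the regexes applicable in the current state
    • use the matched regex's groups to get the meaningful parts of the line's statement
    • using these parts, update the state and the parse tree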

See "get the path in a file inside {} by python" for an example. There, I do not construct a parse tree (it wasn't needed) but only track the current state.
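For the file layout in the question, the steps above might be sketched roughly as follows. This is only an illustration under assumptions: the four regexes, the `parse` function and the nested-dict parse tree are hypothetical names chosen here, and everything after the read number on a `Read` line is assumed to be that read's data. Because each regex is matched at the start of the stripped line, a keyword such as "exp" appearing inside the notes does not get misclassified.

```python
import re

# Hypothetical patterns, one per statement type; adjust them to the real file.
# re.match anchors at the start of the line, so keywords inside Notes are ignored.
EXP = re.compile(r"exp\s+(\d+)$")
SAMPLE = re.compile(r"sample\s+(\d+)$")
META = re.compile(r"Time:\s*(.*?)\s+Notes:\s*(.*)$")
READ = re.compile(r"Read\s+(\d+)\s+(.*)$")


def parse(lines):
    """Return {exp: {sample: {"time": ..., "notes": ..., "reads": {n: data}}}}."""
    tree = {}            # parse tree
    exp = sample = None  # parser state: current experiment and sample
    for lineno, raw in enumerate(lines, 1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no data
        if m := EXP.match(line):
            exp, sample = int(m.group(1)), None
            tree[exp] = {}
        elif exp is None:
            continue  # header lines before the first experiment
        elif m := SAMPLE.match(line):
            sample = int(m.group(1))
            tree[exp][sample] = {"time": None, "notes": None, "reads": {}}
        elif sample is not None and (m := META.match(line)):
            tree[exp][sample]["time"] = m.group(1)
            tree[exp][sample]["notes"] = m.group(2)
        elif sample is not None and (m := READ.match(line)):
            tree[exp][sample]["reads"][int(m.group(1))] = m.group(2)
        else:
            # A line that is not valid in the current state
            raise ValueError(f"line {lineno}: unexpected {raw!r}")
    return tree


if __name__ == "__main__":
    text = (
        "Program\n"
        "Username: a\tLaser: b\tEm: c\n"
        "\n"
        "exp 1\n"
        "\tsample 1\n"
        "\t\tTime: 10:30\tNotes: redo exp 2 later\n"
        "\t\tRead 1 0.5 0.7\n"
    )
    # Flatten the tree so each read carries its experiment and sample
    for e, samples in parse(text.splitlines()).items():
        for s, info in samples.items():
            for r, data in info["reads"].items():
                print(e, s, r, data)
```

The walrus operator (`:=`) requires Python 3.8 or later. Raising on an unexpected line is a design choice made for this sketch; it surfaces format surprises early instead of silently dropping data.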
