Python regex for reading CSV-like rows

Question

I want to parse incoming CSV-like rows of data. Values are separated with commas (and there could be leading and trailing whitespaces around commas), and can be quoted either with ' or with ". For example - this is a valid row:

    data1, data2  ,"data3'''",  'data4""',,,data5,

but this one is malformed:

    data1, data2, da"ta3", 'data4',

-- quotation marks can only be preceded or followed by spaces.

Such malformed rows should be recognized. Ideally the malformed value within the row would be marked somehow, but it is also acceptable if the regex simply fails to match the whole row.

I'm trying to write a regex able to parse this, using either match() or findall(), but every single regex I come up with has problems with edge cases.

So, maybe someone with experience in parsing something similar could help me on this? (Or maybe this is too complex for regex and I should just write a function)

The csv module is not of much use here:

    >>> import csv
    >>> from io import StringIO
    >>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
    [['2', ' "dat', 'a1"', " 'dat", "a2'", '']]

    >>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
    [['2', 'dat,a1', "'dat", "a2'", '']]

-- unless it can be tuned?
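
On the tuning question: `csv.reader` does accept `skipinitialspace` and `quotechar` arguments, but `quotechar` takes a single character, so one reader cannot treat both `'` and `"` as quotes. The following is only a quick check of that behaviour on the sample row above (the variable names are made up for illustration):

```python
import csv
from io import StringIO

row = '''2, "dat,a1", 'dat,a2','''

# skipinitialspace=True drops whitespace right after each delimiter,
# so the double-quoted value is now recognized as quoted.
double_quoted = next(csv.reader(StringIO(row), skipinitialspace=True))

# quotechar must be exactly one character; switching it to ' fixes the
# single-quoted value but breaks the double-quoted one.
single_quoted = next(
    csv.reader(StringIO(row), quotechar="'", skipinitialspace=True)
)

print(double_quoted)
print(single_quoted)
```

Neither pass handles both quoting styles, and whitespace between a closing quote and the next comma is not stripped either, so tuning alone does not seem to cover the stated format.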

A few language edits - I hope it's more valid English now

Thank you for all the answers. I'm now pretty sure that a regular expression is not such a good idea here, as (1) covering all edge cases can be tricky and (2) the writer output is not regular. That said, I've decided to check out the pyparsing package that was mentioned and either use it or write a custom FSM-like parser.
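
A custom FSM-like parser of the kind mentioned here could look roughly like the sketch below. It is not taken from the thread: the function name `parse_row` and the state names are hypothetical, and it assumes there is no escaping, so a value quoted with `"` cannot itself contain `"` (and likewise for `'`).

```python
QUOTES = "'\""


def parse_row(line):
    """Split one CSV-like line into values; raise ValueError on a malformed value."""
    values = []
    buf = []
    state = "start"  # one of: start, unquoted, quoted, after_quote
    quote = None     # the quote character that opened the current value

    for pos, ch in enumerate(line):
        if state == "start":
            if ch in QUOTES:
                quote, state = ch, "quoted"
            elif ch == ",":
                values.append("")          # empty value
            elif not ch.isspace():
                buf.append(ch)
                state = "unquoted"
        elif state == "unquoted":
            if ch == ",":
                values.append("".join(buf).rstrip())
                buf, state = [], "start"
            elif ch in QUOTES:
                raise ValueError(f"unexpected quote at column {pos}")
            else:
                buf.append(ch)
        elif state == "quoted":
            if ch == quote:
                state = "after_quote"
            else:
                buf.append(ch)             # commas and the other quote are literal
        else:  # after_quote: only whitespace may precede the next comma
            if ch == ",":
                values.append("".join(buf))
                buf, state = [], "start"
            elif not ch.isspace():
                raise ValueError(f"unexpected character at column {pos}")

    # Flush whatever value was in progress when the line ended.
    if state == "quoted":
        raise ValueError("unterminated quoted value")
    if state == "unquoted":
        values.append("".join(buf).rstrip())
    else:
        values.append("".join(buf))        # "" when the line ends right after a comma
    return values


valid = "data1, data2  ,\"data3'''\",  'data4\"\"',,,data5,"
print(parse_row(valid))
```

Because the error carries the column number, the caller can report which value in the row is malformed instead of only rejecting the row as a whole.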

Recommended answer

Although it would likely be possible with some combination of pre-processing, use of csv module, post-processing, and use of regular expressions, your stated requirements do not fit well with the design of the csv module, nor possibly with regular expressions (depending on the complexity of nested quotation marks that you might have to handle).
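
For the simple case in the question (no escaped or nested quotes), a regex applied one value at a time may be enough. This is a minimal sketch under that assumption, not a general solution; the names `VALUE` and `split_row` are invented for illustration.

```python
import re

DQ = r'"([^"]*)"'                              # double-quoted value
SQ = r"'([^']*)'"                              # single-quoted value
BARE = r'''([^\s,'"](?:[^,'"]*[^\s,'"])?)?'''  # unquoted value, possibly empty
# One value with optional surrounding whitespace; the lookahead requires
# a comma or the end of the line right after it.
VALUE = re.compile(rf"\s*(?:{DQ}|{SQ}|{BARE})\s*(?=,|$)")


def split_row(line):
    """Return the values of one row, or raise ValueError at the first bad value."""
    values, pos = [], 0
    while True:
        m = VALUE.match(line, pos)
        if m is None:
            raise ValueError(f"malformed value starting at column {pos}")
        # Exactly one group is set for a non-empty value; none for an empty one.
        values.append(next((g for g in m.groups() if g is not None), ""))
        pos = m.end()
        if pos == len(line):
            return values
        pos += 1  # step over the comma seen by the lookahead


print(split_row("data1, data2  ,\"data3'''\",  'data4\"\"',,,data5,"))
```

Matching value by value, rather than the whole row with `findall()`, keeps malformed values from being silently skipped and gives a column to report.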

In complex parsing cases, pyparsing is always a good package to fall back on. If this isn't a one-off situation, it will likely produce the most straightforward and maintainable result, at the cost of possibly a little extra effort up front. Consider that investment to be paid back quickly, however, as you save yourself the extra effort of debugging the regex solutions to handle corner cases...

You can likely find examples of pyparsing-based CSV parsing easily, with this question maybe enough to get you started.
