Verify CSV against given format


Question

I am expecting users to upload a CSV file, with a maximum size of 1 MB, to a web form. The file should fit a given format similar to:

"<String>","<String>",<Int>,<Float>

The file will be processed later. I would like to verify that it fits the specified format, so that the program that later uses the file doesn't receive unexpected input and there are no security concerns (say, an injection attack against the parsing script, which does some calculations and a DB insert).

(1) What would be the best way to do this that is both fast and thorough? From what I've researched, I could go the regex route or use something more like this. I've looked at the Python csv module, but it doesn't appear to have any built-in verification.

(2) Assuming I go for a regex, can anyone direct me towards the best way to do this? Do I match illegal characters and reject on that (e.g. no '/', '\', '<', '>', '{', '}', etc.), or match only legal input, e.g. [a-zA-Z0-9]{1,10} for the string component? I'm not too familiar with regular expressions, so pointers or examples would be appreciated.

EDIT: Strings should contain no commas or quotes; each would just contain a name (i.e. first name, last name). And yes, I forgot to add that they would be double-quoted.

EDIT #2: Thanks for all the answers. Cutplace is quite interesting but is a standalone tool. I decided to go with pyparsing in the end because it gives more flexibility should I add more formats.

Recommended answer

Pyparsing will process this data, and will be tolerant of unexpected things like spaces before and after commas, commas within quotes, etc. (The csv module is too, but regex solutions force you to add "\s*" bits all over the place.)

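from pyparsing import (LineEnd, ParseException, Regex, Suppress,
                       dblQuotedString, removeQuotes)

# Integer field: optional minus sign plus digits, converted to int at parse time.
integer = Regex(r"-?\d+").setName("integer")
integer.setParseAction(lambda tokens: int(tokens[0]))
# Float field: the decimal point is required, so a bare "3" is rejected.
floatnum = Regex(r"-?\d+\.\d*").setName("float")
floatnum.setParseAction(lambda tokens: float(tokens[0]))
# Strip the surrounding double quotes from string fields.
dblQuotedString.setParseAction(removeQuotes)
COMMA = Suppress(',')
# One record: "<String>","<String>",<Int>,<Float>, followed by end of line.
validLine = dblQuotedString + COMMA + dblQuotedString + COMMA + \
        integer + COMMA + floatnum + LineEnd()

tests = """\
"good data","good2",100,3.14
"good data" , "good2", 100, 3.14
bad, "good","good2",100,3.14
"bad","good2",100,3
"bad","good2",100.5,3
""".splitlines()

for t in tests:
    print(t)
    try:
        print(validLine.parseString(t).asList())
    except ParseException as pe:
        # Mark the position where parsing failed, then show the reason.
        print(pe.markInputline('?'))
        print(pe.msg)
    print()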

Prints:

"good data","good2",100,3.14
['good data', 'good2', 100, 3.14]

"good data" , "good2", 100, 3.14
['good data', 'good2', 100, 3.14]

bad, "good","good2",100,3.14
?bad, "good","good2",100,3.14
Expected string enclosed in double quotes

"bad","good2",100,3
"bad","good2",100,?3
Expected float

"bad","good2",100.5,3
"bad","good2",100?.5,3
Expected ","

You will probably be stripping those quotation marks off at some future time; pyparsing can do that at parse time by adding:

dblQuotedString.setParseAction(removeQuotes)

If you want to add comment support to your input file, say a '#' followed by the rest of the line, you can do this:

comment = '#' + restOfLine
validLine.ignore(comment)
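
As a rough illustration of the comment support, the sketch below rebuilds the same grammar under a `pp` import alias and parses one line that carries a trailing comment. The sample row and the `quoted` and `valid_line` names are made up for this example, and it assumes a pyparsing release that still provides the camelCase names used in this answer.

```python
import pyparsing as pp

# Same field definitions as in the answer, written against the pp alias.
integer = pp.Regex(r"-?\d+").setParseAction(lambda t: int(t[0]))
floatnum = pp.Regex(r"-?\d+\.\d*").setParseAction(lambda t: float(t[0]))
# Copy the shared dblQuotedString so the module-level object is left untouched.
quoted = pp.dblQuotedString.copy().setParseAction(pp.removeQuotes)
COMMA = pp.Suppress(",")

valid_line = (quoted + COMMA + quoted + COMMA
              + integer + COMMA + floatnum + pp.LineEnd())

# Treat '#' and everything after it on the line as ignorable text.
comment = "#" + pp.restOfLine
valid_line.ignore(comment)

# Hypothetical sample row with a trailing comment.
row = valid_line.parseString('"widget","blue",100,3.14  # checked by hand').asList()
print(row)
```

The comment text is skipped before `LineEnd()` is tested, so only the four data fields should end up in the result.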

You can also add names to these fields, so that you can access them by name instead of index position (which I find gives more robust code in light of changes down the road):

validLine = dblQuotedString("key") + COMMA + dblQuotedString("title") + COMMA + \
        integer("qty") + COMMA + floatnum("price") + LineEnd()

And your post-processing code can then do this:

data = validLine.parseString(t)
print("%(key)s: %(title)s, %(qty)d in stock at $%(price).2f" % data)
print(data.qty*data.price)
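
For comparison, the whitelist idea from part (2) of the question can be sketched with only the standard library's csv and re modules. This is not the pyparsing grammar above. The function name `validate_csv_text`, the name pattern, and the sample rows are all assumptions made for illustration. Note also that the csv module does not check whether a string field was actually quoted, so this is looser than the grammar in that respect.

```python
import csv
import io
import re

MAX_BYTES = 1024 * 1024  # the 1 MB upload limit mentioned in the question

# Whitelist patterns, one per column (hypothetical; adjust to the real format).
NAME = re.compile(r"[A-Za-z][A-Za-z .'-]{0,49}")
INTEGER = re.compile(r"-?\d+")
FLOAT = re.compile(r"-?\d+\.\d*")  # decimal point required, as in the answer
COLUMNS = [NAME, NAME, INTEGER, FLOAT]


def validate_csv_text(text):
    """Return (rows, errors); rows holds converted values for valid lines."""
    if len(text.encode("utf-8")) > MAX_BYTES:
        return [], ["file larger than 1 MB"]
    rows, errors = [], []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for lineno, fields in enumerate(reader, start=1):
        if not fields:  # skip blank lines
            continue
        if len(fields) != len(COLUMNS):
            errors.append(f"line {lineno}: expected {len(COLUMNS)} fields, "
                          f"got {len(fields)}")
            continue
        values = [f.strip() for f in fields]
        # Every field must match its whitelist pattern in full.
        bad = [i for i, (pattern, value) in enumerate(zip(COLUMNS, values), 1)
               if not pattern.fullmatch(value)]
        if bad:
            errors.append(f"line {lineno}: invalid value in column(s) {bad}")
            continue
        rows.append([values[0], values[1], int(values[2]), float(values[3])])
    return rows, errors


sample = (
    '"Ada","Lovelace",100,3.14\n'
    '"Bob","Smith",100,3\n'
    '"Eve","Jones",1,2.5,extra\n'
)
rows, errors = validate_csv_text(sample)
print(rows)
print(errors)
```

Matching only what is allowed, and rejecting everything else, avoids having to enumerate every dangerous character. Any database insert done later should still use parameterized queries rather than rely on this check alone.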
