What's a semantically-correct way to parse CSV from SQL Server 2008?


Problem description


I got a CSV dump from SQL Server 2008 that has lines like this:

Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00

parse_dbenhur is pretty, but can it be rewritten to support the presence of both commas and quotes? parse_ugly is, well, ugly.

# @dbenhur's excellent answer, which works 100% for what I originally asked for
SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/
FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/
def parse_dbenhur(line)
  line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
end

def parse_ugly(line)
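  # Heuristic: split on every comma, then walk the pieces and glue a piece onto
  # the previous field while an "open quote" flag is set. Glued pieces are
  # rejoined without the comma, which is why embedded commas get dropped.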
  dumb_fields = line.chomp.split(',').map { |v| v.gsub(/\s+/, ' ') }
  fields = []
  open = false
  dumb_fields.each_with_index do |v, i|
    open ? fields.last.concat(v) : fields.push(v)
    open = (v.start_with?('"') and (v.count('"') % 2 == 1) and dumb_fields[i+1] and dumb_fields[i+1].start_with?(' ')) || (open and !v.end_with?('"'))
  end
  fields.map { |v| (v.start_with?('"') and v.end_with?('"')) ? v[1..-2] : v }
end

lines = []
lines << 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00'
lines << 'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00'
lines << 'Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00'

require 'csv'
lines.each do |line|
  puts
  puts line
  begin
    c = CSV.parse_line(line)
    puts "#{c.to_csv.chomp} (size #{c.length})"
  rescue
    puts "FasterCSV says: #{$!}"
  end
  a = parse_ugly(line)
  puts "#{a.to_csv.chomp} (size #{a.length})"
  b = parse_dbenhur(line)
  puts "#{b.to_csv.chomp} (size #{b.length})"
end

Here's the output when I run it:

Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
FasterCSV says: Illegal quoting in line 1.
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)

Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
FasterCSV says: Unclosed quoted field on line 1.
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)

Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS""",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS""""",1997-05-15 00:00:00 (size 5)

Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS"" FOOBAR",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS"" FOOBAR""",1997-05-15 00:00:00 (size 5)

Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00 (size 4)
Construction,198120036B,"""""MERITER""","""DO IT CTR"""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)
Construction,198120036B,"""""""MERITER""""","""""DO IT CTR"""""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)

UPDATE

Note that the CSV uses double quotes when a field has a comma.

UPDATE 2

It's fine if commas are stripped out of the fields in question... my parse_ugly method doesn't preserve them.

UPDATE 3

I learned from the client that it's SQL Server 2008 that's exporting this strange CSV - which has been reported to Microsoft here and here.

UPDATE 4

@dbenhur's answer worked perfectly for what I originally asked for, but pointed out that I neglected to show lines with both commas and quotes. I will accept @dbenhur's answer - but I'm hoping it can be improved to work on all the lines above.

HOPEFULLY FINAL UPDATE

This code works (and I would consider it "semantically correct"):

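# My reading of what each pattern does:
# QUOTED   captures the body of a double-quoted field; the body may contain
#          anything except a quote, or a doubled quote ("") not followed by
#          yet another quote
# SEPQ     the comma that ends a quoted field -- it must not be followed by a space
# UNQUOTED captures a run of anything except commas
# SEPU     the comma that ends an unquoted field -- it must be followed by a
#          non-space, or by spaces plus quote-free, comma-free text and another comma
# FIELD    a quoted field with its separator, an unquoted field with its
#          separator, or the end of the string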
QUOTED = /"((?:[^"]|(?:""(?!")))*)"/
SEPQ = /,(?! )/
UNQUOTED = /([^,]*)/
SEPU = /,(?=(?:[^ ]|(?: +[^",]*,)))/
FIELD = /(?:#{QUOTED}#{SEPQ})|(?:#{UNQUOTED}#{SEPU})|\Z/

def parse_sql_server_2008_csv_line(line)
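  # Pick whichever alternative captured, then turn any embedded commas into
  # spaces and collapse runs of whitespace (per UPDATE 2, losing them is fine).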
  line.scan(FIELD)[0...-1].map{ |matches| (matches[0] || matches[1]).tr(',', ' ').gsub(/\s+/, ' ') }
end

Adapted from @dbenhur and @ghostdog74's answer in How can I process a CSV file with "bad commas"?
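
In case it helps anyone, this is the minimal harness I use to exercise it (just a sketch, reusing the lines array defined above; output omitted here):

lines.each do |line|
  fields = parse_sql_server_2008_csv_line(line)
  puts "#{fields.length}: #{fields.inspect}"
end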

Solution

The following uses a regexp and String#scan. I observe that in the broken CSV format you're dealing with, a " character only has quoting properties when it comes at the beginning and end of a field.

Scan moves through the string successively matching the regexp, so the regexp can assume its start match point is the beginning of a field. We construct the regexp so it can match a balanced quoted field with no internal quotes (QUOTED) or a string of non-commas (UNQUOTED). When either alternative field representation is matched, it must be followed by a separator, which can be either a comma or the end of the string (SEP).

Because UNQUOTED can match a zero length field before a separator, the scan always matches an empty field at the end which we discard with [0...-1]. Scan produces an array of tuples; each tuple is an array of the capture groups, so we map over each element picking the captured alternate with matches[0] || matches[1].
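
To make that concrete, here is roughly what the intermediate values look like for one of the simpler lines (my own trace, using the definitions below, so treat the exact values as illustrative):

line = 'Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00'
line.scan(FIELD)
# => [[nil, "Electrical"], [nil, "197135021E"], ["SERVICE, OUTLETS", nil],
#     [nil, "1997-05-15 00:00:00"], [nil, ""]]   # trailing empty field
line.scan(FIELD)[0...-1].map { |m| m[0] || m[1] }
# => ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]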

None of your example lines show a field which contains both a comma and a quote -- I have no idea how it would be legally represented and this code probably won't recognize such a field correctly.

SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/

FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/

def ugly_parse line
  line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
end

lines.each do |l|
  puts l
  puts ugly_parse(l).inspect
  puts
end

# Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00
# ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]
# 
# Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
# ["Plumbing", "196222006P", "REPLACE LEAD WATER SERVICE W/1\" COPPER", "1996-08-09 00:00:00"]
# 
# Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
# ["Construction", "197133031B", "MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]
