What's a semantically-correct way to parse CSV from SQL Server 2008?


Problem description


I got a CSV dump from SQL Server 2008 that has lines like this:

Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00

parse_dbenhur is pretty, but can it be rewritten to support the presence of both commas and quotes? parse_ugly is, well, ugly.

# @dbenhur's excellent answer, which works 100% for what I originally asked for
SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/
FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/
def parse_dbenhur(line)
  line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
end

def parse_ugly(line)
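  # Heuristic: split on every comma, then walk the pieces and glue a piece onto
  # the previous field while an "open quote" flag is set. Glued pieces are
  # rejoined without the comma, which is why embedded commas get dropped.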
  dumb_fields = line.chomp.split(',').map { |v| v.gsub(/\s+/, ' ') }
  fields = []
  open = false
  dumb_fields.each_with_index do |v, i|
    open ? fields.last.concat(v) : fields.push(v)
    open = (v.start_with?('"') and (v.count('"') % 2 == 1) and dumb_fields[i+1] and dumb_fields[i+1].start_with?(' ')) || (open and !v.end_with?('"'))
  end
  fields.map { |v| (v.start_with?('"') and v.end_with?('"')) ? v[1..-2] : v }
end

lines = []
lines << 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00'
lines << 'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00'
lines << 'Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00'

require 'csv'
lines.each do |line|
  puts
  puts line
  begin
    c = CSV.parse_line(line)
    puts "#{c.to_csv.chomp} (size #{c.length})"
  rescue
    puts "FasterCSV says: #{$!}"
  end
  a = parse_ugly(line)
  puts "#{a.to_csv.chomp} (size #{a.length})"
  b = parse_dbenhur(line)
  puts "#{b.to_csv.chomp} (size #{b.length})"
end

Here's the output when I run it:

Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
FasterCSV says: Illegal quoting in line 1.
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)

Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
FasterCSV says: Unclosed quoted field on line 1.
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)

Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS""",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS""""",1997-05-15 00:00:00 (size 5)

Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS"" FOOBAR",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS"" FOOBAR""",1997-05-15 00:00:00 (size 5)

Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00 (size 4)
Construction,198120036B,"""""MERITER""","""DO IT CTR"""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)
Construction,198120036B,"""""""MERITER""""","""""DO IT CTR"""""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)

UPDATE

Note that the CSV uses double quotes when a field has a comma.

UPDATE 2

It's fine if commas are stripped out of the fields in question... my parse_ugly method doesn't preserve them.

UPDATE 3

I learned from the client that it's SQL Server 2008 that's exporting this strange CSV - which has been reported to Microsoft here and here.

UPDATE 4

@dbenhur's answer worked perfectly for what I originally asked for, but pointed out that I neglected to show lines with both commas and quotes. I will accept @dbenhur's answer - but I'm hoping it can be improved to work on all the lines above.

HOPEFULLY FINAL UPDATE

This code works (and I would consider it "semantically correct"):

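# My reading of what each pattern does:
# QUOTED   captures the body of a double-quoted field; the body may contain
#          anything except a quote, or a doubled quote ("") not followed by
#          yet another quote
# SEPQ     the comma that ends a quoted field -- it must not be followed by a space
# UNQUOTED captures a run of anything except commas
# SEPU     the comma that ends an unquoted field -- it must be followed by a
#          non-space, or by spaces plus quote-free, comma-free text and another comma
# FIELD    a quoted field with its separator, an unquoted field with its
#          separator, or the end of the string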
QUOTED = /"((?:[^"]|(?:""(?!")))*)"/
SEPQ = /,(?! )/
UNQUOTED = /([^,]*)/
SEPU = /,(?=(?:[^ ]|(?: +[^",]*,)))/
FIELD = /(?:#{QUOTED}#{SEPQ})|(?:#{UNQUOTED}#{SEPU})|\Z/

def parse_sql_server_2008_csv_line(line)
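  # Pick whichever alternative captured, then turn any embedded commas into
  # spaces and collapse runs of whitespace (per UPDATE 2, losing them is fine).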
  line.scan(FIELD)[0...-1].map{ |matches| (matches[0] || matches[1]).tr(',', ' ').gsub(/\s+/, ' ') }
end

Adapted from @dbenhur and @ghostdog74's answer in How can I process a CSV file with "bad commas"?
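
In case it helps anyone, this is the minimal harness I use to exercise it (just a sketch, reusing the lines array defined above; output omitted here):

lines.each do |line|
  fields = parse_sql_server_2008_csv_line(line)
  puts "#{fields.length}: #{fields.inspect}"
end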

Solution

The following uses a regexp and String#scan. I observe that in the broken CSV format you're dealing with, a " character only has quoting properties when it comes at the beginning and end of a field.

Scan moves through the string successively matching the regexp, so the regexp can assume its start match point is the beginning of a field. We construct the regexp so it can match a balanced quoted field with no internal quotes (QUOTED) or a string of non-commas (UNQUOTED). When either alternative field representation is matched, it must be followed by a separator, which can be either a comma or the end of the string (SEP).

Because UNQUOTED can match a zero length field before a separator, the scan always matches an empty field at the end which we discard with [0...-1]. Scan produces an array of tuples; each tuple is an array of the capture groups, so we map over each element picking the captured alternate with matches[0] || matches[1].
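
To make that concrete, here is roughly what the intermediate values look like for one of the simpler lines (my own trace, using the definitions below, so treat the exact values as illustrative):

line = 'Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00'
line.scan(FIELD)
# => [[nil, "Electrical"], [nil, "197135021E"], ["SERVICE, OUTLETS", nil],
#     [nil, "1997-05-15 00:00:00"], [nil, ""]]   # trailing empty field
line.scan(FIELD)[0...-1].map { |m| m[0] || m[1] }
# => ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]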

None of your example lines show a field which contains both a comma and a quote -- I have no idea how it would be legally represented and this code probably won't recognize such a field correctly.

SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/

FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/

def ugly_parse line
  line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
end

lines.each do |l|
  puts l
  puts ugly_parse(l).inspect
  puts
end

# Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00
# ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]
# 
# Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
# ["Plumbing", "196222006P", "REPLACE LEAD WATER SERVICE W/1\" COPPER", "1996-08-09 00:00:00"]
# 
# Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
# ["Construction", "197133031B", "MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]
