如何将逗号分隔的文件转换为管道分隔在vb.net [英] How to convert comma-delimited file to pipe-delimited in vb.net

查看:192
本文介绍了如何将逗号分隔的文件转换为管道分隔在vb.net的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有很多搜索结果在网络上(和在SO)的类似与我需要做的,但我还没有遇到一个解决方案,但对于我的特殊情况。 / p>

我有一个以逗号分隔的文件,其中只有在其中有逗号的列中有双引号。



以这个例子:

  123,box,toy,phone,red,car,cat,dog,bike,pencil,man,africa,yellow,jump,rope

该行的输出必须是:

  123 | box,toy | phone |红色,汽车,猫,狗|自行车,铅笔| man | africa | yellow | jump,rope 
p $ p>

我现在有这个代码:

 使用sr作为新的StreamReader (csvFilePath)
Dim line As String =
Dim strReplacerQuoteCommaQuote As String = Chr(34)& ,& Chr(34)
Dim strReplacerQuoteComma As String = Chr(34)& ,
Dim strReplacerCommaQuote As String =,& Chr(34)

Do While sr.Peek<> -1
line = sr.ReadLine
line = Replace(line,strReplacerQuoteCommaQuote,|)
line = Replace(line,strReplacerQuoteComma,|)
line = Replace (line,strReplacerCommaQuote,|)
line = Replace(line,Chr(34),)

Console.WriteLine(line:& line)
循环
结束使用

该进程的问题是当我到达第四个Replace ()行字符串看起来像这样:

  123 | box,toy | phone | red,car,cat,dog | bike ,铅笔|男子,非洲,黄|跳,绳子



<但我不能只是做一个替换所有的逗号,显然。



我该怎么做?是否有可以处理此问题的RegEx语句?



更新工作代码



Avinash的评论中的的正则表达式解释,并确保upvote @ mathematicscoffee的答案在那里,而你在它,因为我基本上只偷了他的正则表达式。



编辑:
关于您的效果问题,我创建了一个包含90k行的文件:

  ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,黄,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ

这大致相当于35MB的文件大小,我的笔记本电脑(没有什么特别的)会在大约6.5秒解析。



是的,正则表达式很慢,TextFieldParser类也被广泛报告为不是最快的,但是如果您仍然处理超过5分钟,您的代码中显然还有其他瓶颈。请注意,我实际上并不对解析的结果做任何事情。



编辑2:



时间到了,我创建了一个包含150k行的输入文件的:

  abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,ABCDEFGHIJKLMNOPQRSTUVWXYZ,ABCDEFGHIJKLMNOPQRSTUVWXYZ

每个行有1140个字符,总文件。



使用以下代码读取,转换和写回到新文件需要 29 秒。 b
$ b

  Dim line,result As String 
Dim replace As String =,(?=([^] * (d:\output.txt)
使用sr作为新的StreamReader(d:\\ \input.txt)
While not sr.EndOfStream
line = sr.ReadLine
result = Regex.Replace(line,replace,Function(m)m.Value.Replace ,,|))
sw.WriteLine(result.Replace(Chr(34),))
结束时
结束使用
结束使用

编辑3:使用@ sln的正则表达式,文件到 4 秒。

  =,([^,] *(?:[^] *)?[^,] *) =| $ 1
Dim rgx As New Regex(pattern)
使用sw作为新的StreamWriter(d:\output.txt)
使用sr作为新的StreamReader(d: \input.txt)
While not sr.EndOfStream
line = sr.ReadLine
result = rgx.Replace(line,replacement)
sw.WriteLine(result.Replace (Chr(34),))
结束时
结束使用
结束使用

所以你走了,我想你有一个赢家。作为sln状态,这是一个相对测试,因此机器速度是不相关的。

 ,(?=([^] * [^] *)* [^] *)花了29秒
,([^,] ?=,| $)花费4秒

最后(仅仅为了完整性) jawood2005是非常可行的:

  Dim line As String 
Dim fields As String()
使用sw As新的StreamWriter(d:\output.txt)
使用tfp作为新的FileIO.TextFieldParser(d:\input.txt)
tfp.TextFieldType = FileIO.FieldType.Delimited
tfp.Delimiters = New String(){,}
tfp.HasFieldsEnclosedInQuotes = True
而不是tfp.EndOfData
fields = tfp.ReadFields
line = String。加入(|,字段)
sw.WriteLine(line.Replace(Chr(34),))
结束时
结束使用
结束使用

使用与正则表达式解决方案相同的150k行输入文件,这会在 18 秒内完成,所以比我更好,但sln赢得了最快的解决方案的奖你的问题。


There are lots of search results on the web (and in SO) for something similar to what I need to do, but I haven't run into a solution yet for my particular situation.

I have a comma-delimited file in which only columns that have commas within them have double quotes around them. Other fields that have no comma in them are simply separated by the comma.

Take this example:

123,"box,toy",phone,"red,car,cat,dog","bike,pencil",man,africa,yellow,"jump,rope"

The output for that line needs to be:

123|box,toy|phone|red,car,cat,dog|bike,pencil|man|africa|yellow|jump,rope

I have this code currently:

Using sr As New StreamReader(csvFilePath)
    Dim line As String = ""
    Dim strReplacerQuoteCommaQuote As String = Chr(34) & "," & Chr(34)
    Dim strReplacerQuoteComma As String = Chr(34) & ","
    Dim strReplacerCommaQuote As String = "," & Chr(34)

    Do While sr.Peek <> -1
        line = sr.ReadLine
        line = Replace(line, strReplacerQuoteCommaQuote, "|")
        line = Replace(line, strReplacerQuoteComma, "|")
        line = Replace(line, strReplacerCommaQuote, "|")
        line = Replace(line, Chr(34), "")

        Console.WriteLine("line: " & line)
    Loop
End Using

The problem with that process is when I get to the fourth Replace() line the string looks like this:

123|box,toy|phone|red,car,cat,dog|bike,pencil|man,africa,yellow|jump,rope

So man and africa need pipes after them, but I can't just do a Replace on all commas, obviously.

How can I do this? Is there a RegEx statement that could handle this?

UPDATE With Working Code

The link in Avinash's comment had the answer that I went with. I Imported System.Text.RegularExpressions and used the following:

Using sr As New StreamReader(csvFilePath)
    Dim line As String = ""
    Dim strReplacerQuoteCommaQuote As String = Chr(34) & "," & Chr(34)
    Dim strReplacerQuoteComma As String = Chr(34) & ","
    Dim strReplacerCommaQuote As String = "," & Chr(34)

    Do While sr.Peek <> -1
        line = sr.ReadLine
        Dim pattern As String = "(,)(?=(?:[^""]|""[^""]*"")*$)"
        Dim replacement As String = "|"
        Dim regEx As New Regex(pattern)

        Dim newLine As String = regEx.Replace(line, replacement)
        newLine = newLine.Replace(Chr(34), "")

        Console.WriteLine("newLine: " & newLine)
    Loop
End Using

解决方案

This seems to work for your example:

Dim result = Regex.Replace(input, ",(?=([^""]*""[^""]*"")*[^""]*$)", Function(m) m.Value.Replace(",", "|"))
result = result.Replace(Chr(34), "")

See the accepted answer here for the regex explanation and be sure to upvote @mathematical.coffee 's answer there while you're at it as I basically just stole his regex.

Edit: Regarding your performance issue, I created a file with 90k lines of:

abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,yellow,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"

which roughly equals a file size of 35MB, my laptop (nothing special) will parse that in around 6.5 seconds.

Yes, the regex is slow and the TextFieldParser class is also widely reported as not being the quickest but you obviously have some other bottleneck in your code if you are still processing for over 5 minutes. Please note that I am not actually doing anything with the parsed result though.

Edit 2: Ok, I thought I'd have one last go at this (I was bored this morning) but I still can't replicate your extended conversion time.

Time to get brutal, I created an input file with 150k lines of:

abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"

Each line has 1140 characters, total file size ~167MB.

Reading, converting and writing back out to a new file using the following code took 29 seconds.

Dim line, result As String
Dim replace As String = ",(?=([^""]*""[^""]*"")*[^""]*$)"
Using sw As New StreamWriter("d:\output.txt")
    Using sr As New StreamReader("d:\input.txt")
        While Not sr.EndOfStream
            line = sr.ReadLine
            result = Regex.Replace(line, replace, Function(m) m.Value.Replace(",", "|"))
            sw.WriteLine(result.Replace(Chr(34), ""))
        End While
    End Using
End Using

Edit 3: Using @sln's regex and this code cuts the processing time for the same file to 4 seconds.

Dim line, result As String
Dim pattern As String = ",([^,""]*(?:""[^""]*"")?[^,""]*)(?=,|$)"
Dim replacement As String = "|$1"
Dim rgx As New Regex(pattern)
Using sw As New StreamWriter("d:\output.txt")
    Using sr As New StreamReader("d:\input.txt")
        While Not sr.EndOfStream
            line = sr.ReadLine
            result = rgx.Replace(line, replacement)
            sw.WriteLine(result.Replace(Chr(34), ""))
        End While
    End Using
End Using

So there you go, I think you have a winner. As sln states, this is a relative test so machine speed is irrelevant.

,(?=([^"]*"[^"]*")*[^"]*$)          took 29 seconds
,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)  took 4 seconds

Finally (and just for completeness) the solution proposed by @jawood2005 is very workable:

Dim line As String
Dim fields As String()
Using sw As New StreamWriter("d:\output.txt")
    Using tfp As New FileIO.TextFieldParser("d:\input.txt")
        tfp.TextFieldType = FileIO.FieldType.Delimited
        tfp.Delimiters = New String() {","}
        tfp.HasFieldsEnclosedInQuotes = True
        While Not tfp.EndOfData
            fields = tfp.ReadFields
            line = String.Join("|", fields)
            sw.WriteLine(line.Replace(Chr(34), ""))
        End While
    End Using
End Using

Using the same 150k row input file as the regex solutions this completes in 18 seconds, so better than mine but sln wins the prize for the fastest solution to your problem.

这篇关于如何将逗号分隔的文件转换为管道分隔在vb.net的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆