How to count occurrences of separator in string excluding those in quotes


Problem description

I have a file with comma separated values like this:

1.42104E+16,220899,1,,e-page remote auto,,,"Allied Martian Banks, P.L.C.",Moon,MN,,
1.42105E+16,637039,1,,e-page remote auto,,,Bank Of Jupiter,Europa,IO,,

I would like to count the number of commas excluding those in quotation marks such as "Allied Martian Banks, P.L.C.".

I know that:

length(i.data_record)-length(replace(i.data_record,',',''))

would return the number of commas, but this would count an extra comma in the 1st line compared to the 2nd when, for my purposes, they should be counted as having the same number.

Is there any quick and simple way of ignoring the commas in quotation marks?

I understand I could create a loop and start breaking the string in bits, counting them, and whenever I find a quotation mark ignore any commas until I find another quotation mark, however I would like to know if there's any simpler, more streamlined way of achieving this without resorting to loops.

Many thanks!

Solution

Eliminate the delimited content first, count afterwards:

regexp_count (
    regexp_replace (
        regexp_replace (
            i.data_record
          , '(^|,)"[^"]*"(,|$)'
          , '\1\2'
        )
      , '(^|,)"[^"]*"(,|$)'
      , '\1\2'
    )
  , ',' 
) 

The nesting of regexp_replace calls is unfortunately necessary in order to handle consecutive quote-delimited fields correctly: the comma separating two such fields is consumed by the first match and thus will not be available to start the subsequent match.
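To see why a single pass is not enough, the two passes can be traced outside the database. The following is a minimal sketch in Python rather than Oracle SQL. It assumes Python's `re.sub` handles this particular pattern the same way `regexp_replace` does, and the helper name `strip_quoted` is made up for illustration.

```python
import re

# Same pattern as above: a quote-delimited field together with the
# separators (or string boundaries) on either side of it.
QUOTED_FIELD = re.compile(r'(^|,)"[^"]*"(,|$)')

def strip_quoted(record: str) -> str:
    """One pass: drop quoted fields, keep the separators around them."""
    return QUOTED_FIELD.sub(r'\1\2', record)

record = '1,"data,and more so","more data,and even more so"'

once = strip_quoted(record)   # the second field survives: its leading comma was consumed
twice = strip_quoted(once)    # the second pass removes it

print(once)              # 1,,"more data,and even more so"
print(twice)             # 1,,
print(twice.count(','))  # 2
```

Because every second quoted field in a run is skipped at most once, two passes should be sufficient regardless of how many quoted fields follow each other.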

Oracle's regexen do not support the lookahead operator which would be the natural way to handle this situation.
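For comparison, this is roughly what the lookahead variant could look like in an engine that does support it. The sketch uses Python's `re` module purely as an illustration; it is not something Oracle can run, and the function name `count_separators` is an assumption of this example.

```python
import re

# (?=,|$) only checks for the trailing separator without consuming it,
# so the same comma can still open the next match.
QUOTED_FIELD_LOOKAHEAD = re.compile(r'(^|,)"[^"]*"(?=,|$)')

def count_separators(record: str) -> int:
    """Remove quoted fields in a single pass, then count the remaining commas."""
    stripped = QUOTED_FIELD_LOOKAHEAD.sub(r'\1', record)
    return stripped.count(',')

print(count_separators('1,"data,and more so","more data,and even more so"'))    # 2
print(count_separators('1,"data,and more so",2,"more data,and even more so"'))  # 3
```

With the lookahead, one replacement pass is enough even for consecutive quoted fields.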

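Given the performance hit of regexp_... calls, you might be better off using the following (shown with a single regexp_replace; nest it twice as above if consecutive quoted fields can occur):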

length(i.data_record) - length ( replace ( regexp_replace ( i.data_record, '(^|,)"[^"]*"(,|$)', '\1\2' ),',','' ) )

Caveat

This solution does not handle dquotes within field values, which are usually represented as "" or \".

The former case can be handled elegantly: Instead of interpreting a "" inside a quote-delimited field, consider the whole field content as a juxtaposition of 1 or more dquote-delimited strings that do not contain dquotes. While you wouldn't follow this route in processing the data (all dquotes would be lost), you may employ this perspective for the sake of counting:

regexp_count (
    regexp_replace (
        regexp_replace (
            i.data_record
          , '(^|,)("[^"]*")+(,|$)'  -- changed
          , '\1\3'                  -- changed
        )
      , '(^|,)("[^"]*")+(,|$)'   -- changed
      , '\1\3'                   -- changed
    )
  , ',' 
) 
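The effect of the `("[^"]*")+` group can be checked outside Oracle as well. The sketch below is in Python and assumes its `re` module treats this pattern the same way. The loop of two passes mirrors the nested regexp_replace calls, and `count_separators` is a hypothetical helper name.

```python
import re

# ("[^"]*")+ accepts one or more back-to-back quoted chunks, which is how
# a field containing doubled quotes ("") looks to this pattern.
QUOTED_CHUNKS = re.compile(r'(^|,)("[^"]*")+(,|$)')

def count_separators(record: str) -> int:
    stripped = record
    for _ in range(2):  # two passes, mirroring the nested regexp_replace calls
        stripped = QUOTED_CHUNKS.sub(r'\1\3', stripped)
    return stripped.count(',')

# The second field is "data",and more so with its inner quotes doubled.
print(count_separators('1,"""data"",and more so",2,"more data,and even more so"'))  # 3
```

The field `"""data"",and more so"` is read as the three chunks `""`, `"data"` and `",and more so"`, so the comma inside it never gets counted.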

Test cases

-- works
select regexp_count ( regexp_replace ( regexp_replace ( '1,"data,and more so","more data,and even more so"', '(^|,)"[^"]*"(,|$)', '\1\2' ), '(^|,)"[^"]*"(,|$)', '\1\2' ), ',' ) from dual;
select regexp_count ( regexp_replace ( regexp_replace ( '1,"data,and more so",2,"more data,and even more so"', '(^|,)"[^"]*"(,|$)', '\1\2' ), '(^|,)"[^"]*"(,|$)', '\1\2' ), ',' ) from dual;

select regexp_count ( regexp_replace ( regexp_replace ( '1,"""data"",and more so",2,"more data,and even more so"', '(^|,)("[^"]*")+(,|$)', '\1\3' ), '(^|,)("[^"]*")+(,|$)', '\1\3' ), ',' ) from dual;

-- fails
select regexp_count ( regexp_replace ( '1,"data,and more so","more data,and even more so"', '(^|,)"[^"]*"(,|$)', '\1\2' ), ',' ) from dual;
select regexp_count ( regexp_replace ( '1,"data,and more so",2,"more data,and even more so"', '(^|,)"[^"]*"(,|$)', '\1\2' ), ',' ) from dual;
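As a cross-check for the expected counts, the loop-based approach mentioned in the question can be sketched in a few lines. This is an illustrative Python version, not PL/SQL. The function name and its parameters are assumptions of this example, and like the regexp solution it does not handle backslash-escaped quotes.

```python
def count_separators(record: str, sep: str = ',', quote: str = '"') -> int:
    """Count separators that lie outside quote-delimited sections."""
    inside_quotes = False
    count = 0
    for ch in record:
        if ch == quote:
            # A doubled quote toggles the flag twice, so it causes no net change.
            inside_quotes = not inside_quotes
        elif ch == sep and not inside_quotes:
            count += 1
    return count

line1 = '1.42104E+16,220899,1,,e-page remote auto,,,"Allied Martian Banks, P.L.C.",Moon,MN,,'
line2 = '1.42105E+16,637039,1,,e-page remote auto,,,Bank Of Jupiter,Europa,IO,,'

print(count_separators(line1) == count_separators(line2))  # True
```

Both sample lines from the question yield the same count this way, which is the result the regexp-based queries are meant to reproduce.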
