根据值列表查询子字符串 [英] Querying substrings against a list of values
问题描述
很不幸,我正在从一个数据集读取数据,但我无权进行修改.它具有连接的值字符串,我想选择记录,这些记录的任何子字符串(由给定字符分隔)都与特定列表中的任何值匹配.我将通过Python传递查询,因此不会将其与静态列表进行比较.
I'm reading from a dataset that I unfortunately don't have the access to modify. It has concatenated strings of values, and I want to select records for which any of those substrings (as split by a given character) matches any of the values in a specific list. I'll be passing the queries in via Python, so it won't be compared against a static list.
例如,表格如下:
CrappyColumn
-----------
1;2
4
1
2;1
1;3
2
我可能想返回其中包含2或4的任何内容.所以,我的结果应该是:
And I might want to return anything that has 2 or 4 in it. So, my result should be:
1;2
4
2
2;1
我玩过regexp_substr并得到了一些实际有用的东西;但是,当我在完整的数据集上运行它时,它将无限期地运行(最多要放弃10分钟)(该数据集仅包含大约三千条记录,其值通常长达几百个字符).我需要在合理的时间内可以重复执行的内容.
I have played with regexp_substr and gotten something that actually works; however, it just runs indefinitely (as much as 10 minutes before I give up) when I run it on the full dataset (which only includes about three thousand records with values that are often a couple hundred characters long). I need something that works in a reasonable amount of time for repeated execution.
我意识到,即使有了变量比较列表,我也可以编写Python代码来解析列表并构造多个LIKE语句,但这似乎效率不高,我认为有更好的方法
I realize that--even with a variable comparison list--I could just write my Python code to parse the list and construct multiple LIKE statements, but that seems inefficient, and I assume that there is a better way.
这是我做的耗时太长的事情:
And here's what I've done that takes too long:
SELECT DISTINCT CrappyColumn
FROM
(SELECT DISTINCT CrappyColumn, regexp_substr(CrappyColumn, '[^;]+', 1, LEVEL) as UGH
FROM CrappyTable
CONNECT BY regexp_substr(CrappyColumn, '[^;]+', 1, LEVEL) IS NOT NULL)
WHERE UGH IN ('2', '4')
是否有更好,更快,更清洁的方式来实现这一目标?
Is there a better, faster, cleaner way to accomplish this?
编辑-解决方法:
感谢vkp的帮助,这是我实现的:
Thanks to vkp's help, here is what I implemented:
regexp_like(SITE_ID, '^(2|4)(:)|(:)(2|4)(:)|(:)(2|4)$|^(2|4)$')
我将其修改为最终产品,以便它可以处理多个字符(通过将[2|4]
更改为(2|4)
).在搜索非一位数字的情况下,此方法有效.
I modified it for my final product, so that it can handle strings of more than one character--by changing [2|4]
to (2|4)
. This works in cases of searching for numbers that aren't single-digit.
推荐答案
一种简单的方法是使用regexp_like
检查列表中是否包含2
或4
.
A simpler method would be to use regexp_like
to check if the list has 2
or 4
in it.
select *
from tablename
where regexp_like(crappycolumn,'^[2|4][^0-9]|[^0-9][2|4][^0-9]|[^0-9][2|4]$|^[2|4]$')
-
^[2|4][^0-9]
-以2或4开头,后跟数字. -
[^0-9][2|4][^0-9]
-2或4后面没有数字. -
[^0-9][2|4]$
-以2或4结尾,不带数字. -
^[2|4]$
-2或4是字符串中的唯一字符. ^[2|4][^0-9]
- Starts with 2 or 4 not followed by a digit.[^0-9][2|4][^0-9]
- 2 or 4 not succeeded or preceded by a digit.[^0-9][2|4]$
- Ends with 2 or 4 not preceded by a digit.^[2|4]$
- 2 or 4 is the only character in the string.
这篇关于根据值列表查询子字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!