Extracting strings between distinct characters using Hive SQL
Question
I have a field called geo_data_display that contains country, region and DMA. Each value sits between an "=" and the following "&": country between the first "=" and the first "&", region between the second "=" and the second "&", and DMA between the third "=" and the third "&". Below is a reproducible sample of the data. Country is always alphabetic, but region and DMA can be either numeric or alphabetic, and DMA doesn't exist for all countries.
Some sample values are:
country=us&region=tx&dma=625&domain=abc.net&zipcodes=76549
country=us&region=ca&dma=803&domain=abc.com&zipcodes=90404
country=tw&region=hsz&domain=hinet.net&zipcodes=300
country=jp&region=1&dma=a&domain=hinet.net&zipcodes=300
I have some sample SQL, but the geo_dma line isn't working at all and the geo_region line only works for character values:
SELECT
UPPER(REGEXP_REPLACE(split(geo_data_display, '\\&')[0], 'country=', '')) AS geo_country
,UPPER(split(split(geo_data_display, '\\&')[1],'\\=')[1]) AS geo_region
,split(split(cast(geo_data_display as int), '\\&')[2],'\\=')[2] AS geo_dma
FROM mytable
Recommended answer
regexp_extract(string subject, string pattern, int index)
Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 1) returns 'the'.
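To make the role of the index argument concrete, here is a small sketch based on the same example string. The column aliases are illustrative only, and it assumes a Hive version that accepts a SELECT without a FROM clause.

```sql
-- Index 0 returns the whole match; 1 and 2 return the first and second capture groups.
SELECT
  regexp_extract('foothebar', 'foo(.*?)(bar)', 0) AS whole_match,   -- 'foothebar'
  regexp_extract('foothebar', 'foo(.*?)(bar)', 1) AS first_group,   -- 'the'
  regexp_extract('foothebar', 'foo(.*?)(bar)', 2) AS second_group;  -- 'bar'
```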
select
regexp_extract(geo_data_display, 'country=(.*?)(&region)', 1),
regexp_extract(geo_data_display, 'region=(.*?)(&dma)', 1),
regexp_extract(geo_data_display, 'dma=(.*?)(&domain)', 1)
from mytable
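One caveat with the query above: each pattern depends on the name of the next key, so a row with no dma (such as the country=tw sample) has nothing for the region pattern to match. The following is a sketch of two possible alternatives, not the original answerer's method. It assumes the table is called mytable as in the question, that keys never repeat within a row, and that values never contain "&" or "=". The aliases kv and t are made up for illustration.

```sql
-- Option 1: capture everything up to the next '&' (or end of string),
-- so no pattern depends on which key comes next.
-- Group 1 is the '(^|&)' anchor; group 2 is the value.
SELECT
  UPPER(regexp_extract(geo_data_display, '(^|&)country=([^&]*)', 2)) AS geo_country,
  UPPER(regexp_extract(geo_data_display, '(^|&)region=([^&]*)', 2))  AS geo_region,
  regexp_extract(geo_data_display, '(^|&)dma=([^&]*)', 2)            AS geo_dma
FROM mytable;

-- Option 2: parse the whole field into a map with str_to_map,
-- using '&' between pairs and '=' between key and value.
-- A key that is absent from a row (e.g. dma) comes back as NULL.
SELECT
  UPPER(kv['country']) AS geo_country,
  UPPER(kv['region'])  AS geo_region,
  kv['dma']            AS geo_dma
FROM (
  SELECT str_to_map(geo_data_display, '&', '=') AS kv
  FROM mytable
) t;
```

As for the query in the question, the geo_dma line most likely fails for two reasons: casting the whole string to int cannot produce a usable value, and 'dma=625' splits on "=" into only two elements, so index [2] is out of range. Either option above avoids both issues and treats numeric and character values the same way, since everything stays a string.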