Count the number of rows that contain a letter/number



What I am trying to achieve is straightforward, but it is a little difficult to explain, and I don't know whether it is even possible in Postgres. I am at a fairly basic level: SELECT, FROM, WHERE, LEFT JOIN ON, HAVING, etc., the basic stuff.

I am trying to count the number of rows that contain a particular letter/number and display that count against the letter/number.

i.e. how many rows have entries that contain an "a/A" (case-insensitive)?

The table I'm querying is a list of film names. All I want to do is group and count 'a-z' and '0-9' and output the totals. I could run 36 queries sequentially:

SELECT filmname FROM films WHERE filmname ILIKE '%a%';
SELECT filmname FROM films WHERE filmname ILIKE '%b%';
SELECT filmname FROM films WHERE filmname ILIKE '%c%';

And then run pg_num_rows on each result to find the number I require, and so on.

I know how expensive LIKE is, and ILIKE even more so, so I would prefer to avoid that. Although the data (below) contains both upper and lower case, I want the results to be case-insensitive, i.e. for "The Men Who Stare At Goats", the a/A, t/T and s/S wouldn't count twice in the result set. I can copy the table to a secondary working table with all the data passed through strtolower and run the query against that, if it makes the query simpler or easier to construct.

An alternative could be something like

SELECT sum(length(regexp_replace(filmname, '[^Xx]', '', 'g'))) FROM films;

for each letter, but that is again 36 queries and 36 datasets. I would prefer to get the data in a single query.

Here is a short data set of 14 films from my set (which actually contains 275 rows)

District 9
Surrogates
The Invention Of Lying
Pandorum
UP
The Soloist
Cloudy With A Chance Of Meatballs
The Imaginarium of Doctor Parnassus
Cirque du Freak: The Vampires Assistant
Zombieland
9
The Men Who Stare At Goats
A Christmas Carol
Paranormal Activity

If I manually lay out each letter and number in a column, mark an x in that column whenever the letter appears in a film title, and then count the marks to produce a total, I get something like the table below. Each vertical column of x's lists the letters in one film name, regardless of how many times a letter appears or its case.

The result for the short set above is:

A  x x  xxxx xxx  9 
B       x  x      2 
C x     xxx   xx  6
D x  x  xxxx      6
E  xx  xxxxx x    8
F   x   xxx       4 
G  xx    x   x    4
H   x  xxxx  xx   7
I x x  xxxxx  xx  9
J                 0
K         x       1
L   x  xx  x  xx  6
M    x  xxxx xxx  8
N   xx  xxxx x x  8
O  xxx xxx x xxx  10
P    xx  xx    x  5
Q         x       1
R xx x   xx  xxx  8
S xx   xxxx  xx   8
T xxx  xxxx  xxx  10
U  x xx xxx       6
V   x     x    x  3
W       x    x    2
X                 0 
Y   x   x      x  3
Z          x      1 
0                 0  
1                 0  
2                 0 
3                 0
4                 0
5                 0
6                 0
7                 0
8                 0
9 x         x     2

In the example above, each column is a "filmname". As you can see, column 5 marks only a "u" and a "p", and column 11 marks only a "9". The final column is the tally for each letter.

I want to build a query that gives me the result rows A 9, B 2, C 6, D 6, E 8, etc., taking into account every row in my films column. If a letter doesn't appear in any row, I would like a zero.

I don't know if this is even possible, or whether doing it systematically in PHP with 36 queries is the only possibility.

The current dataset has 275 entries and grows by around 8.33 a month (100 a year). I predict it will reach around 1000 rows by 2019, by which time I will no doubt be using a completely different system, so I don't need to worry about trawling through a huge dataset.

The current longest title is "Percy Jackson & the Olympians: The Lightning Thief" at 50 chars (yes, poor film I know ;-) and the shortest is 1, "9".

I am running version 9.0.0 of Postgres.

Apologies if I've said the same thing multiple times in multiple ways; I am trying to give as much information as possible so you know what I am trying to achieve.

If you need any clarification or larger datasets to test with, please just ask and I'll edit as needed.

Suggestions are VERY welcome.

Edit 1

Erwin, thanks for the edits/tags/suggestions. I agree with them all.

Fixed the missing "9" typo as suggested by Erwin. Manual transcription error on my part.

kgrittn, thanks for the suggestion, but I am not able to update the version from 9.0.0. I have asked my provider if they will try to update.

Response

Thanks for the excellent reply, Erwin.

Apologies for the delay in responding, but I have been trying to get your query to work and learning the new keywords to understand the query you created.

I adjusted the query to fit my table structure, but the result set was not as expected (all zeros), so I copied your lines directly and had the same result.

In both cases the result set lists all 36 rows with the appropriate letters/numbers, but every row shows zero as the count (ct).

I have tried to deconstruct the query to see where it may be falling over.

The result of

SELECT DISTINCT id, unnest(string_to_array(lower(film), NULL)) AS letter
FROM  films


is "No rows found". Perhaps it ought to be when extracted from the wider query; I'm not sure.

When I removed the unnest function, the result was 14 rows, all "NULL".

If I change

COALESCE(y.ct, 0) to COALESCE(y.ct, 4)

then my dataset returns 4 for every letter instead of the zeros described above.

Having briefly read up on COALESCE, with "4" being the substitute value, I am guessing that y.ct is NULL and is being replaced with this second value. (Is this to cover rows where the letter in the sequence is not matched, i.e. if no films contain a 'q' then the 'q' row will have a zero value rather than NULL?)

The database I tried this on uses SQL_ASCII encoding, and I wondered if that was somehow a problem, but I had the same result on another server running version 8.4.0 with UTF-8.

Apologies if I've made an obvious mistake, but I am unable to return the dataset I require.

Any thoughts?

Again, thanks for the detailed response and your explanations.

Solution

This query should do the job:

Test case:

CREATE TEMP TABLE films (id serial, film text);
INSERT INTO films (film) VALUES
 ('District 9')
,('Surrogates')
,('The Invention Of Lying')
,('Pandorum')
,('UP')
,('The Soloist')
,('Cloudy With A Chance Of Meatballs')
,('The Imaginarium of Doctor Parnassus')
,('Cirque du Freak: The Vampires Assistant')
,('Zombieland')
,('9')
,('The Men Who Stare At Goats')
,('A Christmas Carol')
,('Paranormal Activity');

Query:

SELECT l.letter, COALESCE(y.ct, 0) AS ct
FROM  (
    SELECT chr(generate_series(97, 122)) AS letter  -- a-z in UTF8!
    UNION ALL
    SELECT generate_series(0, 9)::text              -- 0-9
    ) l
LEFT JOIN (
    SELECT letter, count(id) AS ct
    FROM  (
        SELECT DISTINCT  -- count film once per letter
               id, unnest(string_to_array(lower(film), NULL)) AS letter
        FROM   films
        ) x
    GROUP  BY 1
    ) y  USING (letter)
ORDER  BY 1;

This relies on behavior introduced in PostgreSQL 9.1. Quoting the 9.1 release notes:

Change string_to_array() so a NULL separator splits the string into characters (Pavel Stehule)

Previously this returned a null value.
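
If the query returns 0 for every letter, the server probably predates the string_to_array() change quoted above. A quick way to check, as a minimal sketch (the literal 'abc' is only an illustrative input):

```sql
-- PostgreSQL 9.1 or later returns {a,b,c}.
-- 9.0 and earlier return NULL, so unnest() produces no rows and
-- COALESCE(y.ct, 0) in the main query reports 0 for every letter.
SELECT string_to_array('abc', NULL) AS chars;
```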

  • You can use regexp_split_to_table(lower(film), '') instead of unnest(string_to_array(lower(film), NULL)). This works in versions before 9.1, but it is typically a bit slower, and performance degrades with long strings.

  • I use generate_series() to produce [a-z0-9] as individual rows, then LEFT JOIN the counts to them, so every letter is represented in the result.

  • Use DISTINCT to count every film once.

  • Never worry about 1000 rows. That is peanuts for modern day PostgreSQL on modern day hardware.
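
For servers older than 9.1 (such as the 9.0.0 and 8.4.0 installations mentioned in the question), the first bullet can be sketched as a complete query. It assumes the films (id, film) test table from the test case above; only the character-splitting expression differs from the main query, and count(*) is used in place of count(id), which is equivalent here because id is never NULL.

```sql
SELECT l.letter, COALESCE(y.ct, 0) AS ct
FROM  (
    SELECT chr(generate_series(97, 122)) AS letter  -- a-z
    UNION ALL
    SELECT generate_series(0, 9)::text              -- 0-9
    ) l
LEFT JOIN (
    SELECT letter, count(*) AS ct
    FROM  (
        -- An empty pattern splits the title into single characters;
        -- DISTINCT keeps one row per (film, character) pair.
        SELECT DISTINCT
               id, regexp_split_to_table(lower(film), '') AS letter
        FROM   films
        ) x
    GROUP  BY letter
    ) y USING (letter)
ORDER  BY 1;
```

Characters outside a-z and 0-9 (spaces, the colon) are split out as well, but the LEFT JOIN from the generated list drops them, so they never appear in the result.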
