Data Profiling using PostgreSQL Function
Problem description
I am trying to do some data profiling using a Postgres function. I tried the function below, which results in an error. As I am new to database functions, procedures, etc., I am finding it difficult to fix this issue.
Actual Work:
I want to loop through all columns in a table and perform data profiling, i.e. count, count distinct, nulls and not nulls for character columns, and min and max for numeric and date columns. Please assist.
CREATE OR REPLACE FUNCTION data_profiling (TABLE_VALUE VARCHAR)
RETURNS TABLE (
col_value VARCHAR,
DISTINCT_COUNT INT
)
AS $$
DECLARE
var_c Varchar;
BEGIN
FOR var_c IN(SELECT c.column_name,c.table_name
FROM information_schema.columns c
WHERE lower(c.table_name) = TABLE_VALUE)
LOOP
RETURN QUERY EXECUTE 'SELECT ' || var_c ||' as col_name, count(distinct ' || var_c ||') as distinct_count
FROM ' || TABLE_VALUE || ' group by ' || var_c;
END LOOP;
END; $$
LANGUAGE 'plpgsql';
Error:
ERROR: structure of query does not match function result type
DETAIL: Returned type character(50) does not match expected type character varying in column 1.
CONTEXT: PL/pgSQL function data_profiling(character varying) line 10 at RETURN QUERY
A cursor returns a record, not a varchar, so you need to change your declaration to:
var_c record;
The record will have as many fields as there are columns in your select list, and each one can be referenced through the column's name. It's also better to use the format() function to generate the dynamic SQL.
count() also returns a bigint, not an int. You also need to cast the column you select to varchar, otherwise you can't return e.g. an integer value as the first column.
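To see why the declared result types matter, the types involved can be checked with pg_typeof(). This is only a small illustration with literal values, not part of the function itself; the char(50) literal stands in for the column type named in the error message.

```sql
-- count() produces bigint, so a result column declared as INT will not match.
SELECT pg_typeof(count(*));                 -- bigint

-- A char(50) value is reported as "character", which is not "character varying".
SELECT pg_typeof('abc'::char(50));          -- character

-- Casting to varchar gives the type the function's first column expects.
SELECT pg_typeof('abc'::char(50)::varchar); -- character varying
SELECT pg_typeof(42::varchar);              -- character varying
```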
CREATE OR REPLACE FUNCTION data_profiling (table_value varchar)
  RETURNS TABLE (col_value varchar, distinct_count bigint)
AS
$$
DECLARE
  var_c record;  -- one row of the column list per loop iteration
BEGIN
  -- loop over every column of the requested table in the public schema
  FOR var_c IN (SELECT c.table_schema, c.column_name, c.table_name
                FROM information_schema.columns c
                WHERE lower(c.table_name) = table_value
                  AND c.table_schema = 'public')
  LOOP
    -- build the query with format(); %I quotes identifiers where needed
    RETURN QUERY EXECUTE
      format('SELECT %I::varchar, count(distinct %I) FROM %I.%I group by %I',
             var_c.column_name, var_c.column_name, var_c.table_schema, var_c.table_name, var_c.column_name);
  END LOOP;
END; $$
LANGUAGE plpgsql;
The placeholder %I (a capital i) will take care of properly quoting the column or table name if necessary. You should also make sure you include the schema name.
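As a quick illustration of what %I does, format() can be called on its own. The names "order id" and "Orders" are made up for this example; they are chosen because a space or an uppercase letter forces quoting.

```sql
-- Hypothetical identifiers, used only to show the quoting behaviour of %I.
SELECT format('SELECT %I::varchar, count(distinct %I) FROM %I.%I group by %I',
              'order id', 'order id', 'public', 'Orders', 'order id');
-- Result:
-- SELECT "order id"::varchar, count(distinct "order id") FROM public."Orders" group by "order id"
```

Plain string concatenation with || would have produced unquoted names here, and the generated statement would fail to parse.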
The language name is an identifier, so don't put it in single quotes.
You also don't need to specify a column alias in the generated SQL, as the names of the output columns are defined by the RETURNS TABLE (..) part. That makes the code a bit easier to read.
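The same pattern can be extended towards the counts mentioned in the question. The following is only a sketch under assumptions: the function name data_profiling_summary, its parameter p_table and the output column names are invented here, every column type is assumed to support count(distinct ...), and min/max for numeric and date columns are left out because they would need branching on the column's data type.

```sql
-- Hypothetical variant: one summary row per column instead of one row per value.
CREATE OR REPLACE FUNCTION data_profiling_summary (p_table varchar)
  RETURNS TABLE (col_name varchar, total_rows bigint, distinct_count bigint,
                 null_count bigint, not_null_count bigint)
AS
$$
DECLARE
  var_c record;
BEGIN
  FOR var_c IN (SELECT c.table_schema, c.table_name, c.column_name
                FROM information_schema.columns c
                WHERE lower(c.table_name) = lower(p_table)
                  AND c.table_schema = 'public'
                ORDER BY c.ordinal_position)
  LOOP
    -- %L inserts the column name as a quoted literal, %I as a quoted identifier.
    -- count(col) skips NULLs, so count(*) - count(col) is the number of NULLs.
    RETURN QUERY EXECUTE
      format('SELECT %L::varchar, count(*), count(distinct %I), count(*) - count(%I), count(%I) FROM %I.%I',
             var_c.column_name, var_c.column_name, var_c.column_name, var_c.column_name,
             var_c.table_schema, var_c.table_name);
  END LOOP;
END; $$
LANGUAGE plpgsql;

-- Usage, assuming a table named customers exists in the public schema:
SELECT * FROM data_profiling_summary('customers');
```

Because there is no GROUP BY, each generated query returns exactly one row, so the function yields one row per column of the table.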