Data Profiling using PostgreSQL Function


Problem description

I am trying to do some data profiling using a Postgres function. I have tried the function below, which results in an error. As I am new to database functions, procedures, etc., I am finding it difficult to fix this issue.

Actual Work:

I want to loop through all columns in a table and perform data profiling, i.e. count, count distinct, nulls, and not nulls for character columns, and min and max for numeric and date columns. Please assist.

CREATE OR REPLACE FUNCTION data_profiling (TABLE_VALUE VARCHAR) 
 RETURNS TABLE (
 col_value VARCHAR,
 DISTINCT_COUNT INT
) 
AS $$
DECLARE 
    var_c Varchar;

BEGIN
   FOR var_c IN(SELECT c.column_name,c.table_name
        FROM information_schema.columns c
        WHERE lower(c.table_name) = TABLE_VALUE)  
     LOOP
RETURN QUERY EXECUTE 'SELECT ' || var_c ||' as col_name, count(distinct ' || var_c ||') as distinct_count
FROM ' || TABLE_VALUE || ' group by ' || var_c;
            END LOOP;
END; $$ 
 LANGUAGE 'plpgsql';

Error:

ERROR:  structure of query does not match function result type
DETAIL:  Returned type character(50) does not match expected type character varying in column 1.
CONTEXT:  PL/pgSQL function data_profiling(character varying) line 10 at RETURN QUERY

Solution

A cursor returns a record, not a varchar, so you need to change your declaration to:

var_c record;

The record will have as many fields as there are columns in your select list, and each one can be referenced through the column's name. It's also better to use the format() function to generate the dynamic SQL.
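
As a small illustration of referencing record fields by name, the anonymous block below loops over the same catalog query and prints each field. The table name `my_table` is a placeholder for illustration, not something from the question.

```sql
DO $$
DECLARE
    var_c record;  -- holds one row of the query per iteration
BEGIN
    FOR var_c IN
        SELECT c.table_schema, c.table_name, c.column_name
        FROM information_schema.columns c
        WHERE c.table_name = 'my_table'   -- hypothetical table name
    LOOP
        -- every column of the select list is available as a field of the record
        RAISE NOTICE '%.%.%', var_c.table_schema, var_c.table_name, var_c.column_name;
    END LOOP;
END;
$$;
```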

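count() also returns a bigint, not an int. You also need to cast the column you select to varchar; otherwise you can't return, e.g., an integer value as the first column. The same applies to the character(50) column named in the error message, which does not match the declared varchar.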
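
A quick way to check these types is `pg_typeof()`. The query below is only a sketch with a made-up literal; it shows what `count()` returns and how the cast changes a `char(50)` value into a `varchar`.

```sql
SELECT pg_typeof(count(*))                 AS count_type,   -- bigint
       pg_typeof('abc'::char(50))          AS padded_type,  -- character
       pg_typeof('abc'::char(50)::varchar) AS cast_type;    -- character varying
```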

CREATE OR REPLACE FUNCTION data_profiling (table_value varchar) 
  RETURNS TABLE (col_value varchar, distinct_count bigint)
AS 
$$
DECLARE 
    var_c record;
BEGIN
   FOR var_c IN (SELECT c.table_schema, c.column_name,c.table_name
                 FROM information_schema.columns c
                 WHERE lower(c.table_name) = TABLE_VALUE
                   and c.table_schema = 'public')  
   LOOP
       RETURN QUERY EXECUTE 
        format('SELECT %I::varchar, count(distinct %I) FROM %I.%I group by %I', 
               var_c.column_name, var_c.column_name, var_c.table_schema, var_c.table_name, var_c.column_name);
   END LOOP;
END; $$ 
LANGUAGE plpgsql;

The placeholder %I (a capital i) takes care of properly quoting the column or table name if necessary. You should also make sure you include the schema name.
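
To see what %I does, you can run format() on its own. The identifiers below (`Order Date`, `orders`) are invented for illustration. The second call uses positional placeholders (`%1$I`), which format() also supports, so the column name only has to be passed once.

```sql
-- A name containing a space or upper-case letters is wrapped in double quotes;
-- plain lower-case names are left as they are.
SELECT format('SELECT %I::varchar, count(DISTINCT %I) FROM %I.%I GROUP BY %I',
              'Order Date', 'Order Date', 'public', 'orders', 'Order Date');

-- Same statement with positional arguments: 1 = column, 2 = schema, 3 = table.
SELECT format('SELECT %1$I::varchar, count(DISTINCT %1$I) FROM %2$I.%3$I GROUP BY %1$I',
              'Order Date', 'public', 'orders');

-- Both return:
-- SELECT "Order Date"::varchar, count(DISTINCT "Order Date") FROM public.orders GROUP BY "Order Date"
```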

The language name is an identifier, so don't put it in single quotes.

You also don't need to specify a column alias in the generated SQL, as the names of the output columns are defined by the RETURNS TABLE (...) part. That makes the code a bit easier to read.
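
The function is called like any set-returning function, for example `SELECT * FROM data_profiling('my_table');` (with `my_table` as a placeholder). Because of the `group by`, it returns one row per distinct value of each column. If one summary row per column is wanted instead, as the question's description suggests, a variation along the following lines might work. This is only a sketch: the function name `data_profiling_summary`, the parameter `p_table` and the output column names are made up here, and `count(DISTINCT ...)` will fail for column types that have no equality operator (json, for instance).

```sql
CREATE OR REPLACE FUNCTION data_profiling_summary (p_table varchar)  -- hypothetical name
  RETURNS TABLE (col_name varchar, total_rows bigint, distinct_count bigint,
                 null_count bigint, not_null_count bigint)
AS
$$
DECLARE
    var_c record;
BEGIN
   FOR var_c IN (SELECT c.table_schema, c.table_name, c.column_name
                 FROM information_schema.columns c
                 WHERE lower(c.table_name) = lower(p_table)
                   AND c.table_schema = 'public'
                 ORDER BY c.ordinal_position)
   LOOP
       -- %1$L inserts the column name as a quoted literal (the label),
       -- %1$I inserts it as a quoted identifier (the column itself).
       -- count(col) skips NULLs, so count(*) - count(col) is the number of NULLs.
       RETURN QUERY EXECUTE
        format('SELECT %1$L::varchar, count(*), count(DISTINCT %1$I), '
               || 'count(*) - count(%1$I), count(%1$I) FROM %2$I.%3$I',
               var_c.column_name, var_c.table_schema, var_c.table_name);
   END LOOP;
END;
$$
LANGUAGE plpgsql;

-- Usage, with a placeholder table name:
SELECT * FROM data_profiling_summary('my_table');
```

Min and max for numeric and date columns could be added in the same way, by also selecting `c.data_type` in the loop query and branching on it.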
