PostgreSQL: How to efficiently populate a table with 10 million rows of random strings
Question
I have a table with this schema:
test1 (id integer primary key, padding text)
(with an index)
I want 10 million rows with random padding (padding length at most 1024 characters). How can I generate the data quickly and insert it into the table?
Right now I'm trying this solution:
insert into test1 (select *, random_string(1024) from generate_series(0, 10000000));
where random_string is a function:
create or replace function random_string(length integer) returns text as
$$
declare
    chars text[] := '{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';
    result text := '';
    i integer := 0;
    length2 integer := (select trunc(random() * length + 1));
begin
    if length2 < 0 then
        raise exception 'Given length cannot be less than 0';
    end if;
    for i in 1..length2 loop
        result := result || chars[1+random()*(array_length(chars, 1)-1)];
    end loop;
    return result;
end;
$$ language plpgsql;
It seems that PostgreSQL is creating a temporary table, and generating that temporary table alone will take 2 hours.
Recommended answer
If efficiency is a major concern, it may make more sense to generate your data in a different language, one that can be optimized more carefully, and then use a COPY statement to load the data into your table. For example, if you're on a Linux-y system, you can save this C99 program as generate_test1_data.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MIN_ID 0
#define MAX_ID 10000000
#define MAX_LEN 1024
#define CHARS "0123456789" \
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
              "abcdefghijklmnopqrstuvwxyz"

int main()
{
    char const * const chars = CHARS;
    int const num_chars = strlen(chars);

    printf("COPY test1 FROM STDIN;\n");
    for(int i = MIN_ID; i <= MAX_ID; ++i)
    {
        printf("%d\t", i); // can be optimized if needed
        int const len = rand() % MAX_LEN + 1;
        for(int j = 0; j < len; ++j)
            putchar(chars[rand() % num_chars]);
        putchar('\n');
    }
    printf("\\.\n");
    return 0;
}
and then run these commands:
gcc -std=c99 -Wall generate_test1_data.c -o generate_test1_data
./generate_test1_data > populate_test1.sql
psql -U ... -d ... -f populate_test1.sql
On a development box that I have handy at the moment, this whole thing would take well under ten minutes (or at least, I tried it on one-hundredth of the data, and that took well under six seconds). That's a pretty powerful box, so on your system it may take longer than that, maybe even much longer, but I think not nearly as long as your current approach.