在大型Postgresl数据库上按日期获取最后一行 [英] Getting last rows by date on a large postgresl data base
问题描述
我有一个大数据库(每分钟有6000行插入)在分区表上,当数据库很小时,我正在工作很酷,现在我有一个大型数据库。我正在使用此解决方案以前的解决方案按日期加入SQL,但它正在使用250MB的硬盘并且在我的桌面增长时长大,然后我决定将其更改为简单查询的迭代,这对10行有效,但是缓慢的超过10辆车(15秒)作为响应,并使用超过200MB的硬盘。
I have a large database ( there are 6000 rows inserted per minute ) on a partitioned table, I was working cool when the database was small, now I have a large database. I was using this solution a previous solutions SQL joined by date but it is using 250MB of hard disk and grows up while my table grows, then I decide on change it to an iteration of simple queries, that works well for 10 rows, but get slow with more than 10 cars ( 15 secods for response ) and uses more than 200MB of hard disk.
我的问题是如何更快地建立一个好的查询来解决这个问题
My question is how to build a good query faster for resolve this issue
额外的信息
- 查询在ajax的django应用程序上调用
- 我正在考虑迭代ajax调用而不是一个调用完整的项目列表响应
- 表按日分隔
我的实际查询是
CREATE OR REPLACE FUNCTION gps_get_last_positions (
_plates varchar(8)
)
RETURNS TABLE (
plate varchar,
device_id integer,
date_time_process timestamp with time zone,
latitude double precision,
longitude double precision,
course smallint,
speed smallint,
mileage integer,
gps_signal smallint,
gsm_signal smallint,
alarm_status boolean,
gsm_status boolean,
vehicle_status boolean,
alarm_over_speed boolean,
other text,
realtime double precision
) AS $func$
DECLARE
arr varchar[];
BEGIN
arr := regexp_split_to_array(_plates, E'\\s+');
FOR i IN 1..array_length(arr, 1) LOOP
RETURN QUERY SELECT
gpstracking_vehicles.registration,
gpstracking_device_tracks.device_id,
gpstracking_device_tracks.date_time_process,
gpstracking_device_tracks.latitude,
gpstracking_device_tracks.longitude,
gpstracking_device_tracks.course,
gpstracking_device_tracks.speed,
gpstracking_device_tracks.mileage,
gpstracking_device_tracks.gps_signal,
gpstracking_device_tracks.gsm_signal,
gpstracking_device_tracks.alarm_status,
gpstracking_device_tracks.gps_status,
gpstracking_device_tracks.vehicle_status,
gpstracking_device_tracks.alarm_over_speed,
gpstracking_device_tracks.other,
EXTRACT(EPOCH FROM current_timestamp - gpstracking_device_tracks.date_time_process)/60 AS realtime
FROM (
gpstracking_devices INNER JOIN (
gpstracking_vehicles INNER JOIN gpstracking_vehicles_devices ON ( gpstracking_vehicles.id = gpstracking_vehicles_devices.vehicle_id AND gpstracking_vehicles_devices.is_joined = TRUE )
) ON ( gpstracking_devices.id = gpstracking_vehicles_devices.device_id AND gpstracking_vehicles_devices.is_joined = TRUE )
) INNER JOIN gpstracking_device_tracks ON ( gpstracking_devices.id = gpstracking_device_tracks.device_id )
WHERE gpstracking_vehicles.registration = arr[i]::VARCHAR
ORDER BY gpstracking_device_tracks.date_time_process DESC
LIMIT 1;
END LOOP;
RETURN;
END;
$func$
LANGUAGE plpgsql VOLATILE SECURITY DEFINER;
配置参数
application_name phpPgAdmin_5.0.4 client
constraint_exclusion on configuration file
DateStyle ISO, MDY session
default_text_search_config pg_catalog.english configuration file
external_pid_file /var/run/postgresql/9.1-main.pid configuration file
lc_messages en_US.UTF-8 configuration file
lc_monetary en_US.UTF-8 configuration file
lc_numeric en_US.UTF-8 configuration file
lc_time en_US.UTF-8 configuration file
log_line_prefix %t configuration file
log_timezone localtime environment variable
max_connections 100 configuration file
max_stack_depth 2MB environment variable
port 5432 configuration file
shared_buffers 24MB configuration file
ssl on configuration file
TimeZone localtime environment variable
unix_socket_directory /var/run/postgresql configuration file
我的第一个慢查询是: p>
My first slow query was:
CREATE OR REPLACE VIEW view_vehicle_devices AS
SELECT
gpstracking_vehicles_devices.id AS id,
gpstracking_devices.id AS device_id,
gpstracking_vehicles.id AS vehicle_id,
gpstracking_drivers.id AS driver_id,
gpstracking_device_protocols.name AS protocol,
gpstracking_vehicles.registration AS plate,
gpstracking_drivers.firstname as first_name,
gpstracking_drivers.lastname as last_name,
gpstracking_devices.imei,
gpstracking_devices.simcard,
gpstracking_device_tracks.date_time_process,
gpstracking_device_tracks.latitude,
gpstracking_device_tracks.longitude,
gpstracking_device_tracks.course,
gpstracking_device_tracks.speed,
gpstracking_device_tracks.mileage,
gpstracking_device_tracks.gps_signal,
gpstracking_device_tracks.gsm_signal,
gpstracking_device_tracks.alarm_status,
gpstracking_device_tracks.gps_status,
gpstracking_device_tracks.vehicle_status,
gpstracking_device_tracks.alarm_over_speed,
gpstracking_device_tracks.other,
gpstracking_device_tracks.point,
EXTRACT(EPOCH FROM current_timestamp - gpstracking_device_tracks.date_time_process)/60 realtime,
gpstracking_devices.created,
gpstracking_devices.updated,
gpstracking_devices.is_connected
FROM (
gpstracking_vehicles LEFT JOIN (
gpstracking_drivers LEFT JOIN gpstracking_vehicles_drivers ON gpstracking_drivers.id = gpstracking_vehicles_drivers.driver_id AND gpstracking_vehicles_drivers.is_joined = TRUE
) ON gpstracking_vehicles.id = gpstracking_vehicles_drivers.vehicle_id AND gpstracking_vehicles_drivers.is_joined = TRUE
) LEFT JOIN (((
gpstracking_device_protocols RIGHT JOIN gpstracking_devices ON gpstracking_device_protocols.id = gpstracking_devices.device_protocol_id
) LEFT JOIN (
SELECT DISTINCT ON (gpstracking_device_tracks.device_id) gpstracking_device_tracks.device_id,
gpstracking_device_tracks.date_time_process,
gpstracking_device_tracks.latitude,
gpstracking_device_tracks.longitude,
gpstracking_device_tracks.course,
gpstracking_device_tracks.speed,
gpstracking_device_tracks.mileage,
gpstracking_device_tracks.gps_signal,
gpstracking_device_tracks.gsm_signal,
gpstracking_device_tracks.alarm_status,
gpstracking_device_tracks.gps_status,
gpstracking_device_tracks.vehicle_status,
gpstracking_device_tracks.alarm_over_speed,
gpstracking_device_tracks.other,
gpstracking_device_tracks.point
FROM gpstracking_device_tracks
ORDER BY gpstracking_device_tracks.device_id, gpstracking_device_tracks.date_time_process DESC
) AS gpstracking_device_tracks ON gpstracking_devices.id = gpstracking_device_tracks.device_id
) LEFT JOIN gpstracking_vehicles_devices ON ( gpstracking_devices.id = gpstracking_vehicles_devices.device_id AND gpstracking_vehicles_devices.is_joined = TRUE )
) ON ( gpstracking_vehicles.id = gpstracking_vehicles_devices.vehicle_id AND gpstracking_vehicles_devices.is_joined = TRUE )
我已经改变了循环开始我的帖子,我的循环更快,但是不够快,因为我需要
I have changed it for the loop that is starting my post, my loop rujns faster, however is not enought faster as I need
推荐答案
您的问题是,策划者无法知道在哪个分区是一个回答您的查询。它只有统计数据。所以你不会受益于一天中的数据分割。
Your problem is that a planner can not know in which partition is an answer to your query. It only has statistics. So you do not benefit from partitioning your data by a day at all.
为了从中受益,您可以修改查询,以便从当天开始查找最新的坐标,如果不从昨天则发现,如果没有之前从日发现等等。我想99天的分区只能在今天的分区中找到。
To benefit from it you can modify your query so it'll look for latest coordinates from current day, if not found then from yesterday, if not found from a day before an so on. I suppose 99% answers would be found in todays partition only.
或者你可以通过分区例如 device_id%256
Or you can partition by for example device_id % 256
instead.
但是更好的是创建一个附加的表,只有几个最近的设备坐标。在 gpstracking_device_tracks
中将会执行(伪代码)维护:
But even better would be to create an additional table with several recent device coordinates only. It would be maintained with a trigger on gpstracking_device_tracks
which will just do (pseudocode):
if random()*64 < 1.0 then
-- statistically once per 64 runs do a cleanup
with todelete as (
-- lock rows in particular order to avoid possible deadlocks
-- if run concurrently
select id from gpstracking_device_tracks_recent where device_id=?
order by id for share;
)
delete from gpstracking_device_tracks_recent
where id in (select id from todelete)
end if;
insert into gpstracking_device_tracks_recent (...) values (...);
然后在这个更小的表格中查找最新的坐标。
And then look for latest coordinates in this much smaller table.
这篇关于在大型Postgresl数据库上按日期获取最后一行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!