Efficient query to get the greatest value per group from a big table

Given the table:

    Column    |            Type             
 id           | integer                     
 latitude     | numeric(9,6)                
 longitude    | numeric(9,6)                
 speed        | integer                     
 equipment_id | integer                     
 created_at   | timestamp without time zone
Indexes:
    "geoposition_records_pkey" PRIMARY KEY, btree (id)

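The table has 20 million records, which is not a particularly large number, but it is enough to make sequential scans slow.

How do I get the last record (max(created_at)) for each equipment_id?

I have tried both of the following queries, along with several variants taken from the many answers I read on this topic: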

select max(created_at),equipment_id from geoposition_records group by equipment_id;

select distinct on (equipment_id) equipment_id,created_at 
  from geoposition_records order by equipment_id, created_at desc;

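I also tried creating a btree index on (equipment_id, created_at), but Postgres decides that a seqscan is faster. Forcing enable_seqscan = off does not help either, because reading the index is as slow as the seq scan.

The query has to run periodically, always returning the latest record.

Using Postgres 9.3.

Explain/analyze (with 1.7 million records):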

set enable_seqscan=true;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"HashAggregate  (cost=47803.77..47804.34 rows=57 width=12) (actual time=1935.536..1935.556 rows=58 loops=1)"
"  ->  Seq Scan on geoposition_records  (cost=0.00..39544.51 rows=1651851 width=12) (actual time=0.029..494.296 rows=1651851 loops=1)"
"Total runtime: 1935.632 ms"

set enable_seqscan=false;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"GroupAggregate  (cost=0.00..2995933.57 rows=57 width=12) (actual time=222.034..11305.073 rows=58 loops=1)"
"  ->  Index Scan using geoposition_records_equipment_id_created_at_idx on geoposition_records  (cost=0.00..2987673.75 rows=1651851 width=12) (actual time=0.062..10248.703 rows=1651851 loops=1)"
"Total runtime: 11305.161 ms"

— Feyd

Well, the last time I checked there were no NULL values in equipment_id; the expected share is below 0.1%.

— Feyd

A plain multicolumn b-tree index should work after all:

CREATE INDEX foo_idx
ON geoposition_records (equipment_id, created_at DESC NULLS LAST);

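Why DESC NULLS LAST?

Unused index in range of dates query

Function

If you can't talk sense into the query planner, a function that loops through the equipment table should do the trick. Looking up one equipment_id at a time uses the index. For a small number of them (57, judging from your EXPLAIN ANALYZE output) that is fast. Is it safe to assume you have an equipment table?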

CREATE OR REPLACE FUNCTION f_latest_equip()
  RETURNS TABLE (equipment_id int, latest timestamp) AS
$func$
BEGIN
FOR equipment_id IN
   SELECT e.equipment_id FROM equipment e ORDER BY 1
LOOP
   SELECT g.created_at
   FROM   geoposition_records g
   WHERE  g.equipment_id = f_latest_equip.equipment_id
                           -- prepend function name to disambiguate
   ORDER  BY g.created_at DESC NULLS LAST
   LIMIT  1
   INTO   latest;

   RETURN NEXT;
END LOOP;
END  
$func$  LANGUAGE plpgsql STABLE;

It makes for a nice call, too:

SELECT * FROM f_latest_equip();

Correlated subqueries

Come to think of it, with this equipment table you could let lowly correlated subqueries do the dirty work, to great effect:

SELECT equipment_id
     ,(SELECT created_at
       FROM   geoposition_records
       WHERE  equipment_id = eq.equipment_id
       ORDER  BY created_at DESC NULLS LAST
       LIMIT  1) AS latest
FROM   equipment eq;

Performance is very good.
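To make the behaviour of the correlated subquery concrete, here is a minimal sketch that runs it next to the plain GROUP BY query. It is only an illustration under loud assumptions: SQLite (via Python's sqlite3) stands in for PostgreSQL, the sample rows and the index name geo_equip_created_idx are made up, and NULLS LAST is left out because SQLite already sorts NULLs last under DESC. It says nothing about timings on a 20-million-row table.

```python
import sqlite3

# Assumption: SQLite is a stand-in for PostgreSQL; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE equipment (equipment_id INTEGER PRIMARY KEY);
    CREATE TABLE geoposition_records (
        id           INTEGER PRIMARY KEY,
        equipment_id INTEGER,
        created_at   TEXT
    );
    -- Hypothetical index name; same column order as the index in the answer.
    CREATE INDEX geo_equip_created_idx
        ON geoposition_records (equipment_id, created_at DESC);
    INSERT INTO equipment VALUES (1), (2), (3);
    INSERT INTO geoposition_records (equipment_id, created_at) VALUES
        (1, '2014-01-01 10:00:00'),
        (1, '2014-01-02 09:30:00'),
        (2, '2014-01-01 08:00:00');
""")

# Aggregate over the whole table: one row per equipment_id that has records.
grouped = conn.execute("""
    SELECT equipment_id, max(created_at)
    FROM   geoposition_records
    GROUP  BY equipment_id
    ORDER  BY equipment_id
""").fetchall()

# One small lookup per row of equipment, each of which can use the index.
correlated = conn.execute("""
    SELECT eq.equipment_id,
           (SELECT g.created_at
            FROM   geoposition_records g
            WHERE  g.equipment_id = eq.equipment_id
            ORDER  BY g.created_at DESC
            LIMIT  1) AS latest
    FROM   equipment eq
    ORDER  BY eq.equipment_id
""").fetchall()

print(grouped)
print(correlated)
```

One difference worth noticing: the correlated form returns a row for every entry in equipment, with an empty (NULL) timestamp for equipment 3, which has no position records, while GROUP BY only reports ids that actually occur in geoposition_records.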

`LATERAL` join in Postgres 9.3+

SELECT eq.equipment_id, r.latest
FROM   equipment eq
LEFT   JOIN LATERAL (
   SELECT created_at
   FROM   geoposition_records
   WHERE  equipment_id = eq.equipment_id
   ORDER  BY created_at DESC NULLS LAST
   LIMIT  1
   ) r(latest) ON true;

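Detailed explanation:

Optimize GROUP BY query to retrieve latest record per user

Similar performance to the correlated subquery. A performance comparison of max(), DISTINCT ON, the function, the correlated subquery and LATERAL is in this:

SQL Fiddle.

— Erwin Brandstetter

@ErwinBrandstetter This is something I tried after Colin's answer, but I can't stop thinking that it is a workaround using a kind of database-side n+1 queries (without the connection overhead)... I wonder why GROUP BY exists at all if it can't handle a few million records properly. That doesn't make sense; we must be missing something. Finally, the question has changed slightly and we are now assuming that an equipment table exists. I would like to know whether there really is another way.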

— Feyd

Attempt 1

If

I have a separate equipment table, and
I have an index on geoposition_records(equipment_id, created_at desc)

then the following works for me:

select id as equipment_id, (select max(created_at)
                            from geoposition_records
                            where equipment_id = equipment.id
                           ) as max_created_at
from equipment;

I wasn't able to force PG to run a fast query that determines both the list of equipment_ids and the related max(created_at). But I'm going to try again tomorrow!

Attempt 2

I found this link: http://zogovic.com/post/44856908222/optimizing-postgresql-query-for-distinct-values Combining this technique with my query from attempt 1, I get:

WITH RECURSIVE equipment(id) AS (
    SELECT MIN(equipment_id) FROM geoposition_records
  UNION
    SELECT (
      SELECT equipment_id
      FROM geoposition_records
      WHERE equipment_id > equipment.id
      ORDER BY equipment_id
      LIMIT 1
    )
    FROM equipment WHERE id IS NOT NULL
)
SELECT id AS equipment_id, (SELECT MAX(created_at)
                            FROM geoposition_records
                            WHERE equipment_id = equipment.id
                           ) AS max_created_at
FROM equipment;

And this works FAST! But you need

this ultra-contorted query form, and
an index on geoposition_records(equipment_id, created_at desc).
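For readers who want to see what the recursive CTE actually produces, here is a small runnable sketch of the same idea. Treat it as an illustration under stated assumptions rather than a benchmark: SQLite (through Python's sqlite3) stands in for PostgreSQL, the sample rows and the index name geo_equip_created_idx are made up, and the outer query adds a WHERE id IS NOT NULL filter to drop the trailing NULL row that the recursion emits once no larger equipment_id remains.

```python
import sqlite3

# Assumption: SQLite is a stand-in for PostgreSQL; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE geoposition_records (
        id           INTEGER PRIMARY KEY,
        equipment_id INTEGER,
        created_at   TEXT
    );
    -- Hypothetical index name; same column order as required by the answer.
    CREATE INDEX geo_equip_created_idx
        ON geoposition_records (equipment_id, created_at DESC);
    INSERT INTO geoposition_records (equipment_id, created_at) VALUES
        (7,  '2014-03-01 12:00:00'),
        (7,  '2014-03-05 08:15:00'),
        (3,  '2014-02-10 00:00:00'),
        (42, '2014-01-20 23:59:59');
""")

rows = conn.execute("""
    WITH RECURSIVE equipment(id) AS (
        -- Start with the smallest equipment_id ...
        SELECT min(equipment_id) FROM geoposition_records
      UNION
        -- ... then repeatedly hop to the next larger one via the index.
        SELECT (SELECT equipment_id
                FROM   geoposition_records
                WHERE  equipment_id > equipment.id
                ORDER  BY equipment_id
                LIMIT  1)
        FROM   equipment
        WHERE  id IS NOT NULL
    )
    SELECT id AS equipment_id,
           (SELECT max(g.created_at)
            FROM   geoposition_records g
            WHERE  g.equipment_id = equipment.id) AS max_created_at
    FROM   equipment
    WHERE  id IS NOT NULL      -- drop the terminating NULL row
    ORDER  BY id
""").fetchall()

for row in rows:
    print(row)
```

The CTE derives the list of distinct equipment_ids from geoposition_records itself, one index hop per id, so no separate equipment table is needed; the outer query then does one max(created_at) lookup per id, as in attempt 1.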

— Colin Hart
