digoal
欢迎数据库应用开发者参与贡献场景, 在此issue回复即可, 共同建设《沉浸式数据库学习教学素材库》, 帮助开发者用好数据库, 提升开发者职业竞争力, 同时为企业降本提效.
本文的实验可以使用永久免费的阿里云云起实验室来完成.
如果你本地有docker环境也可以把镜像拉到本地来做实验:
x86_64机器使用以下docker image:
ARM机器使用以下docker image:
租用共享单车, 从开始租用到还车, 这段用户骑行的过程就是一段轨迹, 例如每5秒记录一个点(x,y,z).
很多小程序或健康APP通过手机或手环的定位来记录轨迹.
专业的驴友, 通常都有非常专业的带GPS的手表也会记录徒步的轨迹.
网约车就更不用说了, 行业要求必须记录每一单的行驶信息, 其中也包含了轨迹数据.
当用户量大了之后, 查询轨迹、分析轨迹可能是一个比较高频的操作.
这个实验的目的是帮助你了解如何设计轨迹表, 如何高性能的写入、查询、分析轨迹数据.
轨迹的生成的特点:
基于这两个特点, 如果每生成1条轨迹就往数据库中写入一条记录, 会导致一个轨迹若干个点的物理存储是非常离散的.
即使一次写入一个轨迹的多条记录, 也可能出现一个轨迹若干个点的物理存储是非常离散的情况.
轨迹中的每个点存储一条记录, 最终拼成一条轨迹.
设计2个表:
1、存储每个轨迹的唯一轨迹 ID
drop table if exists tbl_path_id;
create table tbl_path_id (
id int primary key, -- 轨迹ID
uid int, -- 这段轨迹对应的用户
ts_begin timestamp, -- 轨迹开始时间
ts_end timestamp, -- 轨迹结束时间
pos_begin point, -- 为了简化例子, 使用PG|PolarDB内置的平面点. 真实场景通常是用PostGIS插件的geometry类型, 包括经纬度和海平面高度.
pos_end point -- 轨迹结束位置
);
2、存储每个对象的一段运动轨迹(例如单车骑行的一单轨迹、徒步旅游开启的轨迹记录、...)
drop table if exists tbl_path_detail;
create unlogged table tbl_path_detail ( -- 方便测试, 使用unlogged table.
pid int, -- 轨迹ID
pos point, -- 位置
ts timestamp -- 时间
);
3、创建轨迹索引, 按轨迹ID查询, 按时间顺序返回轨迹的所有点.
create index on tbl_path_detail (pid,ts);
4、为了方便生成测试数据, 先模拟写入1000条轨迹, 平均每条轨迹1万个点, 形成1000万条轨迹点, 再统计到轨迹唯一标识表.
vi t1.sql
\set pid random(1,1000)
insert into tbl_path_detail values (:pid, point(random(), random()), now());
pgbench -M prepared -n -r -P 1 -f ./t1.sql -c 10 -j 10 -t 1000000
transaction type: ./t1.sql
scaling factor: 1
query mode: prepared
number of clients: 10
number of threads: 10
number of transactions per client: 1000000
number of transactions actually processed: 10000000/10000000
latency average = 0.053 ms
latency stddev = 0.102 ms
initial connection time = 23.060 ms
tps = 189004.207706 (without initial connection time)
statement latencies in milliseconds:
0.000 \set pid random(1,1000)
0.052 insert into tbl_path_detail values (:pid, point(random(), random()), now());
postgres=# select * from tbl_path_detail limit 10;
pid | pos | ts
-----+------------------------------------------+----------------------------
160 | (0.5000283130032592,0.15678314824874917) | 2023-09-08 02:05:15.442562
43 | (0.7333981781385361,0.6660386524481545) | 2023-09-08 02:05:15.442563
602 | (0.5665795071743318,0.7419887321824241) | 2023-09-08 02:05:15.44323
738 | (0.4883731035056087,0.5410191566444489) | 2023-09-08 02:05:15.444977
925 | (0.47646978937058293,0.3324764895717003) | 2023-09-08 02:05:15.445011
934 | (0.8969771676532297,0.11898616861406452) | 2023-09-08 02:05:15.445049
748 | (0.0461507467234199,0.3288353473172556) | 2023-09-08 02:05:15.445128
330 | (0.05559804559846171,0.7381967670381471) | 2023-09-08 02:05:15.445196
978 | (0.21187515313173577,0.6163791267053291) | 2023-09-08 02:05:15.445191
626 | (0.7069111074590744,0.4406223155616047) | 2023-09-08 02:05:15.445312
(10 rows)
5、将轨迹点汇总到轨迹唯一标识表
insert into tbl_path_id (id,uid,ts_begin,ts_end,pos_begin,pos_end)
select pid, random()*100, ts_begin, ts_end, pos_begin, pos_end from
(select pid,
row_number() over w as rn,
first_value(ts) over w as ts_begin,
last_value(ts) over w as ts_end,
first_value(pos) over w as pos_begin,
last_value(pos) over w as pos_end
from tbl_path_detail
window w as (partition by pid order by ts RANGE BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) ) t
where rn=1;
postgres=# select * from tbl_path_id limit 10;
id | uid | ts_begin | ts_end | pos_begin | pos_end
----+-----+----------------------------+----------------------------+-------------------------------------------+-------------------------------------------
1 | 2 | 2023-09-08 02:05:15.458977 | 2023-09-08 02:06:08.348456 | (0.9431799157631602,0.41223770583725994) | (0.43968362954398543,0.8294152062281377)
2 | 15 | 2023-09-08 02:05:15.455308 | 2023-09-08 02:06:08.348047 | (0.20157767943472393,0.7583908472256411) | (0.8633428320677936,0.020082269279203757)
3 | 60 | 2023-09-08 02:05:15.450998 | 2023-09-08 02:06:08.273702 | (0.8676802874185334,0.9332642681167798) | (0.7358999901381367,0.8543637573444975)
4 | 28 | 2023-09-08 02:05:15.455818 | 2023-09-08 02:06:08.329148 | (0.5804204502817427,0.31601818657295055) | (0.0174753704646875,0.18107497337433998)
5 | 97 | 2023-09-08 02:05:15.458717 | 2023-09-08 02:06:08.3295 | (0.24820536688948636,0.802131631221048) | (0.8229560445619342,0.5435402893223191)
6 | 33 | 2023-09-08 02:05:15.450005 | 2023-09-08 02:06:08.312634 | (0.527644965371401,0.12866950317216563) | (0.697993640371152,0.6356491117241596)
7 | 3 | 2023-09-08 02:05:15.462201 | 2023-09-08 02:06:08.345394 | (0.23643719560948284,0.6980441144259366) | (0.7981387104508499,0.4564650368837313)
8 | 38 | 2023-09-08 02:05:15.457057 | 2023-09-08 02:06:08.275626 | (0.17698975732867694,0.33573573990590333) | (0.6194784648490028,0.8182679927624612)
9 | 18 | 2023-09-08 02:05:15.454722 | 2023-09-08 02:06:08.350231 | (0.4388676699772063,0.7529218330600749) | (0.2080260827608882,0.04568571675149613)
10 | 40 | 2023-09-08 02:05:15.452294 | 2023-09-08 02:06:08.326824 | (0.6334737638364523,0.07649313466173169) | (0.8915125815199474,0.8221032349776571)
(10 rows)
6、压测查询轨迹, 按轨迹ID查询, 按时间顺序返回轨迹的所有点.
vi t2.sql
\set pid random(1,1000)
select * from tbl_path_detail where pid=:pid order by ts;
pgbench -M prepared -n -r -P 1 -f ./t2.sql -c 10 -j 10 -T 120
transaction type: ./t2.sql
scaling factor: 1
query mode: prepared
number of clients: 10
number of threads: 10
duration: 120 s
number of transactions actually processed: 15979
latency average = 75.097 ms
latency stddev = 9.750 ms
initial connection time = 20.363 ms
tps = 133.126750 (without initial connection time)
statement latencies in milliseconds:
0.001 \set pid random(1,1000)
75.097 select * from tbl_path_detail where pid=:pid order by ts;
在传统方法中, 平均一个轨迹有1万个点, 由于不同的轨迹点是并行写入的, 所以这1万个点非常可能分散在1万个数据块中, 导致查询1条轨迹要访问很多的数据块. 使得性能瓶颈在IO层面(如下这条轨迹消耗9538个IO).
postgres=# explain (analyze,verbose,timing,buffers,costs,settings,wal) select * from tbl_path_detail where pid=1 order by ts;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using tbl_path_detail_pid_ts_idx on public.tbl_path_detail (cost=0.43..10473.93 rows=9955 width=28) (actual time=0.060..24.022 rows=9937 loops=1)
Output: pid, pos, ts
Index Cond: (tbl_path_detail.pid = 1)
Buffers: shared hit=9538
Settings: enable_seqscan = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1'
Planning Time: 0.183 ms
Execution Time: 25.039 ms
(7 rows)
PG|PolarDB支持include index, 为了解决这个问题, 可以将要查询的字段放入索引的叶子结点内, 避免回表导致的IO.
1、创建include index, 把pos放入索引叶子结点:
create index on tbl_path_detail (pid,ts) INCLUDE (pos);
vacuum analyze tbl_path_detail;
2、此时使用新的index查询, 不需要回表, IO大幅度降低, 如下这条轨迹消耗66个IO:
postgres=# explain (analyze,verbose,timing,buffers,costs,settings,wal) select * from tbl_path_detail where pid=1 order by ts;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using tbl_path_detail_pid_ts_pos_idx on public.tbl_path_detail (cost=0.56..241.93 rows=9958 width=28) (actual time=0.082..4.217 rows=9937 loops=1)
Output: pid, pos, ts
Index Cond: (tbl_path_detail.pid = 1)
Heap Fetches: 0
Buffers: shared hit=66
Settings: enable_seqscan = 'off', max_parallel_workers_per_gather = '0', random_page_cost = '1.1'
Planning Time: 0.536 ms
Execution Time: 5.598 ms
(8 rows)
3、压测查询轨迹, 按轨迹ID查询, 按时间顺序返回轨迹的所有点.
vi t2.sql
\set pid random(1,1000)
select * from tbl_path_detail where pid=:pid order by ts;
pgbench -M prepared -n -r -P 1 -f ./t2.sql -c 10 -j 10 -T 120
transaction type: ./t2.sql
scaling factor: 1
query mode: prepared
number of clients: 10
number of threads: 10
duration: 120 s
number of transactions actually processed: 113515
latency average = 10.570 ms
latency stddev = 2.942 ms
initial connection time = 18.558 ms
tps = 945.998467 (without initial connection time)
statement latencies in milliseconds:
0.000 \set pid random(1,1000)
10.570 select * from tbl_path_detail where pid=:pid order by ts;
虽然使用了include index, 但是索引里面就要放所有要查的字段内容, 导致索引占用空间会变大一点.
为了更好的优化这个能力, 可以使用聚集字段来存储一条轨迹的内容, 例如array.
1、创建一张轨迹表, 使用array存储轨迹中每个点的时间、位置.
drop table if exists tbl_path_detail1;
create unlogged table tbl_path_detail1 ( -- 方便测试, 使用unlogged table.
pid int, -- 轨迹ID
pos point[], -- 位置数组
ts timestamp[] -- 时间数组
);
create index on tbl_path_detail1 (pid);
2、将原始轨迹写入新的轨迹表.
insert into tbl_path_detail1 select pid, array_agg(pos order by ts), array_agg(ts order by ts) from tbl_path_detail group by pid;
postgres=# select pid, pos[1:10], ts[1:10] from tbl_path_detail1 where pid=1;
-[ RECORD 1 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pid | 1
pos | {"(0.9431799157631602,0.41223770583725994)","(0.029285214711183727,0.13121257852994006)","(0.47231141497136164,0.40519912875600994)","(0.18841309050989707,0.9120792659923715)","(0.1669935448355666,0.788043626909726)","(0.7680464170226315,0.02176761079732259)","(0.1806378616460016,0.17626971610524578)","(0.6905476714298793,0.21249186817883725)","(0.20480152830909404,0.36856166073248176)","(0.37764392812544045,0.5214349561346836)"}
ts | {"2023-09-08 02:05:15.458977","2023-09-08 02:05:15.460392","2023-09-08 02:05:15.46647","2023-09-08 02:05:15.466755","2023-09-08 02:05:15.46993","2023-09-08 02:05:15.470127","2023-09-08 02:05:15.473513","2023-09-08 02:05:15.474446","2023-09-08 02:05:15.480445","2023-09-08 02:05:15.484117"}
应用层面, 未来在写入轨迹时也可以使用这个方法, 等轨迹完成一次性上传并写入新的轨迹表, 老的轨迹表就不需要了.
3、压测查询轨迹, 按轨迹ID查询, 按时间顺序返回轨迹的所有点.
vi t3.sql
\set pid random(1,1000)
select * from tbl_path_detail1 where pid=:pid;
pgbench -M prepared -n -r -P 1 -f ./t3.sql -c 10 -j 10 -T 120
transaction type: ./t3.sql
scaling factor: 1
query mode: prepared
number of clients: 10
number of threads: 10
duration: 120 s
number of transactions actually processed: 65173
latency average = 18.410 ms
latency stddev = 5.716 ms
initial connection time = 23.213 ms
tps = 543.135449 (without initial connection time)
statement latencies in milliseconds:
0.000 \set pid random(1,1000)
18.410 select * from tbl_path_detail1 where pid=:pid;
或者也可以使用text存储轨迹点.
4、创建一张轨迹表, 使用array存储轨迹中每个点的时间、位置.
drop table if exists tbl_path_detail2;
create unlogged table tbl_path_detail2 ( -- 方便测试, 使用unlogged table.
pid int, -- 轨迹ID
pos text, -- 位置字符串
ts text -- 时间字符串
);
create index on tbl_path_detail2 (pid);
5、将原始轨迹写入新的轨迹表.
insert into tbl_path_detail2 select pid, string_agg(pos::text, ',' order by ts), string_agg(ts::text, ',' order by ts) from tbl_path_detail group by pid;
应用层面, 未来在写入轨迹时也可以使用这个方法, 等轨迹完成一次性上传并写入新的轨迹表, 老的轨迹表就不需要了.
6、压测查询轨迹, 按轨迹ID查询, 按时间顺序返回轨迹的所有点.
vi t4.sql
\set pid random(1,1000)
select * from tbl_path_detail2 where pid=:pid;
pgbench -M prepared -n -r -P 1 -f ./t4.sql -c 10 -j 10 -T 120
transaction type: ./t4.sql
scaling factor: 1
query mode: prepared
number of clients: 10
number of threads: 10
duration: 120 s
number of transactions actually processed: 246670
latency average = 4.864 ms
latency stddev = 1.974 ms
initial connection time = 21.508 ms
tps = 2055.793315 (without initial connection time)
statement latencies in milliseconds:
0.000 \set pid random(1,1000)
4.864 select * from tbl_path_detail2 where pid=:pid;
对比轨迹表的空间和索引占用
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+------------------+-------+----------+-------------+---------------+---------+-------------
public | tbl_path_detail | table | postgres | unlogged | heap | 575 MB |
public | tbl_path_detail1 | table | postgres | unlogged | heap | 202 MB |
public | tbl_path_detail2 | table | postgres | unlogged | heap | 346 MB |
postgres=# \di+
List of relations
Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description
--------+--------------------------------+-------+----------+------------------+-------------+---------------+---------+-------------
public | tbl_path_detail1_pid_idx | index | postgres | tbl_path_detail1 | unlogged | btree | 48 kB |
public | tbl_path_detail2_pid_idx | index | postgres | tbl_path_detail2 | unlogged | btree | 48 kB |
public | tbl_path_detail_pid_ts_idx | index | postgres | tbl_path_detail | unlogged | btree | 301 MB |
public | tbl_path_detail_pid_ts_pos_idx | index | postgres | tbl_path_detail | unlogged | btree | 473 MB |
聚集类型既能节省轨迹存储、索引空间, 又能提升轨迹查询的性能.
思考: 如果要按时间范围、空间范围搜索用户过去的轨迹, 然后再查询对应的轨迹点, 应该如何设计表结构? 如何使用索引?
1、空间对比
轨迹存储方式 | 轨迹占用空间 | 轨迹索引占用空间 |
---|---|---|
传统存储(每个点一条记录,索引不包含位置) | 575 MB | 301 MB |
传统存储(每个点一条记录,索引包含位置) | 575 MB | 473 MB |
数组存储轨迹 | 202 MB | 48 kB |
字符串存储轨迹 | 346 MB | 48 kB |
2、性能对比
轨迹存储方式 | 轨迹查询QPS |
---|---|
传统存储(每个点一条记录,索引不包含位置) | 133 |
传统存储(每个点一条记录,索引包含位置) | 946 |
数组存储轨迹 | 543 |
字符串存储轨迹 | 2056 |
array
include index
window function
轨迹分析应用: 轨迹碰撞, 例如刑侦场景
如果按用户的时间范围查询轨迹, 使用什么数据类型、索引性能更好? range type + gist index ?
如何按用户的空间范围查询轨迹, 使用什么数据类型、索引性能更好? range type + gist index ?
如果一次不想取出所有的点, 如何从字符串存储的轨迹点中获取某个时间对应的点, 或者某一段点?
此处可能存在不合适展示的内容,页面不予展示。您可通过相关编辑功能自查并修改。
如您确认内容无涉及 不当用语 / 纯广告导流 / 暴力 / 低俗色情 / 侵权 / 盗版 / 虚假 / 无价值内容或违法国家有关法律法规的内容,可点击提交进行申诉,我们将尽快为您处理。