
xuejianxinokok/mysqldoc

17_Statistics.md 49.85 KB
xuejianxinokok 提交于 4个月前 . 333

17 Statistics

17.1 Basic Statistics

Basic relation-level statistics are stored in the pg_class table of the system catalog and include the following data:
基本关系级别统计信息存储在系统目录的pg_class表中,包括以下数据:

  • number of tuples in a relation (reltuples)
    关系中元组的数量 (reltuples)
  • relation size, in pages (relpages)
    关系大小,以页为单位 (relpages),一个表目前有多少数据页
  • number of pages tagged in the visibility map (relallvisible)
    可见性映射中标记的页面数 (relallvisible)

Here are these values for the flights table:
以下是flights表的这些值:

=>  SELECT reltuples, relpages, relallvisible
FROM  pg_class WHERE  relname = 'flights';
 reltuples | relpages | relallvisible
-----------+----------+---------------
    214867 |     2624 |          2624
(1 row)

If the query does not impose any filter conditions, the reltuples value serves as the cardinality estimation:
如果查询没有施加任何过滤条件,则 reltuples 值将用作基数估计:

=>  EXPLAIN SELECT  * FROM flights;

                             QUERY PLAN
----------------------------------------------------------------
 Seq Scan on flights    (cost=0.00..4772.67 rows=214867 width=63)
(1 row)

-- width=63 每行平均宽度为 63 字节

Statistics are collected during table analysis, both manual and automatic. Furthermore, since basic statistics are of paramount importance, this data is calculated during some other operations as well (VACUUM FULL and CLUSTER, CREATE INDEX and REINDEX) and is refined during vacuuming.
统计数据是在 table analysis 期间收集的,包括手动和自动。此外,由于基本统计数据至关重要,因此这些数据也会在其他一些操作(VACUUM FULL 和 CLUSTER、CREATE INDEX 和 REINDEX)期间进行计算,并在vacuum期间进行细化。

For analysis purposes, 300 × default_statistics_target random rows are sampled. The sample size required to build statistics of a particular accuracy has low dependency on the volume of analyzed data, so the size of the table is not taken into account.
出于分析目的,对 300×default_statistics_target 随机行 进行采样。构建特定精度的统计数据所需的样本大小对分析数据量的依赖性较低,因此不考虑表的大小。
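As a rough sketch, this sample-size rule can be written out in a few lines (the function name is ours; 300 is the multiplier hardcoded in PostgreSQL's ANALYZE code):
作为一个粗略的示意,这个样本大小规则可以写成几行代码(函数名为自拟;300 是 PostgreSQL 的 ANALYZE 代码中硬编码的乘数):

```python
# ANALYZE samples 300 rows per unit of the statistics target;
# the multiplier 300 is hardcoded in PostgreSQL (std_typanalyze).
def analyze_sample_size(default_statistics_target=100):
    return 300 * default_statistics_target

print(analyze_sample_size())      # 30000 rows with the default target of 100
print(analyze_sample_size(1000))  # 300000 rows
```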

Sampled rows are picked from the same number (300 × default_statistics_target) of random pages. Obviously, if the table itself is smaller, fewer pages may be read, and fewer rows will be selected for analysis.
采样行是从相同数量 (300 × default_statistics_target) 的随机页中选取的。 显然,如果表本身较小,则可能会读取较少的页,并且会选择较少的行进行分析。

In large tables, statistics collection does not include all the rows, so estimations can diverge from actual values. It is perfectly normal: if the data is changing, statistics cannot be accurate all the time anyway. Accuracy of up to an order of magnitude is usually enough to choose an adequate plan.
在大型表中,统计信息收集不包括所有行,因此估计值可能与实际值有所不同。这是完全正常的:如果数据在变化,统计数据无论如何也不可能始终准确。高达一个数量级的准确性通常足以选择适当的计划。

Let’s create a copy of the flights table with autovacuum disabled, so that we can control the autoanalysis start time:
让我们创建一个禁用 autovacuum 的flights表副本,以便我们可以控制自动分析开始时间:

 => CREATE TABLE  flights_copy(LIKE  flights)
 WITH (autovacuum_enabled =  false);

There are no statistics for the new table yet:
新表还没有统计信息:

 => SELECT reltuples, relpages, relallvisible
 FROM pg_class  WHERE relname = 'flights_copy';
  reltuples | relpages | relallvisible
 -----------+----------+------------------
         -1 |        0 |            0
 (1 row)

The value reltuples = -1 is used to differentiate between a table that has not been analyzed yet and a really empty table without any rows.
值 reltuples = -1 用于区分尚未分析的表和没有任何行的真正 空表。

It is highly likely that some rows will get inserted into the table right after its creation. So being unaware of the current state of things, the planner assumes that the table contains 10 pages:
很可能某些行会在表创建后立即插入到表中。因此,由于不知道当前的状态,规划器 假设该表包含 10 页:

=>  EXPLAIN SELECT  * FROM flights_copy;

                             QUERY PLAN
------------------------------------------------------
 Seq Scan on flights_copy    (cost=0.00..14.10 rows=410 width=170)
(1 row)

The number of rows is estimated based on the size of a single row, which is shown in the plan as width. Row width is typically an average value calculated during analysis, but since no statistics have been collected yet, here it is just an approximation based on the column data types.
行数是根据单行的大小估算的,在计划中显示为 width。行宽通常是分析过程中计算出的平均值,但由于尚未收集统计数据,因此这里它只是基于列数据类型的近似值。

Now let’s copy the data from the flights table and perform the analysis:
现在让我们从 flights 表中复制数据并进行分析:

=>  INSERT INTO flights_copy  SELECT  * FROM flights;
INSERT 0 214867
=>  ANALYZE flights_copy;

The collected statistics reflects the actual number of rows (the table size is small enough for the analyzer to gather statistics on all the data):
收集的统计信息反映了实际的行数(表比较小,分析器可以收集所有数据的统计信息):

=>  SELECT reltuples, relpages, relallvisible
FROM  pg_class WHERE  relname = 'flights_copy';
 reltuples | relpages | relallvisible
-----------+----------+---------------
    214867 |     2624 |             0
(1 row)

The relallvisible value is used to estimate the cost of an index-only scan. This value is updated by VACUUM:
relallvisible 值用于估计仅索引扫描的成本。该值由 VACUUM 更新:

=>  VACUUM flights_copy;

=>  SELECT relallvisible FROM pg_class WHERE relname = 'flights_copy';
 relallvisible
---------------
          2624
(1 row)

Now let’s double the number of rows without updating statistics and check the cardinality estimation in the query plan:
现在,让我们在不更新统计信息的情况下将行数加倍,并检查查询计划中的 基数rows估计:

=>  INSERT INTO    flights_copy   SELECT   * FROM  flights;
=>  SELECT   count(*)  FROM   flights_copy;
  count
--------------------
  429734
(1 row)
=>  EXPLAIN SELECT    *  FROM  flights_copy;
                                    QUERY PLAN
--------------------------------------------------------------------

  Seq Scan on flights_copy     (cost=0.00..9545.34 rows=429734 width=63)
(1 row)

Despite the outdated pg_class data, the estimation turns out to be accurate:
尽管 pg_class 数据已经过时,但估计结果是准确的:

=>  SELECT   reltuples, relpages
FROM  pg_class   WHERE   relname = 'flights_copy';
  reltuples | relpages
------------+--------------------
     214867 |        2624
(1 row)

The thing is that if the planner sees a gap between relpages and the actual file size, it can scale the reltuples value to improve estimation accuracy. Since the file size has doubled as compared to relpages, the planner adjusts the estimated number of rows, assuming that data density remains the same:
原因在于,如果规划器发现 relpages 和实际文件大小之间存在差距,它可以缩放 reltuples 值以提高估计准确性。由于文件大小与 relpages 相比增加了一倍,因此规划器会调整估计的行数,假设数据密度保持不变:

=>  SELECT   reltuples *
   (pg_relation_size('flights_copy') / 8192) / relpages    AS  tuples
FROM  pg_class   WHERE   relname = 'flights_copy';
 tuples
--------------------
 429734
(1 row)

Naturally, such an adjustment may not always work (for example, if we delete some rows, the estimation will remain the same), but in some cases it allows the planner to hold on until significant changes trigger the next analysis run.
当然,这样的调整可能并不总是有效(例如,如果我们删除一些行,估计将保持不变),但在某些情况下,它允许 规划器 保持原样,直到重大变化触发下一次分析运行。
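This scaling rule can be sketched as follows (the function name is ours; the logic mirrors the SQL query above):
这个缩放规则可以用如下代码示意(函数名为自拟;逻辑与上面的 SQL 查询相同):

```python
# How the planner scales a stale reltuples value when the actual file
# size differs from the recorded relpages (sketch; names are ours).
def scaled_reltuples(reltuples, relpages, actual_pages):
    if relpages == 0:
        return reltuples            # nothing to scale against
    # Assume tuple density (rows per page) has not changed.
    return reltuples * actual_pages / relpages

# The flights_copy example: the file doubled from 2624 to 5248 pages.
print(round(scaled_reltuples(214867, 2624, 5248)))  # 429734
```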

17.2 NULL Values

Frowned upon by theoreticians, NULL values still play an important role in relational databases: they provide a convenient way to reflect the fact that a value is either unknown or does not exist.
虽然理论家们不赞成,但 NULL 值在关系数据库中仍然发挥着重要作用:它们提供了一种方便的方法来反映值未知或不存在的事实。

However, a special value demands special treatment. Apart from theoretical inconsistencies, there are also multiple practical challenges that have to be taken into account. Regular Boolean logic is replaced by the three-valued one, so NOT IN behaves unexpectedly. It is unclear whether NULL values should be treated as greater than or less than regular values (hence the NULLS FIRST and NULLS LAST clauses for sorting). It is not quite obvious whether NULL values must be taken into account by aggregate functions. Strictly speaking, NULL values are not values at all, so the planner requires additional information to process them.
然而,特殊的价值需要特殊的对待。除了理论上的不一致之外,还必须考虑多种实际挑战。常规布尔逻辑被三值逻辑取代,因此 NOT IN 的行为会出乎意料。目前尚不清楚 NULL 值是否应被视为大于或小于常规值(因此使用 NULLS FIRST 和 NULLS LAST 子句进行排序)。聚合函数是否必须考虑 NULL 值并不十分明显。严格来说,NULL 值根本不是值,因此规划器需要额外的信息来处理它们。

Apart from the simplest basic statistics collected at the relation level, the analyzer also gathers statistics for each column of the relation. This data is stored in the pg_statistic table of the system catalog, but you can also access it via the pg_stats view, which provides this information in a more convenient format.
除了在关系级别收集最简单的基本统计信息之外,分析器还收集关系的 每一列的统计信息。此数据存储在系统目录的 pg_statistic 表中,但您也可以通过 pg_stats 视图访问它,该视图以更方便的格式提供此信息。

The fraction of NULL values belongs to column-level statistics; calculated during the analysis, it is shown as the null_frac attribute.
NULL值的比例属于列级统计;在分析过程中计算,它显示为 null_frac 属性。

For example, when searching for the flights that have not departed yet, we can rely on their departure times being undefined:
例如,当搜索尚未起飞的航班时,我们可以依靠它们的出发时间未定义:

=>  EXPLAIN SELECT  * FROM flights  WHERE actual_departure  IS NULL;
                            QUERY PLAN
 ---------------------------------------------------------------
  Seq Scan on flights   (cost=0.00..4772.67 rows=16702 width=63)
    Filter: (actual_departure IS NULL)
 (2 rows)

To estimate the result, the planner multiplies the total number of rows by the fraction of NULL values:
为了估计结果,规划器将 总行数 乘以 NULL 值的比例:

 => SELECT round(reltuples * s.null_frac)   AS rows
 FROM pg_class
   JOIN pg_stats s  ON s.tablename = relname
 WHERE s.tablename = 'flights'
   AND s.attname = 'actual_departure';

  rows
 --------------------
  16702
 (1 row)

And here is the actual row count:
这是实际的行数:

 => SELECT count(*)  FROM flights  WHERE actual_departure  IS NULL;

  count
 --------------------
  16348
 (1 row)

17.3 Distinct Values

The n_distinct field of the pg_stats view shows the number of distinct values in a column.
pg_stats 视图的 n_distinct 字段显示列中不同值的数量。

If n_distinct is negative, its absolute value denotes the fraction of distinct values in a column rather than their actual count. For example, -1 indicates that all column values are unique, while -0.33 means that each value appears in three rows on average. The analyzer uses fractions if the estimated number of distinct values exceeds 10% of the total row count; in this case, further data updates are unlikely to change this ratio.
如果 n_distinct 为负数,则其绝对值表示列中不同值的比例,而不是它们的实际数量。例如,-1 表示所有列值都是唯一的,而 -0.33 表示每个值平均出现在三行中。如果不同值的估计数量超过总行数的 10%,则分析器将使用比例;在这种情况下,进一步的数据更新不太可能改变这个比率。
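A minimal sketch of how to decode this field (the helper name is ours):
解码该字段的一个最小示意(辅助函数名为自拟):

```python
# Decoding pg_stats.n_distinct: non-negative values store the count
# directly, negative values store the fraction with the sign flipped.
def distinct_count(n_distinct, reltuples):
    if n_distinct >= 0:
        return n_distinct
    return -n_distinct * reltuples

print(distinct_count(104, 214867))  # stored directly: 104 distinct values
print(distinct_count(-1, 214867))   # all values unique: 214867
print(distinct_count(-0.5, 1000))   # every value appears twice: 500.0
```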

If uniform data distribution is expected, the number of distinct values is used instead. For example, when estimating the cardinality of the “column = expression” condition, the planner assumes that the expression can take any column value with equal probability if its exact value is unknown at the planning stage:
如果期望数据分布均匀,则使用不同值的数量。例如,在估计“列 = 表达式”条件的基数时,规划器假定如果在规划阶段未知其确切值,则表达式可以以相同的概率取任何列值:

=>  EXPLAIN SELECT  *
FROM  flights
WHERE  departure_airport = (
   SELECT airport_code  FROM airports  WHERE city = 'Saint Petersburg'
);
                               QUERY PLAN
-----------------------------------------------------------------------
 Seq Scan on flights    (cost=30.56..5340.40 rows=2066 width=63)
    Filter: (departure_airport = $0)
    InitPlan 1 (returns $0)
      ->  Seq Scan on airports_data ml   (cost=0.00..30.56 rows=1 wi...
          Filter: ((city ->> lang()) = 'Saint Petersburg'::text)
(5 rows)

Here the InitPlan node is executed only once, and the calculated value is used in the main plan.
这里InitPlan节点只执行一次,计算出的值用于主计划中。

 => SELECT round(reltuples / s.n_distinct)   AS  rows
 FROM pg_class
   JOIN pg_stats s  ON s.tablename = relname
 WHERE s.tablename = 'flights'
   AND s.attname = 'departure_airport';

  rows
 --------------------
  2066
 (1 row)

If the estimated number of distinct values is incorrect (because a limited number of rows have been analyzed), it can be overridden at the column level:
如果估计的不同值数量不正确(因为分析了有限数量的行),则可以在列级别覆盖它:

 ALTER TABLE ...
   ALTER COLUMN  ...
   SET (n_distinct = ...);

If all data always had uniform distribution, this information (coupled with minimal and maximal values) would be sufficient. However, for non-uniform distribution (which is much more common in practice), such estimation is inaccurate:
如果所有数据始终具有均匀分布,则此信息(加上最小值和最大值)就足够了。然而,对于非均匀分布(这在实践中更为常见),这种估计是不准确的:

 => SELECT min(cnt), round(avg(cnt)) avg, max(cnt)
 FROM (
   SELECT departure_airport, count(*) cnt
   FROM flights
   GROUP BY departure_airport
 ) t;
  min | avg  |   max
 -----+------+--------------------
  113 | 2066 | 20875
 (1 row)

17.4 Most Common Values

If data distribution is non-uniform, the estimation is fine-tuned based on statistics on most common values (MCV) and their frequencies. The pg_stats view displays these arrays in the most_common_vals and most_common_freqs fields, respectively.
如果数据分布不均匀,则根据最常见值 (MCV) 及其频率的统计数据对估计进行微调。pg_stats 视图分别在 most_common_vals 和 most_common_freqs 字段中显示这些数组。

Here is an example of such statistics on various types of aircraft:
以下是各种类型飞机的此类统计数据的示例:

=>  SELECT  most_common_vals     AS mcv,
  left(most_common_freqs::text,60) || '...'           AS mcf
FROM  pg_stats
WHERE  tablename = 'flights'      AND  attname = 'aircraft_code'    \gx
-[ RECORD 1 ]---------------------------------------------
mcv | {CN1,CR2,SU9,321,733,763,319,773}
mcf | {0.27886668,0.27266666,0.26176667,0.057166666,0.037666667,0....

To estimate the selectivity of the “column = value” condition, it is enough to find this value in the most_common_vals array and take its frequency from the most_common_freqs array element with the same index:
要估计“column = value”条件的选择性,只需在most_common_vals数组中找到该值,并从具有相同索引的most_common_freqs数组元素中获取其频率:

=>  EXPLAIN SELECT    * FROM  flights   WHERE  aircraft_code = '733';
                                QUERY PLAN
---------------------------------------------------------
 Seq Scan on flights       (cost=0.00..5309.84 rows=8093 width=63)
    Filter: (aircraft_code = '733'::bpchar)
(2 rows)
 => SELECT round(reltuples * s.most_common_freqs[
   array_position((s.most_common_vals::text::text[]),'733')
 ])
 FROM pg_class
   JOIN pg_stats s  ON s.tablename = relname
 WHERE s.tablename = 'flights'
   AND s.attname = 'aircraft_code';
  round
 --------------------
   8093
 (1 row)


-------------

array_position / array_positions:在指定的数组中查找指定的元素,并返回其下标。

array_positions(ARRAY['A','A','B','A'], 'A')  -- 结果 {1,2,4}

SELECT array_position(ARRAY[0, 1, 2], 1);

 array_position
----------------
              2
(1 row)
It is obvious that such estimation will be close to the actual value:
显然,这样的估计将接近实际值:

 => SELECT count(*)  FROM flights  WHERE aircraft_code = '733';
  count
 --------------------
   8263
 (1 row)

The MCV list is also used to estimate selectivity of inequality conditions. For example, a condition like “column < value” requires the analyzer to search through most_common_vals for all the values that are smaller than the target one and sum up the corresponding frequencies listed in most_common_freqs.
MCV 列表还用于估计不等式条件的选择性。例如,像“column < value”这样的条件要求分析器在most_common_vals中搜索所有小于目标值的值,并对most_common_freqs中列出的相应频率求和。
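The MCV part of such an estimate boils down to summing the matching frequencies; here is a minimal sketch (the values below are rounded illustrations, not exact pg_stats output):
这类估计的 MCV 部分归结为对匹配的频率求和;下面是一个最小示意(其中的数值是经过四舍五入的示例,并非精确的 pg_stats 输出):

```python
# Summing the MCV frequencies that satisfy a predicate.
def mcv_selectivity(mcv_vals, mcv_freqs, predicate):
    return sum(f for v, f in zip(mcv_vals, mcv_freqs) if predicate(v))

vals  = ['CN1', 'CR2', 'SU9', '321', '733']
freqs = [0.279, 0.273, 0.262, 0.057, 0.038]

# Equality condition "aircraft_code = '733'" takes the matching frequency:
print(mcv_selectivity(vals, freqs, lambda v: v == '733'))  # 0.038
```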

MCV statistics work best when distinct values are not too many. The maximum size of arrays is defined by the default_statistics_target parameter, which also limits the number of rows to be randomly sampled for the purpose of analysis.
当不同值不是太多时,MCV 统计效果最好。数组的最大大小由default_statistics_target参数定义,该参数还限制了出于分析目的而随机采样的行数。

In some cases, it makes sense to increase the default parameter value, thus expanding the MCV list and improving the accuracy of estimations. You can do it at the column level:
在某些情况下,增大 默认参数值是有意义的,从而扩大 MCV 列表并提高估计的准确性。您可以在列级别执行此操作:

 ALTER TABLE ...
   ALTER COLUMN  ...
   SET STATISTICS  ...;

The sample size will also grow, but only for the specified table.
样本大小也会增加,但仅限于指定的表。

Since the MCV array stores actual values, it may take quite a lot of space. To keep the pg_statistic size under control and avoid loading the planner with useless work, values that are larger than 1 kB are excluded from analysis and statistics. But since such large values are likely to be unique, they would probably not make it into most_common_vals anyway.
由于MCV数组存储的是实际值,因此可能会占用相当多的空间。为了控制 pg_statistic 大小并避免给规划器加载无用的工作,大于 1 kB 的值将被排除在分析和统计之外。但由于如此大的值可能是唯一的,因此它们可能不会进入most_common_vals。

17.5 Histogram

If distinct values are too many to be stored in an array, PostgreSQL employs a histogram. In this case, values are distributed between several buckets of the histogram. The number of buckets is also limited by the default_statistics_target parameter.
如果不同的值太多而无法存储在数组中,PostgreSQL 会使用直方图。在这种情况下,值分布在直方图的几个桶之间。桶的数量也受到default_statistics_target参数的限制。

The bucket width is selected in such a way that each bucket gets approximately the same number of values (this property is reflected in the diagram by the equality of areas of the big hatched rectangles). The values included into MCV lists are not taken into account. As a result, the cumulative frequency of values in each bucket equals 1 / (number of buckets). The histogram is stored in the histogram_bounds field of the pg_stats view as an array of buckets’ boundary values:
桶宽度的选择方式应使每个桶获得大致相同数量的值(该属性在图中通过大阴影矩形的面积相等来反映)。MCV 列表中包含的值不被考虑在内。因此,每个桶中值的累积频率等于 1/桶数。直方图作为桶边界值数组存储在 pg_stats 视图的 histogram_bounds 字段中:

=>  SELECT left(histogram_bounds::text,60) || '...'    AS hist_bounds
FROM  pg_stats s
WHERE  s.tablename = 'boarding_passes'   AND s.attname = 'seat_no';
                             hist_bounds
----------------------------------------------------------------------
 {10B,10E,10F,10F,11H,12B,13B,14B,14H,15G,16B,17B,17H,19B,19B...
(1 row)
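A simplified sketch of how such an equi-depth histogram is built: sort the sampled values and take evenly spaced quantiles as bucket boundaries (this is a simplification of what ANALYZE actually does):
构建这种等深直方图的简化示意:对采样值排序,并取等间隔的分位数作为桶边界(这是对 ANALYZE 实际做法的简化):

```python
# Simplified equi-depth histogram: n_buckets buckets need
# n_buckets + 1 boundary values taken from the sorted sample.
def histogram_bounds(sample, n_buckets):
    s = sorted(sample)
    return [s[(i * (len(s) - 1)) // n_buckets] for i in range(n_buckets + 1)]

# 100 uniformly distributed values, 4 buckets of ~25 values each:
print(histogram_bounds(list(range(1, 101)), 4))  # [1, 25, 50, 75, 100]
```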

Combined with the MCV list, the histogram is used for operations like estimating the selectivity of greater than and less than conditions. For example, let’s take a look at the number of boarding passes issued for back rows:
结合 MCV 列表,直方图可用于估计大于和小于条件的选择性等操作。例如,我们看一下后排签发的登机牌数量:

 => EXPLAIN SELECT  * FROM boarding_passes  WHERE  seat_no > '30B';


                               QUERY PLAN
 ------------------------------------------------------------------
  Seq Scan on boarding_passes    (cost=0.00..157350.10 rows=2983242 ...
    Filter: ((seat_no)::text > '30B'::text)
 (2 rows)

I have intentionally selected the seat number that lies right on the boundary between two histogram buckets.
我特意选择了位于两个直方图桶之间边界上的座位号。

The selectivity of this condition will be estimated as N / (number of buckets), where N is the number of buckets holding the values that satisfy the condition (that is, the ones located to the right of the specified value). It must also be taken into account that MCVs are not included into the histogram.
此条件的选择性将估计为 N/桶数,其中 N 是保存满足条件的值的桶的数量(即位于指定值右侧的桶)。还必须考虑到 MCV 不包含在直方图中。

Incidentally, NULL values do not appear in the histogram either, but the seat_no column contains no such values anyway:
顺便说一句,NULL 值也不会出现在直方图中,但 seat_no 列本来就不包含此类值:

 => SELECT s.null_frac  FROM pg_stats s
 WHERE s.tablename = 'boarding_passes'   AND s.attname = 'seat_no';
 null_frac
-----------
         0
(1 row)

First, let’s find the fraction of MCVs that satisfy the condition:
首先,我们求出满足条件的 MCV 的占比:

=>  SELECT  sum(s.most_common_freqs[
  array_position((s.most_common_vals::text::text[]),v)
])
FROM  pg_stats s, unnest(s.most_common_vals::text::text[]) v
WHERE  s.tablename = 'boarding_passes'        AND  s.attname = 'seat_no'
  AND  v > '30B';
     sum
--------------------
 0.21226665
(1 row)

The overall MCV share (ignored by the histogram) is:
总体 MCV 份额(被直方图忽略)为:

=>  SELECT  sum(s.most_common_freqs[
  array_position((s.most_common_vals::text::text[]),v)
])
FROM  pg_stats s, unnest(s.most_common_vals::text::text[]) v
WHERE  s.tablename = 'boarding_passes'        AND  s.attname = 'seat_no';
     sum
--------------------
 0.67816657
(1 row)

Since the values that conform to the specified condition take exactly N buckets (out of 100 buckets possible), we get the following estimation:
由于符合指定条件的值恰好占用 N 个桶(可能有 100 个桶),因此我们得到以下估计:

=>  SELECT  round( reltuples * (
     0.21226665   -- MCV share
+ (1 - 0.67816657 - 0) * (51 / 100.0)          -- histogram share
  ))
  FROM  pg_class
  WHERE  relname = 'boarding_passes';
  round
--------------------
 2983242
(1 row)

In the generic case of non-boundary values, the planner applies linear interpolation to take into account the fraction of the bucket that contains the target value.
在非边界值的一般情况下,规划器应用线性插值来考虑包含目标值的桶的部分。
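Putting the pieces together, a hedged sketch of the “column > value” estimate might look like this (our own simplified model, ignoring several details of the real selectivity functions):
把各部分组合起来,“column > value”估计的一个简化示意可以写成如下代码(这是我们自拟的简化模型,忽略了真实选择性函数的若干细节):

```python
import bisect

# Simplified model of a "column > value" selectivity estimate:
# MCV matches plus the histogram share, with linear interpolation
# inside the bucket that contains the value.
def gt_selectivity(value, mcv_vals, mcv_freqs, hist_bounds, null_frac):
    mcv_part = sum(f for v, f in zip(mcv_vals, mcv_freqs) if v > value)
    mcv_total = sum(mcv_freqs)
    n_buckets = len(hist_bounds) - 1
    if value <= hist_bounds[0]:
        hist_frac = 1.0                      # everything is to the right
    elif value >= hist_bounds[-1]:
        hist_frac = 0.0                      # nothing is to the right
    else:
        i = bisect.bisect_right(hist_bounds, value) - 1   # bucket index
        lo, hi = hist_bounds[i], hist_bounds[i + 1]
        within = (value - lo) / (hi - lo)    # position inside the bucket
        hist_frac = (n_buckets - i - within) / n_buckets
    # Histogram values cover the (1 - mcv_total - null_frac) share of rows.
    return mcv_part + (1 - mcv_total - null_frac) * hist_frac

# Four buckets, no MCVs, no NULLs; 25 sits halfway through bucket [20, 30):
print(gt_selectivity(25, [], [], [0, 10, 20, 30, 40], 0))  # 0.375
```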

Here is the actual number of back seats:
后排实际座位数如下:

 => SELECT count(*)  FROM boarding_passes  WHERE  seat_no > '30B';
   count
 --------------------
  2993735
 (1 row)

As you increase the default_statistics_target value, estimation accuracy may improve, but as our example shows, the histogram combined with the MCV list usually gives good results even if the column contains many unique values:
当您增加 default_statistics_target 值时,估计精度可能会提高,但正如我们的示例所示,即使列包含许多唯一值,直方图与 MCV 列表相结合通常也会给出良好的结果:

 => SELECT n_distinct  FROM pg_stats
 WHERE tablename = 'boarding_passes'   AND attname = 'seat_no';
 n_distinct
------------
        461
(1 row)

It makes sense to improve estimation accuracy only if it leads to better planning. Increasing the default_statistics_target value without giving it much thought may slow down planning and analysis without bringing any benefits in return. That said, reducing this parameter value (down to zero) can lead to a bad plan choice, even though it does speed up planning and analysis. Such savings are usually unjustified.
只有当提高估算精度能够带来更好的计划时,这样做才有意义。不加思考地增大 default_statistics_target 值可能会减慢规划和分析速度,却不会带来任何好处。另一方面,减小此参数值(甚至降至零)虽然确实能加快规划和分析,但可能导致错误的计划选择。这种节省通常得不偿失。

17.6 Statistics for Non-Scalar Data Types

For non-scalar data types, PostgreSQL can gather statistics not only on the distribution of values, but also on the distribution of elements used to construct these values. It improves planning accuracy when you query columns that do not conform to the first normal form.
对于 非标量数据类型,PostgreSQL 不仅可以收集值分布的统计信息,还可以收集用于构造这些值的元素分布的统计信息。当您查询不符合第一范式的列时,它可以提高计划的准确性。

  • The most_common_elems and most_common_elem_freqs arrays show the list of the most common elements and the frequency of their usage. These statistics are collected and used to estimate selectivity of operations on arrays and tsvector data types.
    most_common_elems 和most_common_elem_freqs 数组显示最常见元素的列表及其使用频率。收集这些统计数据并用于估计 arrays 和 tsvector 数据类型上的操作选择性。
  • The elem_count_histogram array shows the histogram of the number of distinct elements in a value. This data is collected and used for estimating selectivity of operations on arrays only.
    elem_count_histogram 数组显示值中不同元素数量的直方图。该数据的收集和使用仅用于估计阵列操作的选择性。
  • For range types, PostgreSQL builds distribution histograms for range length and lower and upper boundaries of the range. These histograms are used for estimating selectivity of various operations on these types, but the pg_stats view does not display them.
    对于范围类型,PostgreSQL 为范围长度以及范围的上下边界构建分布直方图。这些直方图用于估计这些类型上各种操作的选择性, 但 pg_stats 视图不显示它们。

Similar statistics are also collected for multirange data types.
还针对多范围数据类型收集了类似的统计数据。

17.7 Average Field Width

The avg_width field of the pg_stats view shows the average size of values stored in a column. Naturally, for types like integer or char(3) this size is always the same, but for data types of variable length, such as text, it can vary a lot from column to column:
pg_stats 视图的 avg_width 字段显示列中存储的值的平均大小。当然,对于像整数或 char(3) 这样的类型,这个大小总是相同的,但是对于可变长度的数据类型,例如文本,它在列与列之间可能有很大差异:

 => SELECT attname, avg_width  FROM  pg_stats
 WHERE (tablename, attname)  IN  ( VALUES
   ('tickets', 'passenger_name'), ('ticket_flights','fare_conditions')
 );
      attname      | avg_width
 ------------------+--------------------
  fare_conditions  |        8
  passenger_name   |        16
 (2 rows)

This statistic is used to estimate the amount of memory required for operations like sorting or hashing.
此统计数据用于估计排序或散列等操作所需的内存量。

17.8 Correlation

The correlation field of the pg_stats view shows the correlation between the physical order of data and the logical order defined by comparison operations. If values are stored strictly in ascending order, their correlation will be close to 1; if they are arranged in descending order, their correlation will be close to -1. The more chaotic is data distribution on disk, the closer is the correlation to zero.
pg_stats视图的 correlation字段显示了 数据的物理顺序和比较操作定义的逻辑顺序之间的相关性。如果值严格按升序存储,则它们的相关性将接近 1;如果按降序排列,它们的相关性将接近-1。磁盘上的数据分布越混乱,相关性越接近于零。
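Conceptually, this is just the statistical correlation between a row's physical position and its value; a pure-Python sketch (the function name is ours):
从概念上讲,这就是行的物理位置与其值之间的统计相关性;下面是一个纯 Python 示意(函数名为自拟):

```python
# pg_stats.correlation is essentially the correlation between a row's
# physical position (0, 1, 2, ...) and its column value.
def correlation(values):
    n = len(values)
    mean_p = (n - 1) / 2                 # mean of positions 0..n-1
    mean_v = sum(values) / n
    cov = sum((p - mean_p) * (v - mean_v) for p, v in enumerate(values))
    var_p = sum((p - mean_p) ** 2 for p in range(n))
    var_v = sum((v - mean_v) ** 2 for v in values)
    return cov / (var_p * var_v) ** 0.5

print(correlation([1, 2, 3, 4, 5]))  # 1.0  : stored in ascending order
print(correlation([5, 4, 3, 2, 1]))  # -1.0 : stored in descending order
```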

 => SELECT attname, correlation
 FROM pg_stats  WHERE tablename = 'airports_data'
 ORDER BY abs(correlation)  DESC;
    attname     | correlation
 ---------------+--------------------
  coordinates   |
  airport_code  | -0.21120238
  city          | -0.1970127
  airport_name  | -0.18223621
  timezone      |  0.17961165
 (5 rows)

Note that this statistic is not gathered for the coordinates column: less than and greater than operators are not defined for the point type.
请注意,不会为 coordinates 列收集此统计信息:point 类型没有定义小于和大于运算符。

Correlation is used for cost estimation of index scans.
相关性用于索引扫描的成本估计

17.9 Expression Statistics

Column-level statistics can be used only if either the left or the right part of the comparison operation refers to the column itself and does not contain any expressions. For example, the planner cannot predict how computing a function of a column will affect statistics, so for conditions like “function-call = constant” the selectivity is always estimated at 0.5%:
仅当比较操作的左侧或右侧部分引用列本身并且不包含任何表达式时,才能使用列级统计信息。例如,规划器无法预测计算列的函数将如何影响统计数据,因此对于“函数调用=常量”之类的条件,选择性始终估计为 0.5%:

=>  EXPLAIN SELECT  * FROM flights
WHERE  extract(
   month FROM scheduled_departure   AT TIME ZONE 'Europe/Moscow'
) = 1;
                               QUERY PLAN
------------------------------------------------------------------
 Seq Scan on flights    (cost=0.00..6384.17 rows=1074 width=63)
    Filter: (EXTRACT(month FROM (scheduled_departure AT TIME ZONE ...
(2 rows)
=>  SELECT round(reltuples * 0.005)
FROM  pg_class WHERE  relname = 'flights';
 round
--------------------
   1074
(1 row)

The planner knows nothing about the semantics of functions, even of standard ones. Our general knowledge suggests that the flights performed in January make up roughly 1/12 of the total number of flights, which exceeds the projected value by one order of magnitude.
规划器对函数的语义一无所知,即使是标准函数也不例外。根据常识,1 月份执行的航班大约占航班总数的 1/12,比估计值高出一个数量级。

To improve the estimation, we have to collect expression statistics rather than rely on the column-level one. There are two ways to do it.
为了改进估计,我们必须收集 表达式统计信息,而不是依赖于列级统计信息。有两种方法可以做到这一点。

17.9.1 Extended Expression Statistics

扩展表达式统计

The first option is to use extended expression statistics. Such statistics are not col- lected by default; you have to manually create the corresponding database object by running the CREATE STATISTICS command:
第一个选项是使用扩展表达式统计信息。 默认情况下不收集此类统计信息;必须通过运行 CREATE STATISTICS 命令手动创建相应的数据库对象

 => CREATE STATISTICS  flights_expr  ON (extract(
     month FROM  scheduled_departure  AT TIME ZONE  'Europe/Moscow'
 ))
 FROM flights;

Once the data is gathered, the estimation accuracy improves:
收集数据后,估计精度就会提高:

 => ANALYZE flights;
 => EXPLAIN SELECT  * FROM flights
 WHERE extract(
   month FROM scheduled_departure   AT TIME ZONE  'Europe/Moscow'
 ) = 1;
                               QUERY PLAN
 --------------------------------------------------------------------
  Seq Scan on flights   (cost=0.00..6384.17 rows=16667 width=63)
    Filter: (EXTRACT(month FROM (scheduled_departure AT TIME ZONE ...
 (2 rows)

For the collected statistics to be applied, the query must specify the expression in exactly the same form that was used by the CREATE STATISTICS command.
为了应用收集的统计信息,查询必须以与 CREATE STATISTICS 命令所使用的形式完全相同的形式指定表达式。

The size limit for extended statistics can be adjusted separately by running the ALTER STATISTICS command. For example:
扩展统计信息的大小限制可以通过运行 ALTER STATISTICS 命令单独调整。例如:

 => ALTER STATISTICS  flights_expr  SET STATISTICS  42;

All the metadata related to extended statistics is stored in the pg_statistic_ext table of the system catalog, while the collected data itself resides in a separate table called pg_statistic_ext_data. This separation is used to implement access control for sensitive information.
所有与扩展统计信息相关的元数据都存储在系统目录的 pg_statistic_ext 表中,而收集的数据本身驻留在一个名为 pg_statistic_ext_data 的单独表中。这种分离用于实现对敏感信息的访问控制。

Extended expression statistics available to a particular user can be displayed in a more convenient format in a separate view:
特定用户可用的扩展表达式统计信息可以在单独的视图中以更方便的格式显示:

=>  SELECT  left(expr,50) || '...'      AS  expr,
  null_frac, avg_width, n_distinct,
  most_common_vals     AS  mcv,
  left(most_common_freqs::text,50) || '...'           AS mcf,
  correlation
FROM  pg_stats_ext_exprs
WHERE  statistics_name = 'flights_expr'        \gx
-[ RECORD 1 ]------------------------------------------------------
expr           | EXTRACT(month FROM (scheduled_departure AT TIME ZO...
null_frac      | 0
avg_width      | 8
n_distinct     | 12
mcv            | {8,9,12,3,1,5,6,7,11,10,4,2}
mcf            | {0.12053333,0.11326667,0.0802,0.07976667,0.0775666...
correlation    | 0.08355749

17.9.2 Statistics for Expression Indexes

表达式索引统计

Another way to improve cardinality estimation is to use special statistics collected for expression indexes; these statistics are gathered automatically when such an index is created, just like it is done for a table. If the index is really needed, this approach turns out to be very convenient.
改进基数估计的另一种方法是使用为表达式索引收集的特殊统计信息;创建此类索引时会自动收集这些统计信息,就像对表所做的那样。如果确实需要这样的索引,这种方法会非常方便。

=>  DROP STATISTICS    flights_expr;
=>  CREATE INDEX ON    flights(extract(
  month   FROM  scheduled_departure  AT TIME ZONE 'Europe/Moscow'
));
=>  ANALYZE  flights;
=>  EXPLAIN SELECT    * FROM  flights
WHERE  extract(
  month   FROM  scheduled_departure     AT TIME ZONE 'Europe/Moscow'
) = 1;
                                   QUERY PLAN
------------------------------------------------------------------
 Bitmap Heap Scan on flights    (cost=324.86..3247.92 rows=17089 wi...
    Recheck Cond: (EXTRACT(month FROM (scheduled_departure AT TIME...
    -> Bitmap Index Scan on flights_extract_idx   (cost=0.00..320.5...
         Index Cond: (EXTRACT(month FROM (scheduled_departure AT TI...
(4 rows)

Statistics on expression indexes are stored in the same way as statistics on tables. For example, you can get the number of distinct values by specifying the index name as tablename when querying pg_stats:
表达式索引的统计信息的存储方式与表的统计信息相同。例如,在查询 pg_stats 时,可以通过指定索引名称为 tablename 来获取不同值的数量:

 => SELECT n_distinct  FROM pg_stats
 WHERE tablename = 'flights_extract_idx';
 n_distinct
------------
         12
(1 row)

You can adjust the accuracy of index-related statistics using the ALTER INDEX command. If you do not know the column name that corresponds to the indexed expression, you have to find it out first. For example:
您可以使用 ALTER INDEX 命令调整索引相关统计信息的准确性。如果您不知道索引表达式对应的列名,则必须首先找到它。例如:

 => SELECT attname  FROM pg_attribute
 WHERE attrelid = 'flights_extract_idx'::regclass;

  attname
 --------------------
  extract
 (1 row)
 => ALTER INDEX  flights_extract_idx
   ALTER COLUMN  extract SET STATISTICS  42;

17.10 Multivariate Statistics

It is also possible to collect multivariate statistics, which span several table columns. As a prerequisite, you have to manually create the corresponding extended statis- tics using the CREATE STATISTICS command.
还可以收集跨多个表列的多变量统计信息。作为先决条件,您必须使用 CREATE STATISTICS 命令手动创建相应的扩展统计信息。

PostgreSQL implements three types of multivariate statistics.
PostgreSQL 实现了三种类型的多变量统计信息。

17.10.1 Functional Dependencies Between Columns

If values in one column depend (fully or partially) on values in another column and the filter conditions include both these columns, cardinality will be underestimated.
如果一列中的值(完全或部分)依赖于另一列中的值,并且过滤条件包括这两列,则基数将被低估。

Let’s consider a query with two filter conditions:
让我们考虑一个具有两个过滤条件的查询:

=>  SELECT count(*)  FROM flights
WHERE  flight_no = 'PG0007'  AND departure_airport = 'VKO';
 count
-------
   396
(1 row)

The value is hugely underestimated:
其值被严重低估:

=>  EXPLAIN SELECT  * FROM flights
WHERE  flight_no = 'PG0007'  AND departure_airport = 'VKO';
                               QUERY PLAN
----------------------------------------------------------------
 Bitmap Heap Scan on flights    (cost=10.49..816.84 rows=15 width=63)
    Recheck Cond: (flight_no = 'PG0007'::bpchar)
    Filter: (departure_airport = 'VKO'::bpchar)
    -> Bitmap Index Scan on flights_flight_no_scheduled_departure_key
        (cost=0.00..10.49 rows=276 width=0)
        Index Cond: (flight_no = 'PG0007'::bpchar)
(6 rows)

It is a well-known problem of correlated predicates. The planner assumes that predicates do not depend on each other, so the overall selectivity is estimated as the product of selectivities of the filter conditions combined by logical AND. The plan above clearly illustrates this issue: the value estimated by the Bitmap Index Scan node for the condition on the flight_no column is significantly reduced once the Bitmap Heap Scan node filters the results by the condition on the departure_airport column.
这是一个众所周知的相关谓词问题。规划器假设谓词互不依赖,因此整体选择性被估计为由逻辑 AND 组合的各过滤条件选择性的乘积。上面的计划清楚地说明了这个问题:Bitmap Index Scan 节点针对 flight_no 列条件估计出的值,在 Bitmap Heap Scan 节点按 departure_airport 列条件过滤结果后被显著削减。
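Under the independence assumption, the combined selectivity is simply the product of the individual ones. A minimal sketch of this arithmetic, with selectivities back-derived from the plan above (the real planner takes them from MCV lists and histograms):

```python
# Cardinality of two AND-ed conditions under the independence assumption.
# Selectivities are back-derived from the plan above, not real catalog data.
total = 214_867                 # reltuples for flights
sel_flight_no = 276 / total     # rows=276 in the Bitmap Index Scan
sel_airport = 15 / 276          # back-derived: 15 rows survive the filter

# With logical AND, the planner multiplies the selectivities:
rows = round(total * sel_flight_no * sel_airport)
print(rows)                     # 15 — the plan's estimate, far below the actual 396
```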

However, we do understand that airports are unambiguously defined by flight numbers: the second condition is virtually redundant (unless there is a mistake in the airport name, of course). In such cases, we can improve the estimation by applying extended statistics on functional dependencies.
然而,我们确实知道机场是由航班号唯一确定的:第二个条件实际上是多余的(当然,除非机场名称有误)。在这种情况下,我们可以通过应用函数依赖的扩展统计信息来改进估计。

Let’s create an extended statistic on the functional dependency between the two columns:
让我们创建一个关于两列之间的函数依赖关系的扩展统计数据:

=>  CREATE STATISTICS  flights_dep(dependencies)
ON  flight_no, departure_airport  FROM  flights;

The next analysis run gathers this statistic, and the estimation improves:
下一次分析运行会收集此统计数据,并且估计会得到改进:

 => ANALYZE flights;
 => EXPLAIN SELECT  * FROM flights
 WHERE flight_no = 'PG0007'
   AND departure_airport = 'VKO';
                               QUERY PLAN
 --------------------------------------------------------------
  Bitmap Heap Scan on flights    (cost=10.57..819.51 rows=277 width=63)
    Recheck Cond: (flight_no = 'PG0007'::bpchar)
    Filter: (departure_airport = 'VKO'::bpchar)
    -> Bitmap Index Scan on flights_flight_no_scheduled_departure_key
        (cost=0.00..10.50 rows=277 width=0)
        Index Cond: (flight_no = 'PG0007'::bpchar)
 (6 rows)

The collected statistic is stored in the system catalog and can be accessed like this:
收集的统计信息存储在系统目录中,可以像这样访问:

 => SELECT dependencies
 FROM pg_stats_ext  WHERE statistics_name = 'flights_dep';
                 dependencies
 ------------------------------------------------------------
  {"2 => 5": 1.000000, "5 => 2": 0.010200}
 (1 row)

Here 2 and 5 are column numbers stored in the pg_attribute table, whereas the corresponding values define the degree of functional dependency: from 0 (no dependency) to 1 (values in the second column fully depend on values in the first column).
这里2和5是存储在pg_attribute表中的列号,而相应的值定义了函数依赖的程度:从0(无依赖)到1(第二列中的值完全依赖于第一列中的值)。
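The way this degree enters the estimate can be sketched as follows. The formula mirrors the one PostgreSQL applies for a dependency a → b of degree f, namely P(a ∧ b) = P(a) · (f + (1 − f) · P(b)); the numbers are taken from the plans above:

```python
# Selectivity correction for a functional dependency a -> b of degree f:
#   P(a and b) = P(a) * (f + (1 - f) * P(b))
# With f = 1 the dependent condition is ignored entirely; with f = 0 the
# formula degrades to the plain product of independent selectivities.
def dependent_selectivity(sel_a: float, sel_b: float, f: float) -> float:
    return sel_a * (f + (1.0 - f) * sel_b)

total = 214_867
sel_flight_no = 277 / total     # from the improved plan above
sel_airport = 15 / 276          # back-derived from the original plan

rows = round(total * dependent_selectivity(sel_flight_no, sel_airport, f=1.0))
print(rows)                     # 277 — the improved estimate
```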

17.10.2 Multivariate Number of Distinct Values

Statistics on the number of unique combinations of values stored in different columns improves cardinality estimation of a GROUP BY operation performed on several columns.
对存储在不同列中的值的唯一组合的数量的统计改进了对多个列执行的 GROUP BY 操作的基数估计。

For example, here the estimated number of possible pairs of departure and arrival airports is the square of the total number of airports; however, the actual value is much smaller, as not all the pairs are connected by direct flights:
例如,这里估计的可能的出发和到达机场对的数量是机场总数的平方;然而,实际值要小得多,因为并非所有机场对之间都有直飞航班:

=>  SELECT  count(*)
FROM  (
  SELECT DISTINCT     departure_airport, arrival_airport  FROM  flights
) t;
 count
-------
   618
(1 row)
=>  EXPLAIN SELECT DISTINCT departure_airport, arrival_airport
FROM  flights;
                                   QUERY PLAN
---------------------------------------------------------------------
 HashAggregate     (cost=5847.01..5955.16 rows=10816 width=8)
    Group Key: departure_airport, arrival_airport
    -> Seq Scan on flights     (cost=0.00..4772.67 rows=214867 width=8)
(3 rows)

Let’s define and collect an extended statistic on distinct values:
让我们定义并收集不同值的扩展统计数据:

=>  CREATE STATISTICS     flights_nd(ndistinct)
ON  departure_airport, arrival_airport        FROM  flights;
=>  ANALYZE  flights;

The cardinality estimation has improved:
基数估计得到了改进:

=>  EXPLAIN SELECT DISTINCT      departure_airport, arrival_airport
FROM  flights;
                                   QUERY PLAN
----------------------------------------------------------------------
 HashAggregate     (cost=5847.01..5853.19 rows=618 width=8)
    Group Key: departure_airport, arrival_airport
    -> Seq Scan on flights       (cost=0.00..4772.67 rows=214867 width=8)
(3 rows)

You can view the collected statistic in the system catalog:
您可以在系统目录中查看收集的统计信息:

=>  SELECT  n_distinct
FROM  pg_stats_ext    WHERE  statistics_name = 'flights_nd';

  n_distinct
--------------------
 {"5, 6": 618}
(1 row)
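The effect is easy to see numerically: the first plan's estimate of 10816 groups is exactly 104², the squared number of distinct airports, while the extended statistic stores the true number of combinations (a sketch; the planner additionally caps the naive product, which is omitted here):

```python
# GROUP BY cardinality on two columns: naive product of per-column
# distinct counts vs. the stored multivariate ndistinct value.
n_airports = 104                       # distinct airports in each column

naive_groups = n_airports * n_airports # 10816, as in the first plan
actual_groups = 618                    # stored in the ndistinct statistic

print(naive_groups, actual_groups)     # 10816 618
```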

17.10.3 Multivariate MCV Lists

If the distribution of values is non-uniform, it may not be enough to rely on the functional dependency alone, as the estimation accuracy will highly depend on the particular pair of values. For example, the planner underestimates the number of flights performed by Boeing 737 from Sheremetyevo airport:
如果值的分布不均匀,则仅依赖函数依赖可能还不够,因为估计精度将高度依赖于具体的值对。例如,规划器低估了从谢列梅捷沃机场起飞的波音 737 的航班数量:

=> SELECT count(*) FROM flights
WHERE departure_airport = 'SVO' AND aircraft_code = '733';
 count
-------
  2037
(1 row)
=> EXPLAIN SELECT * FROM flights
WHERE departure_airport = 'SVO' AND aircraft_code = '733';
                            QUERY PLAN
----------------------------------------------------------------------
 Seq Scan on flights  (cost=0.00..5847.00 rows=736 width=63)
   Filter: ((departure_airport = 'SVO'::bpchar) AND (aircraft_cod...
(2 rows)

In this case, you can improve the estimation by collecting statistics on multivariate MCV lists:
在这种情况下,您可以通过收集多元 MCV 列表的统计信息来改进估计:

=>  CREATE STATISTICS     flights_mcv(mcv)
ON  departure_airport, aircraft_code        FROM  flights;
=>  ANALYZE   flights;

The new cardinality estimation is much more accurate:
新的基数估计更加准确:

=> EXPLAIN SELECT * FROM flights
WHERE departure_airport = 'SVO' AND aircraft_code = '733';
                            QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on flights  (cost=0.00..5847.00 rows=1927 width=63)
   Filter: ((departure_airport = 'SVO'::bpchar) AND (aircraft_cod...
(2 rows)

To get this estimation, the planner relies on the frequency values stored in the system catalog:
为了获得此估计,规划器依赖于系统目录中存储的频率值:

=> SELECT values, frequency
FROM pg_statistic_ext stx
   JOIN pg_statistic_ext_data stxd ON stx.oid = stxd.stxoid,
   pg_mcv_list_items(stxdmcv) m
WHERE stxname = 'flights_mcv'
AND values = '{SVO,733}';
  values   |      frequency
-----------+----------------------
 {SVO,733} | 0.008966666666666667
(1 row)

Just like a regular MCV list, a multivariate list holds default_statistics_target values (if this parameter is also set at the column level, the largest of these values is used). If required, you can also change the size of the list, as is done for extended expression statistics:
就像常规 MCV 列表一样,多变量列表保存 default_statistics_target 个值(如果该参数也在列级别设置,则使用其中的最大值)。如果需要,您还可以更改列表的大小,就像扩展表达式统计信息那样:

ALTER STATISTICS   ... SET STATISTICS  ...;

In all these examples, I have used only two columns, but you can collect multivariate statistics on a larger number of columns too. To combine statistics of several types in one object, you can provide a comma-separated list of these types in its definition. If no type is specified, PostgreSQL will collect statistics of all the possible types for the specified columns. Apart from actual column names, multivariate statistics can also use arbitrary expressions, just like expression statistics.
在所有这些示例中,我仅使用了两列,但您也可以收集更多列的多变量统计信息。要在一个对象中合并多种类型的统计信息,您可以在其定义中提供这些类型的逗号分隔列表。如果没有指定类型,PostgreSQL 将为指定的列收集所有可能类型的统计信息。除了实际的列名之外,多变量统计信息还可以使用任意表达式,就像表达式统计信息一样。

