7 Star 31 Fork 21

xuejianxinokok/mysqldoc

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
文件
克隆/下载
03_Pages and Tuples.md 64.90 KB
一键复制 编辑 原始数据 按行查看 历史
xuejianxinokok 提交于 4个月前 . 4444

3 Pages and Tuples 页和元组

3.1 Page Structure 页结构

下图显示了 表->数据文件->页 之间的关系

一个 表 包含多个 数据文件,一个数据文件包含多个页,一个页内包含多行数据。

默认情况下,Postgres 中的所有表数据都使用 “堆”方法 进行物理存储。因此,每个数据库都是一组 1Gb 文件(“段”),每个文件在逻辑上分为 8Kb 页面。实际的表行被放入任何具有足够可用空间的 页中。

PostgreSQL中的每个表(TABLE)都将由一个或多个 堆文件表示。默认情况下,表的每个1GB块(Block)存储在一个单独的(堆)文件中。当该表文件已经达到1GB之后,用户再次插入数据时,postgres会重新创建一个新的堆文件,新文件名格式是:表Oid +“.” + 序号id(序号id从1开始一次递增)。示意图如下所示,其中tudentCREATE TABLE创建的表名,该student表对应的堆文件名是16387。

在每个数据文件(堆文件、索引文件、FSM文件、VM文件等)内部,它分为固定长度的页(或块)。换言之,即一个1GB大小的表文件内部是有若干个大小固定的页组成。页的默认大小为8192字节(8KB)。单个表文件中的这些页(Page)从0开始进行顺序编号,这些编号也称为“块编号(Block Numbers)”。如果第一页空间已经被数据填满,则postgres会立刻重新在文件末尾(即已填满页的后面)添加一个新的空白页,用于继续存储数据,一直持续这个过程,直到当前表文件大小达到1GB位置。若文件达到1GB,则重新创建一个新的表文件,然后重复上面的这个过程。

每一个数据文件被称为一个数据段(Segment)文件,符号 . 后就是段号(segno),第 0 个分段文件没有段号。

在分配和读取数据文件的时候,为了效率,一般会以块为单位, 在 Postgres 中默认块大小是 8KB,可以在系统初始化时设置。

  /* Size of a disk block --this also limits the size of a tuple. You can set
     it bigger if you need bigger tuples (although TOAST should reduce the need
     to have large tuples, since fields can be spread across multiple tuples).
     BLCKSZ must be a power of 2. The maximum possible value of BLCKSZ is
     currently 2^15 (32768). This is determined by the 15-bit widths of the
     lp_off and lp_len fields in ItemIdData (see include/storage/itemid.h).
     Changing BLCKSZ requires an initdb. */
  #define BLCKSZ 8192

还有另一个参数是 RELSEG_SIZE (关系 段 大小),表示一个文件中的最大块数量,使用这个参数和 BLOCKZ,就能够计算出每个数据文件的最大值,目前默认是 1GB(131072 * 8192 / 1024 / 1024 / 1024 = 1GB)。

  /* RELSEG_SIZE is the maximum number of blocks allowed in one disk file. Thus,
     the maximum size of a single file is RELSEG_SIZE * BLCKSZ; relations bigger
     than that are divided into multiple files. RELSEG_SIZE * BLCKSZ must be
     less than your OS' limit on file size. This is often 2 GB or 4GB in a
     32-bit operating system, unless you have large file support enabled. By
     default, we make the limit 1 GB to avoid any possible integer-overflow
     problems within the OS. A limit smaller than necessary only means we divide
     a large relation into more chunks than necessary, so it seems best to err
     in the direction of a small limit. A power-of-2 value is recommended to
     save a few cycles in md.c, but is not absolutely required. Changing
     RELSEG_SIZE requires an initdb. */
  #define RELSEG_SIZE 131072

每个块都有一个唯一的标识,叫做 block number。它们依次递增,连续的存储在文件里。每个文件分片包含的块数目是相同的。下面展示了根据 block number 找到对应的文件segno ,用块号 直接可以定位到文件 以及从该文件的偏移位置

// blkno是块的唯一标识,RELSEG_SIZE是切片包含的块数目
targetseg = blkno / ((BlockNumber) RELSEG_SIZE);

Postgresql 文件存储层 | 学习笔记

Postgres 源码学习 3—Postgres 存储管理器

然后每个页的内部又由一个页头(Page Header)、若干行指针(Line Pointer)、若干个元组数据(Heaple Tuple)组成。单个文件大小1GB(默认,可以修改其大小),因为堆(重点将普通堆)文件是由页组成,所以可知一个堆文件中有:1GB = (1024 * 1024) KB / 8KB(Page) = 131072个页。

为什么需要行指针?

这个设计允许重新排序。页面外的实体通过 行指针 引用页面,只有页面本身知道相应项目的位置。这意味着,如果删除了某些项目,并且页面上有碎片,Postgres 可以在内部对项目进行碎片整理和重新排序,而不必担心外部引用.

Each page has a certain inner layout that usually consists of the following parts:
每个页都有一定的内部结构,通常由以下部分组成:

  • page header
  • an array of item pointers
    条目指针数组
  • free space 可用空间
  • items (row versions) 项目(行版本)
  • special space 特殊空间

Page Header

The page header is located in the lowest addresses and has a fixed size. It stores various information about the page, such as its checksum and the sizes of all the other parts of the page.
页头 位于 页内偏移低地址 并且具有固定的大小。它存储有关页的各种信息,例如其 校验和 以及页所有其他部分的大小。

These sizes can be easily displayed using the pageinspect extension. Let’s take a look at the first page of the table (page numbering is zero-based):
使用 pageinspect 插件可以查询页结构各部分的大小。让我们看一下表的第一页(页号从0开始):

 -- windows下这需要 lib/pageinspect.dll
 => CREATE EXTENSION  pageinspect;
 => SELECT lower, upper, special, pagesize
 FROM page_header(get_raw_page('accounts',0));
 -- 可以查看下图确定lower ,upper 的位置
  lower | upper | special | pagesize
 -------+-------+---------+--------------------
    152 |   6904|     8192|   8192
 (1 row)

相关结构定义如下:

// 页头
typedef struct PageHeaderData
{
    /* XXX LSN is member of *any* block, not only page-organized ones */
    PageXLogRecPtr pd_lsn;        /* LSN: next byte after last byte of xlog
                                 * record for last change to this page */
    uint16        pd_checksum;    /* checksum */
    uint16        pd_flags;        /* flag bits, see below */
    LocationIndex pd_lower;        /* offset to start of free space */
    LocationIndex pd_upper;        /* offset to end of free space */
    LocationIndex pd_special;    /* offset to start of special space */
    uint16        pd_pagesize_version;
    TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
    ItemIdData    pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;
// 行指针
typedef struct ItemIdData
{
    unsigned    lp_off:15,        /* offset to tuple (from start of page) */
                lp_flags:2,       /* state of line pointer, see below */
                lp_len:15;        /* byte length of tuple */
} ItemIdData;

typedef ItemIdData *ItemId;

/*
 * lp_flags has these possible states.  An UNUSED line pointer is available
 * for immediate re-use, the other states are not.
 */
#define LP_UNUSED        0        /* unused (should always have lp_len=0) */
#define LP_NORMAL        1        /* used (should always have lp_len>0) */
#define LP_REDIRECT      2        /* HOT redirect (should have lp_len=0) */
#define LP_DEAD          3        /* dead, may or may not have storage */

pd_lsn:最后更新此页面的xlog的lsn。buffer manager通过该值来保证wal的基本规则“wal日志应该先于数据页面落盘”。数据脏页的pd_lsn必须等到至少等于pd_lsn的xlog已经落盘,脏页本身才能落盘。在FlushBuffer函数中,会先获取该页面的pd_lsn,如果是非unlogged表的话,调用XLogFlush先将wal刷盘。

recptr = BufferGetLSN(buf);
if (buf_state & BM_PERMANENT)
        XLogFlush(recptr);

pd_checksum:如果开启了校验和,那么pd_checksum存储了该页面的校验和,0是一个有效的校验和。如果校验和未被使用,则我们不设置该字段。这通常意味着该字段为零,但如果数据库在9.3之前的版本中进行了pg_upgraded,则也可能存在非零值,当时该偏移记录的是页面上次被更新时的时间线。需要注意的是页面上没有关于校验和是否有效的标志位,这是一种经过深思熟虑的设计选择,可以避免依赖页面内容来决定是否验证的问题。因此,没有与校验和相关的标志位。

pd_flags:当新添加一条数据到 page 里,需要快速判断是否有空闲空间。page 的 pd_flags 记录了 page 是否有空闲空间,它的标记位如下

#define PD_HAS_FREE_LINES 0x0001 /* are there any unused line pointers? */
#define PD_PAGE_FULL 0x0002 /* not enough free space for new tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to  everyone */
#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */

PD_HAS_FREE_LINES:是否还有未使用的 linp 指针,是否有空闲的数据指针

PD_PAGE_FULL:页面已满,剩余的空间无法容纳新的 Tuple

PD_ALL_VISIBLE:page 所有的 tuple 都是可见的

PD_VALID_FLAG_BITS:全部有效的 pd_flags 标记位

pd_lower:该 page 内空闲空间的起始位置

pd_upper:该 page 内空闲空间的结束位置

pd_prune_xid:一个标志位,索引页中该字段无用。它记录了页面上潜在可以prune的元组的最老的xid, 也就是页面上最早未删除 XMAX,如果没有则为0.

pd_pagesize_version:存储页面大小和版本信息,页面的版本号和size放在了一个字段中。这是历史原因,7.3以前没有页面版本号,所以将7.3以前的页面版本认为是0,我们将页面大小限制为256的倍数,保留低8位用于版本号。最小可能的页面大小可能是64B,以适应页面头、不透明空间和最小元组;我们只能支持最大32KB的页面,因为lp_off/lp_len是15位。

pd_linp:即前面注释中标注的 linp 1 linp 2 linp 3 … Linp n,是一个数组,用来标识 page 内一条数据的位置偏移,使用结构体 ItemIdData 表示。

  • lp_off 占 15 位,表示数据在 page 的偏移

  • lp_flags 占 2 位,表示状态,取值有这几种:

    /** lp_flags has these possible states. An UNUSED line pointer is available 
      * for immediate re-use, the other states are not. */ 
    
     #define LP_UNUSED 0   /* unused (should always have lp_len=0) */ 
     #define LP_NORMAL 1   /* used (should always have lp_len>0) */ 
     #define LP_REDIRECT 2 /* HOT redirect (should have lp_len=0) */ 
     #define LP_DEAD 3     /* dead, may or may not have storage */
    

    LP_UNUSED:表示指针 空闲,未被使用

    LP_NORMAL:指向实际的偏移位置

    LP_REDIRECT:不指向实际数据,而是一个跳转,指向其他的 ItemIdData,用于 HOT(Heap Only Tuple),在页剪枝后用于指向别的行指针

    LP_DEAD:数据已经被删除

  • lp_length:数据的长度

Special Space 特殊空间

The special space is located in the opposite part of the page, taking its highest ad- dresses. It is used by some indexes to store auxiliary information; in other indexes and table pages this space is zero-sized.
特殊空间 和 页头正好相对(页头位于开始,特殊空间位于页尾),占据其最高位置。被一些索引(如GIN,HASH)用来存储辅助信息;在 其他索引和表 页中,特殊空间大小为零。为什么叫特殊空间? 就是为了使这些特殊索引页来特殊使用的。

In general, the layout of index pages is quite diverse; their content largely depends on a particular index type. Even one and the same index can have different kinds of pages: for example, B-trees have a metadata page of a special structure (page zero) and regular pages that are very similar to table pages.
一般来说,索引页 的布局是多种多样的;它们的内容很大程度上取决于索引类型。即使同一种索引也可以有不同类型的页:例如,B 树有一个特殊结构的元数据页(第 0 页)和与表页非常相似的常规页。

Tuples

Rows contain the actual data stored in the database, together with some additional information. They are located just before the special space.
行(items) 包含存储在数据库中的实际数据以及一些附加信息(其实就是内置字段)。它们位于 特殊空间 之前。

In the case of tables, we have to deal with row versions rather than rows because multiversion concurrency control implies having several versions of one and the same row. Indexes do not use this MVCC mechanism; instead, they have to ref- erence all the available row versions, falling back on visibility rules to select the appropriate ones.
对于表,我们必须处理 行版本 而不是行,因为多版本并发控制意味着同一行具有多个版本。索引不使用MVCC机制,只有一个版本;相反,索引必须引用所有可用的行版本,依靠可见性规则来选择合适的版本。

Both table row versions and index entries are often referred to as tuples. This term is borrowed from the relational theory—it is yet another legacy of PostgreSQL’s academic past.
表行版本和索引条目通常称为元组。这个术语是从关系理论借用的——它是 PostgreSQL 学术历史的又一遗产。

Item Pointers

The array of pointers to tuples serves as the page’s table of contents. It is located right after the header.
指向元组的 指针数组 充当 页目录。它位于 页(块)头 之后。

Index entries have to refer to particular heap tuples somehow. PostgreSQL em- ploys six-byte tuple identifiers (TIDS) for this purpose. Each TID consists of the page number of the main fork and a reference to a particular row version located in this page.
索引元组 必须以某种方式引用特定的堆元组。为此,PostgreSQL 使用 六字节元组标识符 (TID)。每个 TID 由 主分叉的 页号和对该页中 特定行版本的引用(指针) 组成,形式为(页号,指针编号)。

In theory, tuples could be referred to by their offset from the start of the page. But then it would be impossible to move tuples within pages without breaking these references, which in turn would lead to page fragmentation and other unpleasant consequences.
理论上,元组可以通过它们距 页开头的偏移量来引用。但是,在不破坏这些引用的情况下,不可能在页内移动元组,这反过来又会导致页碎片和其他令人不快的后果。

For this reason, PostgreSQL uses indirect addressing: a tuple identifier refers to the corresponding pointer number, and this pointer specifies the current offset ofthe tuple. If the tuple is moved within the page, its TID still remains the same; it is enough to modify the pointer, which is also located in this page.
为此,PostgreSQL 使用间接寻址:元组标识符引用相应的指针编号,该指针指定元组的当前偏移量(把偏移存储在页内,页外用 行指针编号访问)。如果元组在页内移动,它的TID仍然保持不变;修改指针就足够了,该指针也位于该页中。

Each pointer takes exactly four bytes and contains the following data:
每个指针恰好占用四个字节并包含以下数据:

  • tuple offset from the start of the page
    元组 距 页开头 的偏移量

  • tuple length

    元组长度

  • several bits defining the tuple status
    定义元组状态的几个位

Free Space 空闲空间

Pages can have some free space left between pointers and tuples (which is reflected in the free space map). There is no page fragmentation: all the free space available is always aggregated into one chunk.

页在指针和元组之间可以留有一些 空闲空间(这反映在 空闲空间映射 中)。没有页内碎片:所有可用空间 总是聚合成一个块。

3.2 Row Version Layout 行版本布局

Each row version contains a header followed by actual data. The header consists of multiple fields, including the following:
每个行版本都包含一个行头,后跟实际数据。行头由多个字段组成,包括以下内容:

  • xmin, xmax represent transaction IDS; they are used to differentiate between this and other versions of one and the same row.
    xmin、xmax代表事务ID;它们用于区分这一行和同一行的其他版本。

  • infomask provides a set of information bits that define version properties.
    infomask 提供一组 定义版本属性 的信息位。

  • ctidis a pointer to the next updated version of the same row.
    ctid 是指向 同一行 的下一个更新版本的指针。我猜测: c 可能是chain 意思

  • null bitmap is an array of bits marking the columns that can contain NULL values.

    null 位映射 是一个位数组,标记可以包含 NULL 值的列。

As a result, the header turns out quite big: it requires at least 23 bytes for each tuple, and this value is often exceeded because of the null bitmap and the mandatory padding used for data alignment. In a “narrow” table, the size of various metadata can easily beat the size of the actual data stored.
这些内置字段会导致元组头很大:每个元组至少需要 23 个字节,并且由于 null 位映射和用于数据对齐的强制填充 而经常超出该值。在“窄”表中,各种元数据的大小很容易超过实际存储数据的大小。

Data layout on disk fully coincides with data representation in RAM. The page along with its tuples is read into the buffer cache as is, without any transformations. That’s why data files are incompatible between different platforms.
磁盘上的数据布局与 RAM 中的数据表示完全一致。该页及其元组按原样读入缓冲区高速缓存,无需任何转换。这就是不同平台之间数据文件不兼容的原因。

One of the sources of incompatibility is the byte order. For example, the x86 ar- chitecture is little-endian, z/Architecture is big-endian, and ARM has configurable byte order.
不兼容的根源之一是字节顺序。例如,x86 架构是小端字节序,z/Architecture 是大端字节序,而 ARM 具有可配置的字节顺序。

Another reason is data alignment by machine word boundaries, which is required by many architectures. For example, in a 32-bit x86 system, integer numbers (the integer type, takes four bytes) are aligned by the boundary of four-byte words, just like double-precision floating-point numbers (the double precision type, eight bytes). But in a 32-bit system, double values are aligned by the boundary of eight- byte words.
另一个原因是按机器字边界进行数据对齐,这是许多体系结构所需要的。例如,在32位x86系统中,整数(整数类型,占用四个字节)按照四字节字的边界对齐,就像双精度浮点数(双精度类型,占用八个字节) )。但在 32 位系统中,双精度值按八字节字的边界对齐。

Data alignment makes the size of a tuple dependent on the order of fields in the table. This effect is usually negligible, but in some cases it can lead to asignificant size increase. Here is an example:
数据对齐使元组的大小取决于表中字段的顺序。这种影响通常可以忽略不计,但在某些情况下可能会导致大小显著增加。这是一个例子:

=>  CREATE TABLE padding(
   b1 boolean,
   i1 integer,
   b2 boolean,
   i2 integer
);
=>  INSERT INTO padding  VALUES (true,1,false,2);
=>  SELECT lp_len  FROM heap_page_items(get_raw_page('padding', 0));
 lp_len
--------------------
   40
(1 row)

I have used the heap_page_items function of the pageinspect extension to display some details about pointers and tuples.
我使用了 pageinspect 扩展的 heap_page_items 函数来显示有关指针和元组的一些详细信息。

In PostgreSQL, tables are often referred to as heap. This is yet another obscure term that hints at the similarity between space allocation for tuples and dynamic memory alloca- tion. Some analogy can certainly be seen, but tables are managed by completely different algorithms. We can interpret this term in the sense that “everything is piled up into a heap,” by contrast with ordered indexes.
在 PostgreSQL 中,表通常称为堆。这是另一个晦涩的术语,它暗示了元组的空间分配和动态内存分配之间的相似性。当然可以看出一些类比,但是表是通过完全不同的算法管理的。我们可以将这个术语解释为“所有东西都堆积成堆”,与有序索引形成对比。

The size of the row is 40 bytes. Its header takes 24 bytes, a column of the integer type takes 4 bytes, and boolean columns take 1 byte each. It makes 34 bytes, and 6 bytes are wasted on four-byte alignment of integer columns.
行的大小为 40 字节。它的头占用 24 个字节,整数类型的列占用 4 个字节,布尔型列各占用 1 个字节。它占用 34(=24+1+4+1+4) 个字节,其中 6 (=40-34) 个字节浪费在整数列的四字节对齐上。b1 和 b2 后需要跟3个字节的填充,填充一共浪费6个字节。

If we rebuild the table, the space will be used more efficiently:
如果我们重建表,空间将得到更有效的利用:

=>  DROP TABLE padding;
=>  CREATE TABLE padding(
   i1 integer,  --把相同类型的数据放到一起
   i2 integer,
   b1 boolean,
   b2 boolean
);
 => INSERT INTO  padding VALUES  (1,2,true,false);
 => SELECT lp_len  FROM heap_page_items(get_raw_page('padding', 0));
  lp_len
 --------------------
      34
 (1 row)

Another possible micro-optimization is to start the table with the fixed-length columns that cannot contain NULL values. Access to such columns will be more efficient because it is possible to cache their offset with in the tuple.
另一种可能的 微优化是在表开始的字段使用固定长度列 但这些列 不能包含 NULL 值。对此类列的访问将更加高效,因为可以将它们的偏移量缓存在元组内。

3.3 Operations on Tuples 元组的操作

To identify different versions of one and the same row, PostgreSQL marks each of them with two values: xmin and xmax. These values define “validity time” of each row version, but instead of the actual time, they rely on ever-increasing transaction IDS.
为了识别同一行的不同版本,PostgreSQL 用两个值标记每个版本:xmin 和 xmax。这些值定义了每个行版本的“有效时间”,但它们依赖于不断增加的事务 ID,而不是实际时间。

When a row is created, its xmin value is set to the transaction ID of the INSERT com- mand.
创建行时,其 xmin 值将设置为 INSERT 命令的事务 ID。

When a row is deleted, the xmax value of its current version is set to the transaction ID of the DELETE command.
删除一行时,其当前版本的max值被设置为DELETE命令的事务ID。

With a certain degree of abstraction, the UPDATE command can be regarded as two separate operations: DELETE and INSERT First, the xmax value of the current row version is set to the transaction ID of the UPDATE command. Then a new ver- sion of this row is created; its xmin value will be the same as the xmax value of the previous version.
有了一定程度的抽象,UPDATE命令可以被视为两个独立的操作:DELETE和INSERT。首先,当前行版本的xmax值被设置为UPDATE命令的事务ID。然后创建该行的新版本;其 xmin 值将与之前版本的 xmax 值相同。

Now let’s get down to some low-level details of different operations on tuples.
现在让我们来了解元组上不同操作的一些底层细节。

For these experiments, we will need a two-column table with an index created on one of the columns:
对于这些实验,我们创建一个两列的表,并在其中一列上创建索引:

=>  CREATE TABLE   t(
  id integer    GENERATED ALWAYS AS IDENTITY,
  s  text
);
=>  CREATE INDEX ON    t(s);

*Insert

Start a transaction and insert one row:
启动事务并插入一行:

=>  BEGIN;
=>  INSERT INTO   t(s)  VALUES   ('FOO');

Here is the current transaction ID:
这是当前的事务 ID:

=>  --   txid_current() before v.13
SELECT   pg_current_xact_id();
 pg_current_xact_id
-----------------------------
                   776
(1 row)

To denote the concept of a transaction, PostgreSQL uses the term xact, which can be found both in SQL function names and in the source code. Consequently, a transaction ID can be called xact ID, TXID, or simply XID. We are going to come across these abbreviations over and over again.
为了表示事务的概念,PostgreSQL 使用术语 xact,它可以在 SQL 函数名称和源代码中找到。因此,事务 ID 可以称为 xact ID、TXID 或简称为 XID。我们会一遍又一遍地遇到这些缩写。

Let’s take a look at the page contents. The heap_page_items function can give us all the required information, but it shows the data “as is,” so the output format is a bit hard to comprehend:
我们来看看页内容。 heap_page_items 函数可以为我们提供所有必需的信息,但它“按原样”显示数据,因此输出格式有点难以理解:

-- heap_page_items 在 pageinspect 扩展中
=>  SELECT  *
FROM  heap_page_items(get_raw_page('t',0))         \gx
-[ RECORD 1 ]-----------------------------
lp             | 1     -- 数据指针 行目录
lp_off         | 8160
lp_flags       | 1
lp_len         | 32
t_xmin         | 776   -- 插入事务的id
t_xmax         | 0
t_field3       | 0
t_ctid         | (0,1)
t_infomask2    | 2
t_infomask     | 2050
t_hoff         | 24
t_bits         |
t_oid          |
t_data         | \x0100000009464f4f

To make it more readable, we can leave out some information and expand a few columns:
为了使其更具可读性,我们可以省略一些信息并扩展几列:

=>  SELECT   '(0,'||lp||')'    AS ctid,
      CASE   lp_flags
         WHEN  0 THEN  'unused'
         WHEN  1 THEN  'normal'
         WHEN  2 THEN  'redirect to '||lp_off
         WHEN  3 THEN  'dead'
      END AS   state,
      t_xmin as xmin,
      t_xmax as xmax,
      (t_infomask & 256) > 0        AS xmin_committed,
      (t_infomask & 512) > 0        AS xmin_aborted,
      (t_infomask & 1024) > 0       AS xmax_committed,
      (t_infomask & 2048) > 0       AS xmax_aborted
FROM  heap_page_items(get_raw_page('t',0))         \gx
-[ RECORD 1 ]------+--------------------
ctid               | (0,1)   -- 第0 页第1个指针
state              | normal
xmin               | 776
xmax               | 0
xmin_committed     | f
xmin_aborted       | f
xmax_committed     | f
xmax_aborted       | t   --注意这里是true

This is what has been done here:
这是这里所做的:

  • The lp pointer is converted to the standard format of a tuple ID: (page number, pointer number).
    lp指针被转换为标准格式的元组 ID:(页号指针编号)。
  • The lp_flags state is spelled out. Here it is set to the normal value, which means that it really points to a tuple.
    lp_flags 状态已详细说明。这里设置为 normal,这意味着它确实指向一个元组
  • Of all the information bits, we have singled out just two pairs so far. The xmin_committed and xmin_aborted bits show whether the xmin transaction is committed or aborted. The xmax_committed and xmax_aborted bits give similar information about the xmax transaction.
    在所有信息位中,到目前为止我们只挑选出了两对。 xmin_commitedxmin_aborted 位 显示 xmin 事务是 已提交 还是 已中止。 xmax_commitedxmax_aborted 位提供有关 xmax 事务的类似信息。

The pageinspect extension provides the heap_tuple_infomask_flags function that explains all the information bits, but I am going to retrieve only those that are required at the moment, showing them in a more concise form.
pageinspect 扩展提供了 heap_tuple_infomask_flags 函数来解释所有信息位,但我将只检索当前需要的那些信息,以更简洁的形式显示它们。

Let’s get back to our experiment. The INSERT command has added pointer 1 to the heap page; it refers to the first tuple, which is currently the only one.
回到实验。 INSERT命令已将 指针1(指针号从1 开始,不是从0) 添加到堆页;它指的是第一个元组,目前是唯一的。

The xmin field of the tuple is set to the current transaction ID. This transaction is still active, so the xmin_committed and xmin_aborted bits are not set yet.
元组的 xmin 字段设置为当前事务 ID。该事务仍然处于活动状态(没有提交也没有回滚),因此 xmin_committedxmin_aborted 位 都尚未设置。

The xmax field contains 0, which is a dummy number showing that this tuple has not been deleted and represents the current version of the row. Transactions will ignore this number because the xmax_aborted bit is set.

xmax 字段为 0,这是一个虚拟数字,表明该元组尚未被删除,并标识 该行的当前版本。事务将忽略此数字,因为设置了 xmax_aborted位,该位的值为 true 。

It may seem strange that the bit corresponding to an aborted transaction is set for the transaction that has not happened yet. But there is no difference between such transac- tions from the isolation standpoint: an aborted transaction leaves no trace, hence it has never existed.
为尚未发生的事务设置与中止事务相对应的位可能看起来很奇怪。但从隔离的角度来看,此类事务之间没有区别:中止的事务不会留下任何痕迹,因此它从未存在过。

We will use this query more than once, so I am going to wrap it into a function. And while being at it, I will also make the output more concise by hiding the information bit columns and displaying the status of transactions together with their IDS.
以后将多次使用此查询,因此将其包装到一个函数中。同时,还将通过隐藏 信息位 列并显示事务状态及其 ID 来使输出更加简洁。

=>  CREATE FUNCTION  heap_page(relname text, pageno integer)
RETURNS TABLE(ctid tid, state text, xmin text, xmax text)
AS  $$
SELECT  (pageno,lp)::text::tid  AS  ctid,
       CASE lp_flags
         WHEN 0 THEN  'unused'
         WHEN 1 THEN  'normal'
         WHEN 2 THEN  'redirect to '||lp_off
         WHEN 3 THEN  'dead'
       END AS state,
       t_xmin || CASE
         WHEN (t_infomask & 256) > 0   THEN ' c'
         WHEN (t_infomask & 512) > 0   THEN ' a'
         ELSE ''
       END AS xmin,
      t_xmax ||    CASE
         WHEN  (t_infomask & 1024) > 0      THEN  ' c'
         WHEN  (t_infomask & 2048) > 0      THEN  ' a'
         ELSE  ''
      END AS   xmax
FROM  heap_page_items(get_raw_page(relname,pageno))
ORDER BY   lp;
$$  LANGUAGE   sql;

这个sql很重要,后边很多章节都用到它,一定要理解它的目的

Now it is much clearer what is happening in the tuple header:
现在元组头中发生的事情更加清楚了:

=>  SELECT   * FROM  heap_page('t',0);
  ctid   | state    | xmin | xmax
---------+----------+------+--------------------
  (0,1)  | normal   | 776  | 0 a    -- xmax 为0 ,xmax 状态为abort
(1 row)

You can get similar but less detailed information from the table itself by querying the xmin and xmax pseudocolumns:
您可以通过查询 xmin 和 xmax 伪列从表本身获取类似但不太详细的信息:

=>  SELECT   xmin, xmax, *    FROM  t;
  xmin | xmax | id |  s
-------+------+----+--------------------
   776 |     0|   1| FOO
(1 row)

Commit

Once a transaction has been completed successfully, its status has to be stored somehow—it must be registered that the transaction is committed. For this pur- pose, PostgreSQL employs a special CLOG (commit log) structure. It is stored as files in the PGDATA/pg_xact directory rather than as a system catalog table.
一旦事务成功完成,就必须以某种方式存储其状态 - 必须 记录事务已提交。为此,PostgreSQL 采用了一种特殊的 CLOG(提交日志 commit log)结构。提交日志作为文件存储在 PGDATA/pg_xact 目录中,而不是作为 系统目录表。

Previously, these files were located in PGDATA/pg_clog, but in version 10 this directory got renamed: it was not uncommon for database administrators unfamiliar with PostgreSQL to delete it in search of free disk space, thinking that a “log” is something unnecessary.
以前,这些文件位于 PGDATA/pg_clog 中,但在版本 10 中,该目录被重命名:不熟悉 PostgreSQL 的数据库管理员认为“日志”是不必要的,为了寻找可用磁盘空间而删除它的情况并不少见。

CLOG is split into several files solely for convenience. These files are accessed page by page via buffers in the server’s shared memory.
CLOG 被分成几个文件只是为了方便。这些文件缓存在 服务器共享内存中的缓冲区内被按页访问。

Just like a tuple header, CLOG contains two bits for each transaction: committed and aborted.
就像元组头一样,CLOG 为每个事务包含两个bit:committedaborted

Once committed, a transaction is marked in CLOG with the committed bit. When any other transaction accesses a heap page, it has to answer the question: has the xmin transaction already finished?
一旦提交,事务就会在 CLOG 中用 committed bit 进行标记。当任何其他事务访问堆页时,它必须回答这个问题:xmin 事务已经完成了吗(它还活动吗)?

  • If not, then the created tuple must not be visible. To check whether the transaction is still active, PostgreSQL uses yet another structure located in the shared memory of the instance; it is called ProcArray. This structure contains the list of all the active processes, with the correspond- ing current (active) transaction specified for each process.
    如果不是,则创建的元组一定不可见。为了检查事务是否仍然处于活动状态,PostgreSQL 使用位于 实例共享内存中 的另一个结构;它被称为 ProcArray。该结构包含所有活动进程的列表,以及为每个进程指定的相应 当前(活动)事务。(ProcArray 结构包含所有 活动会话及其事务的列表)
  • If yes, was it committed or aborted? In the latter case, the corresponding tuple cannot be visible either. It is this check that requires CLOG But even though the most recent CLOG pages are stored in memory buffers, it is still expensive to perform this check every time. Once determined, the transaction status is written into the tuple header—more specifically, into xmin_committed and xmin_aborted information bits, which are also called hint bits. If one of these bits is set, then the xmin transaction status is considered to be already known, and the next transaction will have to access neither CLOG nor ProcArray.
    如果是,是已提交还是已中止?在后一种情况下,相应的元组也不可见。正是这个检查需要CLOG,但是即使最近的CLOG页存储在内存缓冲区中,每次执行这个检查仍然是昂贵的。一旦确定,事务状态就会被写入元组头中,更具体地说,写入 xmin_committed 和 xmin_aborted 信息位,也称为 提示位。如果设置了其中一个位,则 xmin 事务状态被认为是已知的,并且下一个事务将不必访问 CLOG 或 ProcArray。

Why aren’t these bits set by the transaction that performs row insertion? The prob- lem is that it is not known yet at that time whether this transaction will complete successfully. And when it is committed, it is already unclear which tuples and pages have been changed. If a transaction affects many pages, it may be too expensive to track them. Besides, some of these pages may be not in the cache anymore; reading them again to simply update the hint bits would seriously slow down the commit.
为什么执行 行插入的事务不设置这些位(元组头中)?问题是,当时 尚不知道该事务是否会成功完成。当它被 提交时,已经不清楚哪些元组和页已被更改。如果事务影响了许多页,则跟踪它们的成本可能会太高。此外,其中一些页可能不再位于缓存中;再次读取它们以简单地更新提示位会严重减慢事务提交速度。

The flip side of this cost reduction is that any transaction (even aread-only SELECT command) can start setting hint bits, thus leaving a trail of dirtied pages in the buffer cache.
这种成本降低的另一面是任何事务(甚至是仅读取的 SELECT 命令)都可以开始设置 提示位,从而在缓冲区高速缓存中留下脏页的痕迹。

Finally, let’s commit the transaction started with the INSERT statement:
最后,让我们提交以 INSERT 语句开始的事务:

 => COMMIT;

Nothing has changed in the page (but we know that the transaction status has al- ready been written into CLOG):
页中没有任何变化(但我们知道事务状态已经写入 CLOG):

 => SELECT * FROM  heap_page('t',0);
  ctid  | state   | xmin | xmax
 -------+---------+------+--------------------
  (0,1) | normal  | 776  | 0 a   
 (1 row)

Now the first transaction that accesses the page (in a “standard” way, without using pageinspect) has to determine the status of the xmin transaction and update the hint bits:
现在,第一个访问该页的事务(以“标准”方式,不使用 pageinspect)必须确定 xmin 事务的状态并更新提示位:

 => SELECT * FROM  t;
  id |  s
 ----+--------------------
   1 | FOO
 (1 row)
 => SELECT * FROM  heap_page('t',0);
  ctid  | state  | xmin  | xmax
 -------+--------+-------+--------------------
  (0,1) | normal | 776 c | 0 a     -- 这里看到 776 c,c 就是COMMITTED
 (1 row)

事务的状态有以下4种。
·TRANSACTION_STATUS_IN_PROGRESS=0x00: 表示事务正在进行中。
·TRANSACTION_STATUS_COMMITTED=0x01: 表示事务已提交。
·TRANSACTION_STATUS_ABORTED=0x02: 表示事务已回滚。
·TRANSACTION_STATUS_SUB_COMMITTED=0x03: 表示子事务已提交。
实际上, CommitLog文件是一个位图文件, 因为事务有上述4种状态, 所以需要用两位来表示一个事务的状态。 理论上数据库最多记录20亿个事务, 所以CommitLog最多占用 512MB空间。 CommitLog也会被VACUUM清理, 而数据库中的参数“autovacuum_freeze_max_age”的默认设置为2亿, 这样AutoVacuum会尽力保证数据库的事务数是2亿个, 所以通常数据库的CommitLog占用的空间是51MB左右。
可能会有人有这样的疑问, 如果每读到一行时都需要判断这一行上的xmin和xmax代 表的事务是否已提交或回滚, 因而都去读CommitLog文件, 这样效率会不会很低? 实际上是不会的, 对此PostgreSQL做了以下优化:
·PostgreSQL对CommitLog文件进行了Cache, 即在共享内存中有clog buffer, 所以多数情况下不需要读取CommitLog文件。
·在每行上有一个标志字段t_infomask, 如果标志位“HEAP_XMIN_COMMITTED”被设置, 就知道xmin代表的事务已提交, 则不需要到CommitLog文件中去判断。 同样, 如果 “HEAP_XMAX_COMMITTED”被设置, 就知道xmin代表的事务已提交, 则不需要到CommitLog文件中去判断。
PostgreSQL数据库通过以上优化手段解决了读取行时判断事务状态效率低的问题。

Delete

When a row is deleted, the xmax field of its current version is set to the transaction ID that performs the deletion, and the xmax_aborted bit is unset.
当删除一行时,其当前版本的xmax字段被设置为执行删除的事务ID,并且xmax_aborted位被 取消设置(设置为false) 。

While this transaction is active, the xmax value serves as a row lock. If another transaction is going to update or delete this row, it will have to wait until the xmax transaction is complete.
当此事务处于活动状态时,xmax 值充当行锁。如果另一个事务要更新或删除该行,则必须等到 xmax 事务完成。

Let’s delete a row:

下边删除一行:

=>  BEGIN;
=>  DELETE FROM   t;
=>  SELECT  pg_current_xact_id();
 pg_current_xact_id
-----------------------------
                777
(1 row)

The transaction ID has already been written into the xmax field, but the informa- tion bits have not been set yet:
事务 ID 已被写入 xmax 字段,但信息位尚未设置:

=>  SELECT  *  FROM  heap_page('t',0);
 ctid  | state   | xmin    | xmax
-------+---------+---------+--------------------
 (0,1) | normal  | 776 c   | 777
(1 row)

Abort

The mechanism of aborting a transaction is similar to that of commit and happens just as fast, but instead of committed it sets the aborted bit in CLOG. Although the corresponding command is called ROLLBACK no actual data rollback is happening: all the changes made by the aborted transaction in data pages remain in place.

abort事务的机制与提交类似,发生速度也一样快,但它不是 提交,而是在 CLOG 中设置abort位。尽管相应的命令被称为 ROLLBACK,但并未发生实际的数据回滚:abort 事务在数据页中所做的所有更改仍然保留。

=>  ROLLBACK;  --回滚事务
=>  SELECT  *  FROM  heap_page('t',0);
 ctid  | state  | xmin  | xmax
-------+--------+-------+--------------------
 (0,1) | normal | 776 c | 777
(1 row)

When the page is accessed, the transaction status is checked, and the tuple receives the xmax_aborted hint bit. The xmax number itself still remains in the page, but no one is going to pay attention to it anymore:
当访问该页时,会检查事务状态,并且元组会收到 xmax_aborted 提示位。 xmax 数字本身仍然保留在页中,但没有人会再关注它:

 => SELECT * FROM  t;
  id |  s
 ----+--------------------
   1 | FOO
 (1 row)
 => SELECT * FROM  heap_page('t',0);
  ctid  | state  | xmin  | xmax
 -------+--------+-------+---------
  (0,1) | normal | 776 c | 777 a   --777 a,代表777 已经aborted
 (1 row)

Update

An update is performed in such a way as if the current tuple is deleted, and then a new one is inserted:
更新的执行方式就好像删除当前元组,然后插入一个新元组:

 => BEGIN;
 => UPDATE t SET  s = 'BAR';
 => SELECT pg_current_xact_id();
  pg_current_xact_id
 -----------------------------
                  778

 (1 row)

The query returns a single row (its new version):

下边的查询返回单个行(这是它的最新版本):

=> SELECT * FROM  t;
  id |  s
 ----+--------------------
   1 | BAR
 (1 row)

But the page keeps both versions:
该页保留了两个行版本

=>  SELECT * FROM  heap_page('t',0);
 ctid  | state  | xmin   | xmax
-------+--------+--------+--------------------
 (0,1) | normal | 776 c  | 778    -- 已经被update删除的版本
 (0,2) | normal | 778    | 0 a    -- 最新插入的版本
(2 rows)

The xmax field of the previously deleted version contains the current transaction ID. This value is written on top of the old one because the previous transaction was aborted. The xmax_aborted bit is unset since the status of the current transaction is still unknown.
之前删除的版本的xmax字段包含当前的事务ID。该值写入旧值之上,因为上一事务已中止。xmax_aborted 位 未设置 由于当前事务的状态仍然未知 。

To complete this experiment, let’s commit the transaction.
为了完成这个实验,让我们提交事务。

=>  COMMIT;

3.4 Indexes 索引

Regardless of their type, indexes do not use row versioning; each row is represented by exactly one tuple. In other words, index row headers do not contain xmin and xmax fields. Index entries point to all the versions of the corresponding. table row To figure out which row version is visible, transactions have to access thetable (unless the required page appears in the visibility map). For convenience, let’s create a simple function that will use pageinspect to display all the index entries in the page (B-tree index pages store them as a flat list):
无论任何类型的索引,都不使用行版本控制;每一行都由一个元组表示。换句话说,索引行头 不包含 xmin 和 xmax 字段。索引条目指向对应的所有版本。 table 中的行 要确定哪个行版本是可见的,事务必须 回表访问(除非所需的页出现在可见性映射中)。为了方便起见,让我们创建一个简单的函数,它将使用 pageinspect 显示页中的所有索引条目(B 树索引页将它们存储为 flat list):

=>  CREATE FUNCTION  index_page(relname text, pageno integer)
RETURNS TABLE(itemoffset smallint, htid tid)
AS  $$
SELECT  itemoffset,
        htid -- ctid before v.13
FROM  bt_page_items(relname,pageno);
$$  LANGUAGE sql;

The page references both heap tuples, the current and the previous one:
该页引用了两个堆元组,当前元组和前一个元组:

 -- 上边先插入FOO,后来FOO 更行成 BAR。 FOO 的数据 实际不可见,但在索引中还存在
 => SELECT * FROM  index_page('t_s_idx',1);
  itemoffset | htid
 ------------+--------------------
           1 | (0,2)    -- 页0 第2个
           2 | (0,1)    -- 页0 第1个
 (2 rows)

Since BAR<FOO, the pointer to the second tuple comes first in the index.
由于 BAR<FOO, 第二个元组的指针 排在索引的第一个(因为创建索引前需要先排序)。

3.5 TOAST 超大属性存储技术

A TOAST table is virtually a regular table, and it has its own versioning that does not depend on row versions of the main table. However, rows of TOAST tables are handled in such a way that they are never updated; they can be either added or deleted, so their versioning is somewhat artificial. TOAST 表实际上是一个常规表,并且它有自己的版本控制 不依赖于主表的行版本。然而,TOAST 表的行的处理方式是永远不会更新;它们只可以添加或删除,因此它们的版本控制有些人为。

Each data modification results in creation of a new tuple in the main table. But if an update does not affect any long values stored in TOAST the new tuple will reference an existing toasted value. Only when a long value gets updated will PostgreSQL create both a new tuple in the main table and new “toasts.”
每次数据修改都会在主表中创建一个新元组。 但是如果 更新不会影响 TOAST 中存储的任何长值,新元组将引用现有的 toast 值。只有当 长值 被更新时,PostgreSQL 才会在主表中创建一个新的元组和新的“toasts”。

3.6 Virtual Transactions

To consume transaction IDS sparingly, PostgreSQL offers a special optimization.
为了节省事务 ID 的消耗,PostgreSQL 提供了特殊的优化。

If a transaction is read-only, it does not affect row visibility in any way. That’s why such a transaction is given a virtual XID at first, which consists of the backend process ID and a sequential number. Assigning a virtual XID does not require any synchronization between different processes, so it happens very fast. At this point, the transaction has no real ID yet: 如果事务是只读的,则它不会以任何方式影响行可见性。这就是为什么这样的事务首先被赋予一个虚拟的 XID,它由 后端进程ID 和 一个序号 组成。分配虚拟 XID 不需要不同进程之间的任何同步,因此速度非常快。此时,该事务还没有真实的ID:

=> BEGIN;
=>  --  txid_current_if_assigned() before v.13
SELECT  pg_current_xact_id_if_assigned();
 pg_current_xact_id_if_assigned
-----------------------------------------
(1 row)

At different points in time, the system can contain some virtual XIDS that have already been used. And it is perfectly normal: virtual XIDS exist only in RAM, and only while the corresponding transactions are active; they are never written into data pages and never get to disk.
在不同的时间点,系统可以包含一些已经使用过的虚拟XID。这是完全正常的:虚拟 XID 仅存在于 RAM 中,并且仅当相应的事务处于活动状态时才存在;它们永远不会写入数据页,也永远不会保存到磁盘。

Once the transaction starts modifying data, it receives an actual unique ID:
一旦事务开始修改数据,它就会生成一个实际的唯一 ID:

=>  UPDATE accounts
SET  amount = amount - 1.00;
=>  SELECT pg_current_xact_id_if_assigned();
 pg_current_xact_id_if_assigned
----------------------------------------
                              780
(1 row)
=>  COMMIT;

3.7 Subtransactions 子事务

Savepoints

SQL supports savepoints, which enable canceling some of the operations within a transaction without aborting this transaction as a whole. But such a scenario does not fit the course of action described above: the status of a transaction appliesto all its operations, and no physical data rollback is performed.
SQL 支持 保存点,可以 取消事务中的某些操作,而不会中止整个事务。但这样的场景并不符合上述的操作过程:事务的状态适用于其所有操作,并且不执行物理数据回滚。

To implement this functionality, a transaction containing a savepoint is split into several subtransactions, so their status can be managed separately.
为了实现此功能,包含保存点的事务被分为多个子事务,因此可以单独管理它们的状态。

Subtransactions have their own IDS (which are bigger than the ID of the main trans- action). The status of a subtransaction is written into CLOG in the usual manner; however, committed subtransactions receive both the committed and the aborted bits at once. The final decision depends on the status of the main transaction: ifit is aborted, all its subtransactions will be considered aborted too.
子事务有自己的ID(比主事务的ID大)。子事务的状态 以通常的方式写入CLOG;然而,已提交的子事务同时接收已提交位和已中止位。最终决定取决于主事务的状态:如果主事务被中止,则其所有子事务也将被视为中止。

The information about subtransactions is stored under the PGDATA/pg_subtrans di- rectory. File access is arranged via buffers that are located in the instance’s shared memory and have the same structure as CLOG buffers.
有关子事务的信息存储在 PGDATA/pg_subtrans 目录下。文件访问通过位于实例共享内存中的缓冲区进行安排,并且具有与 CLOG 缓冲区相同的结构。

Do not confuse subtransactions with autonomous ones. Unlike subtransactions, the latter do not depend on each other in any way. Vanilla PostgreSQL does not support autonomous transactions, and it is probably for the best: they are required in very rare cases, but their availability in other database systems often provokes misuse, which can cause a lot of trouble.
不要将子事务与自治事务混淆。与子事务不同,后者不以任何方式相互依赖。 Vanilla PostgreSQL 不支持自治事务,这可能是最好的:在极少数情况下需要它们,但它们在其他数据库系统中的可用性经常会引起误用,这可能会导致很多麻烦。

Let’s truncate the table, start a new transaction, and insert a row:
让我们截断表,启动一个新事务,并插入一行:

 => TRUNCATE TABLE  t;
 => BEGIN;
 => INSERT INTO  t(s) VALUES ('FOO');
 => SELECT pg_current_xact_id();
  pg_current_xact_id
 -----------------------------
                  782

 (1 row)

Now create a savepoint and insert another row:
现在创建一个保存点并插入另一行:

 => SAVEPOINT sp;
 => INSERT INTO  t(s) VALUES ('XYZ');
 => SELECT pg_current_xact_id();
  pg_current_xact_id
 -----------------------------
                  782
 (1 row)

Note that the pg_current_xact_id function returns the ID of the main transaction, not that of a subtransaction.
请注意,pg_current_xact_id 函数返回主事务的 ID,而不是子事务的 ID。

=>  SELECT  *
FROM  heap_page('t',0) p
  LEFT JOIN    t ON  p.ctid = t.ctid;
 ctid  | state  | xmin | xmax | id | s
-------+--------+------+------+----+-------------
 (0,1) | normal | 782  | 0 a  |  2 | FOO
 (0,2) | normal | 783  | 0 a  |  3 | XYZ
(2 rows)

Let’s roll back to the savepoint and insert the third row:
让我们回滚到保存点并插入第三行:

=>  ROLLBACK TO   sp;
=>  INSERT INTO   t(s)  VALUES   ('BAR');
=>  SELECT  *
FROM  heap_page('t',0) p
  LEFT JOIN    t ON  p.ctid = t.ctid;
 ctid  | state  | xmin | xmax | id  |    s
-------+--------+------+------+-----+------------
 (0,1) | normal | 782  | 0 a  |   2 | FOO
 (0,2) | normal | 783  | 0 a  |     |
 (0,3) | normal | 784  | 0 a  |   4 | BAR
(3 rows)

The page still contains the row added by the aborted subtransaction.
该页仍然包含由 已中止的子事务 添加的行。

Commit the changes:

提交更改:

=>  COMMIT;
=>  SELECT  *  FROM  t;
 id |    s
--------+--------------------
  2 | FOO
  4 | BAR
(2 rows)
=>  SELECT  *  FROM  heap_page('t',0);
 ctid  | state  | xmin  | xmax
-------+--------+-------+-----------
 (0,1) | normal | 782 c | 0 a
 (0,2) | normal | 783 a | 0 a
 (0,3) | normal | 784 c | 0 a
(3 rows)

Now we can clearly see that each subtransaction has its own status.
现在我们可以清楚地看到每个子事务都有自己的状态。

SQL does not allow using subtransactions directly, that is, you cannot start a new transaction before completing the current one:
SQL不允许直接使用子事务,也就是说,在完成当前事务之前不能启动新事务:

 => BEGIN;
 BEGIN
 => BEGIN;
 WARNING:  there is already a transaction in progress
 BEGIN
 => COMMIT;
 COMMIT
 => COMMIT;
 WARNING:  there is no transaction in progress
 COMMIT

Subtransactions are employed implicitly: to implement savepoints, handle excep- tions in PL/pgSQL, and in some other, more exotic cases.
子事务被隐式地使用:实现保存点、处理 PL/pgSQL 中的异常以及其他一些更奇特的情况。

Errors and Atomicity 错误和原子性

What happens if an error occurs during execution of a statement?
如果语句执行过程中出现错误怎么办?

 => BEGIN;
 => SELECT * FROM  t;
  id |  s
 ----+--------------
   2 | FOO
   4 | BAR
 (2 rows)
 => UPDATE t SET  s = repeat('X', 1/(id-4));
 ERROR:  division by zero

After a failure, the whole transaction is considered aborted and cannot perform any further operations:
失败后,整个事务被视为中止,无法执行任何进一步的操作:

 => SELECT * FROM  t;
 ERROR:  current transaction is aborted, commands ignored until end
 of transaction block

And even if you try to commit the changes, PostgreSQL will report that the trans- action is rolled back:
即使您尝试提交更改,PostgreSQL 也会报告事务已回滚:

=>  COMMIT;
ROLLBACK

Why is it forbidden to continue transaction execution after a failure? Since the already executed operations are never rolled back, we would get access to some changes made before the error—it would break the atomicity of the statement, and hence that of the transaction itself.
为什么事务失败后禁止继续执行?由于已经执行的操作永远不会回滚,因此我们可以访问错误之前所做的一些更改 - 这会破坏语句的原子性,从而破坏事务本身的原子性。

For example, in our experiment the operator has managed to update one of the two rows before the failure:
例如,在我们的实验中,操作 已成功更新失败前的两行之一:

=>  SELECT * FROM  heap_page('t',0);
 ctid  | state  | xmin  | xmax
-------+--------+-------+--------------------
 (0,1) | normal | 782 c | 785
 (0,2) | normal | 783 a | 0 a
 (0,3) | normal | 784 c | 0 a
 (0,4) | normal | 785   | 0 a
(4 rows)

On a side note, psql provides a special mode that allows you to continue a transac- tion after a failure as if the erroneous statement were rolled back:
附带说明一下,psql 提供了一种特殊模式,允许您在失败后继续事务,就像回滚错误语句一样:

=>  \set ON_ERROR_ROLLBACK on
=>  BEGIN;
=>  UPDATE t SET s = repeat('X', 1/(id-4));
ERROR:   division by zero
=>  SELECT * FROM  t;
 id  |   s
-----+--------------------
   2 | FOO
   4 | BAR
(2 rows)
=>  COMMIT;
COMMIT

As you can guess, psql simply adds an implicit savepoint before each command when run in this mode; in case of a failure, a rollback is initiated. This mode is not used by default because issuing savepoints (even if they are not rolled back to) incurs significant overhead.
正如您所猜测的,在此模式下运行时,psql 只是在每个命令之前添加一个隐式保存点;如果出现故障,则会启动回滚。默认情况下不使用此模式,因为 发出 保存点(即使它们没有回滚)会产生大量开销。


Postgres 的存储单位称为页,它们具有固定大小(默认为 8 kB)。固定页面大小为 Postgres 带来了许多优势,即数据管理简单、高效和一致性,但它也有一个缺点:某些数据值超长可能不适合该页面。


相关参考文档

怎么利用数据对齐:

一般经验法则是,只要有可能,就根据数据类型 大小的降序定义列,大的在前,小的在后

数据对齐会浪费存储空间,但是为什么要数据对齐?

正确对齐带来的加速发生在数据被检索后的处理过程中(内存中处理快,数据保存为元组需要解码),而不是在从磁盘检索数据时。这是一个微妙但重要的区别,值得强调。

为什么分页:

  1. 减少外部碎片

  2. 简化寻址, 在一个段文件中,你会发现这是由多个固定大小的页面拼接在一起。默认情况下,一个页面的大小为 8 KB,所以我们可以得知所有对象文件都是 8 KB 的倍数。在这个例子中,表文件大小为 32 KB,这意味着其中有 4 个页面。

  3. 之所以这样设计,是因为磁盘的读写速度太慢了,尤其是随机读写,通过增加单次的吞吐量,来提高读写性能。

你可能会想为什么使用页?为什么不是每个段只有一个页?答案是每个页面以一个原子操作写入,页面越大,写入过程中出现写入失败的可能性越大。页面越大,数据库性能越好,但页面越大,写入失败的可能性也越大。Postgres 维护者选择 8 KB 作为默认值,他们很清楚自己在做什么,所以通常没有理由改变这个值。

heap表是无序存储的, 数据插入heap表时根据fsm找到空闲空间满足行大小的位置插入. 更新时tuple新版本也可能存储到和old tuple不同的block内(非HOT的更新情况).

无序存储的好处是插入快, 因为不用像有序的索引那样选择固定的目标page、分裂Page、修改page的link信息等, 当快速并发插入排序字段value相近的数据时也不会因为热点page影响性能.

存储在heap表里的记录可以根据“数据块ID+块内line point”进行寻址.

cluster 就是将数据根据某个索引值的顺序重新排列并存入heap blocks中.

当按cluster 索引检索一个区间的数据时, 回表访问的block由于挨着所以IO是连续的, 可以减少IO.

特别是需要访问很多条记录, 并且这些记录散落在很多个数据块里面时, cluster 后的IO量会迅速降低.

cluster 是一次性操作, 如果数据发生更新, 新的tuple可能会插入到其他heap block, 打乱原来的顺序. 新insert的记录也不会插入到你想要的顺序中.

所以cluster 不适合更新/插入频繁的表. 更新可以通过设置fillfactor让new tuple插入当前block, 但是insert无法解决.


在 Postgres 中,一个 page 默认的大小是 8KB,默认情况下一个文件的大小是 1GB,所以能够最多容纳 131072 个 page。

如果我们采用一个 32 位 int 类型来表示一个 page 的空闲空间的话,当然是没问题的。但是如果 page 很多的话,每个 page 都需要 32 位来表示空闲空间的值。

FSM 也是需要物理存储的,为了在搜索的时候,能够更加快速,我们需要保证 FSM 占用的空间尽可能的少,所以在 Postgres 中采用了分类别的方式,将空闲空间的大小以 32 为步长,分为了 256 个区间

一个FSM文件里面有好多个FSM页,它们(FSM页)组成一棵树。每个FSM页里都是 完全二叉树(postgresql 使用数组来表示完全二叉树),

每个叶节点表示一个堆页面或者一个下层FSM页面。整个FSM结构大概长这样

每个黑框框都是一个FSM页,一个FSM文件就由这些FSM页组成,左上角的红色数字是FSM页的页号,和FSM页的磁盘布局有关,后面会说 所有FSM页构成一颗大树,这个树一般是三层或四层。最底层是0层 这颗数是几叉树和一个FSM页里的叶子节点数量有关,看图大概能看出来 不管上层还是底层的FSM页,里面都是个二叉树结构,父节点是两个孩子的最大值 底层FSM页内的叶子节点记录了数据页的空闲空间大小,上层FSM页内的叶子节点记录的是它对应第几个孩子页的根节点值

FSM被组织成一课FSM页的树,底层的FSM页存储了每一个 heap (or index) page 中可用的空闲空间,每个页对应一个字节。上层FSM页面则聚集来自于下层页面的信息 每个FSM页里是一个数组表示的二叉树,每个节点一个字节。每个叶节点表示一个堆页面或者一个下层FSM页面。在每一个非叶节点中存储了它孩子节点中的最大值

原文链接:https://blog.csdn.net/weixin_45943953/article/details/127391954

postgresql 并没有使用跨页存储的方案,而是将大型数据单独放到其余地方存储。这样在条件过滤时,会比较好,因为它不需要读取这些大的数据,而且只有当该列被选中时,才会在返回数据时去读取。这种场景下,减少了磁盘 IO 的读取,提升了性能。

同样它也有对应的缺点,那就是写入大型的数据时,会比较慢。因为它需要切片,然后插入到 toast 表中,还要更新 toast 表的索引。如果采用跨页存储,那么还可以利用磁盘顺序写的高性能。在读取整行数据时候,还需要先去寻找 toast 表的索引,然后再去读取 toast 表的数据,相比较跨页存储,仍然无法使用磁盘顺序读的高性能。

Postgresql Page 结构 | 学习笔记

# 行存之heap 表

postgres外存管理之页面的layout及操作

ctid 应用场景

Loading...
马建仓 AI 助手
尝试更多
代码解读
代码找茬
代码优化
其他
1
https://gitee.com/xuejianxinokok/mysqldoc.git
git@gitee.com:xuejianxinokok/mysqldoc.git
xuejianxinokok
mysqldoc
mysqldoc
master

搜索帮助