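PostgreSQL Source Code Walkthrough (109) - WAL#5 (Related Data Structures)
This section briefly introduces the WAL-related data structures: XLogLongPageHeaderData, XLogPageHeaderData, and XLogRecord.
I. Data Structures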
XLogPageHeaderData
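Every page (8 KB by default) of a transaction log file (WAL segment file) begins with header data.
Note: the header of the first page of each file is an XLogLongPageHeaderData (described below), not an XLogPageHeaderData.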
/*
 * Each page of XLOG file has a header like this:
 */
#define XLOG_PAGE_MAGIC 0xD098  /* can be used as WAL version indicator */

typedef struct XLogPageHeaderData
{
    // WAL version indicator; PG V11.1 --> 0xD098
    uint16      xlp_magic;      /* magic value for correctness checks */
    uint16      xlp_info;       /* flag bits, see below */
    // TimeLineID is a uint32
    TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
    // XLogRecPtr is a uint64: the page's offset within the transaction log
    XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */

    /*
     * When there is not enough space on current page for whole record, we
     * continue on the next page.  xlp_rem_len is the number of bytes
     * remaining from a previous page.
     *
     * Note that xlp_rem_len includes backup-block data; that is, it tracks
     * xl_tot_len not xl_len in the initial header.  Also note that the
     * continuation data isn't necessarily aligned.
     */
    // "backup-block data" here means full-page-write images
    uint32      xlp_rem_len;    /* total len of remaining data for record */
} XLogPageHeaderData;

#define SizeOfXLogShortPHD   MAXALIGN(sizeof(XLogPageHeaderData))

typedef XLogPageHeaderData *XLogPageHeader;
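The role of xlp_rem_len can be illustrated with a little arithmetic. The sketch below is not PostgreSQL code: the helper names are hypothetical, the short page header is assumed to occupy 24 bytes (20 bytes of fields rounded up by MAXALIGN on a typical 64-bit build), and every continuation page is assumed to carry a short header, which is not true for the first page of a segment.

```c
#include <assert.h>
#include <stdint.h>

#define XLOG_BLCKSZ     8192u   /* default WAL page size */
#define SHORT_PHD_SIZE  24u     /* assumed SizeOfXLogShortPHD on 64-bit builds */

/*
 * Hypothetical helper: a record of tot_len bytes starts at byte offset
 * start_off inside a page.  Returns how many bytes do not fit on that page,
 * i.e. the value the next page's xlp_rem_len would carry (0 if the record
 * fits completely).
 */
static uint32_t
rem_len_on_next_page(uint32_t start_off, uint32_t tot_len)
{
    assert(start_off >= SHORT_PHD_SIZE && start_off < XLOG_BLCKSZ);

    uint32_t room = XLOG_BLCKSZ - start_off;    /* bytes left on this page */

    return tot_len > room ? tot_len - room : 0;
}

/*
 * Hypothetical helper: number of additional pages needed to hold rem_len
 * continuation bytes, assuming each of them has a short header.
 */
static uint32_t
continuation_pages(uint32_t rem_len)
{
    uint32_t payload = XLOG_BLCKSZ - SHORT_PHD_SIZE;    /* usable bytes per page */

    return (rem_len + payload - 1) / payload;           /* round up */
}
```

For example, a 500-byte record that starts at offset 8000 leaves 192 bytes of room on its page, so 308 bytes continue on the next page, and that page's header would carry xlp_rem_len = 308.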
XLogLongPageHeaderData
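When the XLP_LONG_HEADER flag is set, additional fields are stored in the page header.
(This is ordinarily the case only for the first page of each transaction log file, i.e. each segment file.)
These additional fields serve to identify the file accurately.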
/*
 * When the XLP_LONG_HEADER flag is set, we store additional fields in the
 * page header.  (This is ordinarily done just in the first page of an
 * XLOG file.)  The additional fields serve to identify the file accurately.
 */
typedef struct XLogLongPageHeaderData
{
    XLogPageHeaderData std;         /* standard header fields */
    uint64      xlp_sysid;          /* system identifier from pg_control */
    uint32      xlp_seg_size;       /* just as a cross-check */
    uint32      xlp_xlog_blcksz;    /* just as a cross-check */
} XLogLongPageHeaderData;

#define SizeOfXLogLongPHD   MAXALIGN(sizeof(XLogLongPageHeaderData))

// pointer type
typedef XLogLongPageHeaderData *XLogLongPageHeader;

/* When record crosses page boundary, set this flag in new page's header */
#define XLP_FIRST_IS_CONTRECORD     0x0001
/* This flag indicates a "long" page header */
#define XLP_LONG_HEADER             0x0002
/* This flag indicates backup blocks starting in this page are optional */
#define XLP_BKP_REMOVABLE           0x0004
/* All defined flag bits in xlp_info (used for validity checking of header) */
#define XLP_ALL_FLAGS               0x0007

#define XLogPageHeaderSize(hdr)     \
    (((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD)
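To see how the flag bits and the two header sizes interact, here is a minimal, self-contained sketch. The struct and macro definitions are copied from the header above; the typedef stand-ins, the 8-byte MAXALIGN, and the functions page_header_size and page_header_looks_valid are assumptions made for illustration and are not PostgreSQL's own validation routine. With 8-byte alignment the short and long headers would come to 24 and 40 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL typedefs so the sketch compiles on its own. */
typedef uint16_t uint16;
typedef uint32_t uint32;
typedef uint64_t uint64;
typedef uint32 TimeLineID;
typedef uint64 XLogRecPtr;

/* Assumption: 8-byte maximum alignment, as on common 64-bit platforms. */
#define MAXALIGN(LEN) (((uintptr_t) (LEN) + 7) & ~((uintptr_t) 7))

#define XLOG_PAGE_MAGIC         0xD098
#define XLP_FIRST_IS_CONTRECORD 0x0001
#define XLP_LONG_HEADER         0x0002
#define XLP_BKP_REMOVABLE       0x0004
#define XLP_ALL_FLAGS           0x0007

typedef struct XLogPageHeaderData
{
    uint16      xlp_magic;
    uint16      xlp_info;
    TimeLineID  xlp_tli;
    XLogRecPtr  xlp_pageaddr;
    uint32      xlp_rem_len;
} XLogPageHeaderData;

typedef struct XLogLongPageHeaderData
{
    XLogPageHeaderData std;
    uint64      xlp_sysid;
    uint32      xlp_seg_size;
    uint32      xlp_xlog_blcksz;
} XLogLongPageHeaderData;

#define SizeOfXLogShortPHD  MAXALIGN(sizeof(XLogPageHeaderData))
#define SizeOfXLogLongPHD   MAXALIGN(sizeof(XLogLongPageHeaderData))
#define XLogPageHeaderSize(hdr) \
    (((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD)

/* Function wrapper around the macro, convenient for callers and tests. */
static size_t
page_header_size(const XLogPageHeaderData *hdr)
{
    return XLogPageHeaderSize(hdr);
}

/*
 * Hypothetical sanity check: the magic number must match and no bit
 * outside XLP_ALL_FLAGS may be set in xlp_info.
 */
static bool
page_header_looks_valid(const XLogPageHeaderData *hdr)
{
    if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
        return false;
    if ((hdr->xlp_info & ~XLP_ALL_FLAGS) != 0)
        return false;
    return true;
}
```

A reader of a WAL page would first look at xlp_info to decide how many header bytes to skip, which is exactly what XLogPageHeaderSize encodes.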
XLogRecord
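A transaction log file consists of N XLOG Records; the data structure that corresponds to the logical notion of an XLOG Record is XLogRecord.
The overall layout of an XLOG Record is:
Header data (fixed-size XLogRecord struct)
XLogRecordBlockHeader struct
XLogRecordBlockHeader struct
...
XLogRecordDataHeader[Short|Long] struct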
block data
block data
...
main data
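Each header that follows the fixed-size XLogRecord starts with a one-byte id, which is how a reader tells block references apart from the main-data header. The sketch below shows that dispatch under stated assumptions: the XLR_* constants are copied from xlogrecord.h (PostgreSQL 11), while FragmentKind, classify_id, MainDataHeader, and read_main_data_header are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Constants as defined in xlogrecord.h (PostgreSQL 11). */
#define XLR_MAX_BLOCK_ID        32
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG  254
#define XLR_BLOCK_ID_ORIGIN     253

/* Hypothetical classification of the fragment introduced by an id byte. */
typedef enum
{
    FRAG_BLOCK_REF,         /* XLogRecordBlockHeader follows */
    FRAG_MAIN_DATA_SHORT,   /* XLogRecordDataHeaderShort */
    FRAG_MAIN_DATA_LONG,    /* XLogRecordDataHeaderLong */
    FRAG_ORIGIN,            /* reserved id XLR_BLOCK_ID_ORIGIN */
    FRAG_UNKNOWN
} FragmentKind;

static FragmentKind
classify_id(uint8_t id)
{
    if (id <= XLR_MAX_BLOCK_ID)
        return FRAG_BLOCK_REF;

    switch (id)
    {
        case XLR_BLOCK_ID_DATA_SHORT:
            return FRAG_MAIN_DATA_SHORT;
        case XLR_BLOCK_ID_DATA_LONG:
            return FRAG_MAIN_DATA_LONG;
        case XLR_BLOCK_ID_ORIGIN:
            return FRAG_ORIGIN;
        default:
            return FRAG_UNKNOWN;
    }
}

/* Hypothetical result type: bytes consumed by the header, and payload size. */
typedef struct
{
    uint32_t    header_size;    /* 0 if p does not point at a main-data header */
    uint32_t    data_length;
} MainDataHeader;

/*
 * Decode a main-data header.  The short form stores the length in one byte;
 * the long form stores an unaligned uint32, so it is copied out with memcpy
 * instead of being dereferenced directly.
 */
static MainDataHeader
read_main_data_header(const uint8_t *p)
{
    MainDataHeader h = {0, 0};

    if (p[0] == XLR_BLOCK_ID_DATA_SHORT)
    {
        h.header_size = 2;
        h.data_length = p[1];
    }
    else if (p[0] == XLR_BLOCK_ID_DATA_LONG)
    {
        uint32_t    len;

        memcpy(&len, p + 1, sizeof(len));
        h.header_size = 5;
        h.data_length = len;
    }
    return h;
}
```

The two header sizes, 2 and 5 bytes, match SizeOfXLogRecordDataHeaderShort and SizeOfXLogRecordDataHeaderLong in the header file.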
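Classified by the data they store, XLOG Records fall roughly into three categories:
1. Record for backup block: stores the block written by a full-page write; this type of Record exists to solve the partial (torn) page write problem;
2. Record for (tuple) data block: after the full-page write, changes to tuples in the corresponding page are logged with this type of Record;
3. Record for Checkpoint: when a checkpoint occurs, the checkpoint information (including the Redo point) is recorded in the transaction log file.
A detailed breakdown of XLOG Records is left to later sections and is not covered here.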
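At the level of a single block reference, the difference between the first two categories shows up in the fork_flags byte of XLogRecordBlockHeader. The following sketch decodes that byte. The mask and flag values are copied from xlogrecord.h (PostgreSQL 11); the helper functions are hypothetical, and mapping the categories onto these flags is a simplification made for illustration, since one block reference may carry both a page image and rmgr data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in xlogrecord.h (PostgreSQL 11). */
#define BKPBLOCK_FORK_MASK  0x0F
#define BKPBLOCK_FLAG_MASK  0xF0
#define BKPBLOCK_HAS_IMAGE  0x10    /* a full-page image is attached */
#define BKPBLOCK_HAS_DATA   0x20    /* rmgr-specific block data is attached */
#define BKPBLOCK_WILL_INIT  0x40    /* redo will re-init the page */
#define BKPBLOCK_SAME_REL   0x80    /* RelFileNode omitted, same as previous */

/* The fork number lives in the low 4 bits. */
static uint8_t
block_fork(uint8_t fork_flags)
{
    return fork_flags & BKPBLOCK_FORK_MASK;
}

/* True if the block reference carries a full-page image (backup block). */
static bool
block_has_image(uint8_t fork_flags)
{
    return (fork_flags & BKPBLOCK_HAS_IMAGE) != 0;
}

/* True if the block reference carries rmgr payload, e.g. a tuple change. */
static bool
block_has_data(uint8_t fork_flags)
{
    return (fork_flags & BKPBLOCK_HAS_DATA) != 0;
}

/* True if the RelFileNode is omitted because it equals the previous one. */
static bool
block_same_rel(uint8_t fork_flags)
{
    return (fork_flags & BKPBLOCK_SAME_REL) != 0;
}
```

For instance, fork_flags = 0x31 decodes to fork number 1 with both BKPBLOCK_HAS_IMAGE and BKPBLOCK_HAS_DATA set.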
/*
 * The overall layout of an XLOG record is:
 *      Fixed-size header (XLogRecord struct)
 *      XLogRecordBlockHeader struct
 *      XLogRecordBlockHeader struct
 *      ...
 *      XLogRecordDataHeader[Short|Long] struct
 *      block data
 *      block data
 *      ...
 *      main data
 *
 * There can be zero or more XLogRecordBlockHeaders, and 0 or more bytes of
 * rmgr-specific data not associated with a block.  XLogRecord structs
 * always start on MAXALIGN boundaries in the WAL files, but the rest of
 * the fields are not aligned.
 *
 * The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
 * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
 * used to distinguish between block references, and the main data structs.
 */
typedef struct XLogRecord
{
    uint32      xl_tot_len;     /* total len of entire record */
    TransactionId xl_xid;       /* xact id */
    XLogRecPtr  xl_prev;        /* ptr to previous record in log */
    uint8       xl_info;        /* flag bits, see below */
    RmgrId      xl_rmid;        /* resource manager for this record */
    /* 2 bytes of padding here, initialize to zero */
    // the 2 padding bytes sit between xl_rmid and xl_crc; xl_crc itself is a 4-byte CRC-32C
    pg_crc32c   xl_crc;         /* CRC for this record */

    /* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */

} XLogRecord;

// macro: size of the fixed XLogRecord header
#define SizeOfXLogRecord    (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c))

/*
 * The high 4 bits in xl_info may be used freely by rmgr. The
 * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
 * XLogInsert caller. The rest are set internally by XLogInsert.
 */
#define XLR_INFO_MASK           0x0F
#define XLR_RMGR_INFO_MASK      0xF0

/*
 * If a WAL record modifies any relation files, in ways not covered by the
 * usual block references, this flag is set. This is not used for anything
 * by PostgreSQL itself, but it allows external tools that read WAL and keep
 * track of modified blocks to recognize such special record types.
 */
#define XLR_SPECIAL_REL_UPDATE  0x01

/*
 * Enforces consistency checks of replayed WAL at recovery. If enabled,
 * each record will log a full-page write for each block modified by the
 * record and will reuse it afterwards for consistency checks. The caller
 * of XLogInsert can use this value if necessary, but if
 * wal_consistency_checking is enabled for a rmgr this is set unconditionally.
 */
#define XLR_CHECK_CONSISTENCY   0x02

/*
 * Header info for block data appended to an XLOG record.
 *
 * 'data_length' is the length of the rmgr-specific payload data associated
 * with this block. It does not include the possible full page image, nor
 * XLogRecordBlockHeader struct itself.
 *
 * Note that we don't attempt to align the XLogRecordBlockHeader struct!
 * So, the struct must be copied to aligned local storage before use.
 */
typedef struct XLogRecordBlockHeader
{
    uint8       id;             /* block reference ID */
    uint8       fork_flags;     /* fork within the relation, and flags */
    uint16      data_length;    /* number of payload bytes (not including page
                                 * image) */

    /* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
    /* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
    /* BlockNumber follows */
} XLogRecordBlockHeader;

#define SizeOfXLogRecordBlockHeader (offsetof(XLogRecordBlockHeader, data_length) + sizeof(uint16))

/*
 * Additional header information when a full-page image is included
 * (i.e. when BKPBLOCK_HAS_IMAGE is set).
 *
 * The XLOG code is aware that PG data pages usually contain an unused "hole"
 * in the middle, which contains only zero bytes.  Since we know that the
 * "hole" is all zeros, we remove it from the stored data (and it's not counted
 * in the XLOG record's CRC, either).  Hence, the amount of block data actually
 * present is (BLCKSZ - <length of "hole" bytes>).
 *
 * Additionally, when wal_compression is enabled, we will try to compress full
 * page images using the PGLZ compression algorithm, after removing the "hole".
 * This can reduce the WAL volume, but at some extra cost of CPU spent
 * on the compression during WAL logging. In this case, since the "hole"
 * length cannot be calculated by subtracting the number of page image bytes
 * from BLCKSZ, basically it needs to be stored as an extra information.
 * But when no "hole" exists, we can assume that the "hole" length is zero
 * and no such an extra information needs to be stored.
 * Note that the original version of page image is stored in WAL instead of
 * the compressed one if the number of bytes saved by compression is less than
 * the length of extra information. Hence, when a page image is successfully
 * compressed, the amount of block data actually present is less than
 * BLCKSZ - the length of "hole" bytes - the length of extra information.
 */
typedef struct XLogRecordBlockImageHeader
{
    uint16      length;         /* number of page image bytes */
    uint16      hole_offset;    /* number of bytes before "hole" */
    uint8       bimg_info;      /* flag bits, see below */

    /*
     * If BKPIMAGE_HAS_HOLE and BKPIMAGE_IS_COMPRESSED, an
     * XLogRecordBlockCompressHeader struct follows.
     */
} XLogRecordBlockImageHeader;

#define SizeOfXLogRecordBlockImageHeader \
    (offsetof(XLogRecordBlockImageHeader, bimg_info) + sizeof(uint8))

/* Information stored in bimg_info */
#define BKPIMAGE_HAS_HOLE       0x01    /* page image has "hole" */
#define BKPIMAGE_IS_COMPRESSED  0x02    /* page image is compressed */
#define BKPIMAGE_APPLY          0x04    /* page image should be restored during
                                         * replay */

/*
 * Extra header information used when page image has "hole" and
 * is compressed.
 */
typedef struct XLogRecordBlockCompressHeader
{
    uint16      hole_length;    /* number of bytes in "hole" */
} XLogRecordBlockCompressHeader;

#define SizeOfXLogRecordBlockCompressHeader \
    sizeof(XLogRecordBlockCompressHeader)

/*
 * Maximum size of the header for a block reference.
 * This is used to size a temporary buffer for constructing the header.
 */
#define MaxSizeOfXLogRecordBlockHeader \
    (SizeOfXLogRecordBlockHeader + \
     SizeOfXLogRecordBlockImageHeader + \
     SizeOfXLogRecordBlockCompressHeader + \
     sizeof(RelFileNode) + \
     sizeof(BlockNumber))

/*
 * The fork number fits in the lower 4 bits in the fork_flags field. The upper
 * bits are used for flags.
 */
#define BKPBLOCK_FORK_MASK  0x0F
#define BKPBLOCK_FLAG_MASK  0xF0
#define BKPBLOCK_HAS_IMAGE  0x10    /* block data is an XLogRecordBlockImage */
#define BKPBLOCK_HAS_DATA   0x20
#define BKPBLOCK_WILL_INIT  0x40    /* redo will re-init the page */
#define BKPBLOCK_SAME_REL   0x80    /* RelFileNode omitted, same as previous */

/*
 * XLogRecordDataHeaderShort/Long are used for the "main data" portion of
 * the record. If the length of the data is less than 256 bytes, the short
 * form is used, with a single byte to hold the length. Otherwise the long
 * form is used.
 *
 * (These structs are currently not used in the code, they are here just for
 * documentation purposes).
 */
typedef struct XLogRecordDataHeaderShort
{
    uint8       id;             /* XLR_BLOCK_ID_DATA_SHORT */
    uint8       data_length;    /* number of payload bytes */
} XLogRecordDataHeaderShort;

#define SizeOfXLogRecordDataHeaderShort (sizeof(uint8) * 2)

typedef struct XLogRecordDataHeaderLong
{
    uint8       id;             /* XLR_BLOCK_ID_DATA_LONG */
    /* followed by uint32 data_length, unaligned */
} XLogRecordDataHeaderLong;

#define SizeOfXLogRecordDataHeaderLong (sizeof(uint8) + sizeof(uint32))

/*
 * Block IDs used to distinguish different kinds of record fragments. Block
 * references are numbered from 0 to XLR_MAX_BLOCK_ID.
 * A rmgr is free to use any ID number in that range (although you should
 * stick to small numbers, because the WAL machinery is optimized for that
 * case). A couple of ID numbers are reserved to denote the "main" data
 * portion of the record.
 *
 * The maximum is currently set at 32, quite arbitrarily. Most records only
 * need a handful of block references, but there are a few exceptions that
 * need more.
 */
#define XLR_MAX_BLOCK_ID            32

#define XLR_BLOCK_ID_DATA_SHORT     255
#define XLR_BLOCK_ID_DATA_LONG      254
#define XLR_BLOCK_ID_ORIGIN         253

#endif                          /* XLOGRECORD_H */
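The comment on XLogRecordBlockImageHeader says that an uncompressed page image stores BLCKSZ minus the "hole" bytes. The sketch below shows how a full page could be rebuilt from such an image. It is an illustration only: restore_page_with_hole and example_restore are hypothetical names, the image is assumed to be uncompressed (so the hole length is derived as BLCKSZ - length rather than read from XLogRecordBlockCompressHeader), and BLCKSZ is assumed to be the default 8192.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192     /* assumed default data page size */

/*
 * Hypothetical helper: rebuild a BLCKSZ-byte page from an uncompressed page
 * image whose all-zero "hole" was removed before it was written to WAL.
 *
 *   image        stored bytes (XLogRecordBlockImageHeader.length of them)
 *   hole_offset  number of image bytes that precede the hole
 *
 * Returns the hole length, or -1 if the inputs are inconsistent.
 */
static int
restore_page_with_hole(uint8_t *page, const uint8_t *image,
                       uint16_t image_len, uint16_t hole_offset)
{
    if (image_len > BLCKSZ || hole_offset > image_len)
        return -1;

    uint16_t    hole_length = BLCKSZ - image_len;

    /* bytes before the hole */
    memcpy(page, image, hole_offset);
    /* the hole itself is known to be all zeros */
    memset(page + hole_offset, 0, hole_length);
    /* bytes after the hole */
    memcpy(page + hole_offset + hole_length,
           image + hole_offset,
           image_len - hole_offset);

    return hole_length;
}

/* Usage sketch: a 100-byte image with 40 bytes in front of the hole. */
static int
example_restore(void)
{
    static uint8_t page[BLCKSZ];
    uint8_t     image[100];

    memset(image, 0xAB, sizeof(image));
    memset(page, 0xFF, sizeof(page));   /* stale contents, to make zeroing visible */

    int         hole = restore_page_with_hole(page, image, sizeof(image), 40);

    /* 40 data bytes, then zeros, then the last 60 data bytes */
    return hole == BLCKSZ - 100
        && page[39] == 0xAB && page[40] == 0
        && page[BLCKSZ - 61] == 0 && page[BLCKSZ - 60] == 0xAB;
}
```

When the image is compressed, this derivation no longer works because the stored length is the compressed size, which is why hole_length is then stored explicitly in XLogRecordBlockCompressHeader.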