Redis Replication

Redis · Published 2018-10-08 08:48:00

Replication

A few things to understand ASAP about Redis replication.

1) Redis replication is asynchronous, but you can configure a master to
stop accepting writes if it appears to be not connected with at least
a given number of slaves.
2) Redis slaves are able to perform a partial resynchronization with the
master if the replication link is lost for a relatively small amount of
time. You may want to configure the replication backlog size (see the next
sections of this file) with a sensible value depending on your needs.
3) Replication is automatic and does not need user intervention. After a
network partition slaves automatically try to reconnect to masters
and resynchronize with them.

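How replication is implemented

1. Set the master's address and port.

In short, this means running the SLAVEOF command. SLAVEOF is asynchronous: once the masterhost and masterport attributes have been set, the slave returns OK to the client that sent SLAVEOF. The OK only means that the replication instruction has been accepted; the actual replication work begins after OK is returned.

2. Establish the socket connection.

After SLAVEOF has run, the slave opens a socket connection to the master at the IP and port given in the command. If the connection succeeds, the slave attaches to this socket a file event handler dedicated to replication. This handler carries out the rest of the replication work, such as receiving the RDB file and receiving the write commands propagated by the master.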

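3. Send the PING command.

Once the slave has become a client of the master, it first sends a PING command to the master. The PING serves two purposes:

1. It checks that the socket can be read from and written to normally.

2. It checks that the master can process command requests normally.

If the slave reads a "PONG" reply, the network connection between master and slave is healthy and the master can process the commands the slave sends.

4. Authenticate.

After receiving the "PONG" reply from the master, the slave moves on to authentication. Authentication takes place only if the slave has the masterauth option set; otherwise it is skipped.

When authentication is required, the slave sends the master an AUTH command whose argument is the value of the slave's masterauth option.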

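5. Send port information.

After authentication, the slave runs REPLCONF listening-port <port-number> to tell the master which port the slave is listening on.

The master records this in the slave_listening_port attribute of the corresponding client state. You can see it with info Replication: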

127.0.0.1:6379> info Replication
# Replication
role:master
connected_slaves:1
slave0:ip=127.0.0.1,port=6380,state=online,offset=3696,lag=0

6. Synchronize.

The slave sends the PSYNC command to the master to perform synchronization, bringing its own database up to the master's current state.
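
As a rough illustration of steps 3 to 6, the sketch below builds the commands a slave would send during the handshake, encoded in the Redis wire protocol (RESP). The function names and the sample password are hypothetical, and the sequence is simplified to the steps described here. The default arguments produce "PSYNC ? -1", which is what a slave with no cached master sends to ask for a full resynchronization.

```python
from typing import List, Optional


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode("utf-8")
        out.append(b"$%d\r\n" % len(data))
        out.append(data + b"\r\n")
    return b"".join(out)


def handshake_commands(listening_port: int,
                       masterauth: Optional[str] = None,
                       replid: str = "?",
                       offset: int = -1) -> List[bytes]:
    """Return, in order, the commands sent before command propagation starts."""
    commands = [encode_command("PING")]                 # step 3
    if masterauth is not None:                          # step 4, only if masterauth is set
        commands.append(encode_command("AUTH", masterauth))
    commands.append(                                    # step 5
        encode_command("REPLCONF", "listening-port", str(listening_port)))
    commands.append(                                    # step 6
        encode_command("PSYNC", replid, str(offset)))
    return commands


if __name__ == "__main__":
    # "example-password" is a placeholder, not a value from this article.
    for raw in handshake_commands(6380, masterauth="example-password"):
        print(raw)
```

Keeping the encoding separate from the ordering makes it easy to see that AUTH is the only optional step.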

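7. Propagate commands.

Once synchronization is complete, master and slave enter the command propagation phase. From then on, the master keeps sending every write command it executes to the slave, and the slave keeps receiving and executing them, which keeps the two consistent.

8. Heartbeat.

During the command propagation phase, the slave by default sends the following command to the master once per second: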

REPLCONF ACK <replication_offset>

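Here replication_offset is the slave's current replication offset.

Sending REPLCONF ACK serves three purposes for master and slave:

1> It checks the state of the network connection between master and slave.

2> It helps implement the min-slaves options (min-slaves-to-write and min-slaves-max-lag).

3> It detects lost commands.

The REPLCONF ACK command and the replication backlog were both added in Redis 2.8. Before that, neither master nor slave would notice if a command was lost during propagation.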
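
To make the three roles of REPLCONF ACK more concrete, here is a minimal sketch of the bookkeeping a master could do with the acknowledgements it receives. The SlaveAck structure and the function names are hypothetical and are not taken from the Redis source. The sketch only mirrors the idea that a slave counts as "good" when its last ACK is recent enough, and that a gap between the master's offset and the acknowledged offset reveals data the slave has not received.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlaveAck:
    """Last REPLCONF ACK seen from one slave (hypothetical structure)."""
    offset: int       # replication offset reported by the slave
    ack_time: float   # time the ACK arrived, in seconds


def good_slaves(acks: List[SlaveAck], now: float, max_lag: float) -> int:
    """Count slaves whose last ACK is at most max_lag seconds old."""
    return sum(1 for ack in acks if now - ack.ack_time <= max_lag)


def writes_allowed(acks: List[SlaveAck], now: float,
                   min_slaves_to_write: int, max_lag: float) -> bool:
    """Mimic min-slaves-to-write / min-slaves-max-lag."""
    if min_slaves_to_write == 0:          # 0 disables the check
        return True
    return good_slaves(acks, now, max_lag) >= min_slaves_to_write


def missing_bytes(master_offset: int, ack: SlaveAck) -> int:
    """Bytes not yet acknowledged; a positive value means the slave is behind."""
    return max(0, master_offset - ack.offset)
```

With min-slaves-to-write 3 and min-slaves-max-lag 10, for instance, writes_allowed returns False as soon as fewer than three slaves have acknowledged within the last 10 seconds.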

Replication-related parameters

slaveof <masterip> <masterport>
masterauth <master-password>

slave-serve-stale-data yes

slave-read-only yes

repl-diskless-sync no

repl-diskless-sync-delay 5

repl-ping-slave-period 10

repl-timeout 60

repl-disable-tcp-nodelay no

repl-backlog-size 1mb

repl-backlog-ttl 3600

slave-priority 100

min-slaves-to-write 3
min-slaves-max-lag 10

slave-announce-ip 5.5.5.5
slave-announce-port 1234

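Where:

slaveof <masterip> <masterport>: enables replication. This one directive is all that is needed.

masterauth <master-password>: must be set on the slave if the master has a password configured through requirepass.

slave-serve-stale-data: whether the slave may serve clients while the link to the master is down or while replication is still being set up. The default is yes, meaning it keeps serving requests, although clients may read stale data.

slave-read-only: puts the slave in read-only mode. Note that read-only mode only blocks write operations from clients; it has no effect on administrative commands.

repl-diskless-sync, repl-diskless-sync-delay: whether to use diskless replication. To reduce disk overhead on the master, Redis supports diskless replication, in which the generated RDB file is not saved to disk but sent straight to the slave over the network. Diskless replication suits masters running on machines with slow disks but plenty of network bandwidth. Note that diskless replication was still experimental at the time of writing.

repl-ping-slave-period: the interval, in seconds, at which the master sends a PING command to its slaves.

repl-timeout: the replication timeout.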

# The following option sets the replication timeout for:
#
# 1) Bulk transfer I/O during SYNC, from the point of view of slave.
# 2) Master timeout from the point of view of slaves (data, pings).
# 3) Slave timeout from the point of view of masters (REPLCONF ACK pings).
#
# It is important to make sure that this value is greater than the value
# specified for repl-ping-slave-period otherwise a timeout will be detected
# every time there is low traffic between the master and the slave.

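repl-disable-tcp-nodelay: when set to yes, the master waits for a while before sending TCP packets. The exact delay depends on the Linux kernel and is typically 40 milliseconds. This suits deployments where the network between master and slave is complex or bandwidth is tight. The default is no.

repl-backlog-size: the size of the replication backlog, a fixed-length queue kept on the master. It is used for the partial resynchronization introduced in Redis 2.8.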

# Set the replication backlog size. The backlog is a buffer that accumulates
# slave data when slaves are disconnected for some time, so that when a slave
# wants to reconnect again, often a full resync is not needed, but a partial
# resync is enough, just passing the portion of data the slave missed while
# disconnected.
#
# The bigger the replication backlog, the longer the time the slave can be
# disconnected and later be able to perform a partial resynchronization.
#
# The backlog is only allocated once there is at least a slave connected.

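The backlog is allocated only once a slave has connected.

repl-backlog-ttl: if all of the master's slaves have disconnected and none reconnects within the specified time, the master frees the backlog. repl-backlog-ttl sets that time; the default is 3600 seconds, and 0 means the backlog is never freed.

slave-priority: the slave's priority, used by Redis Sentinel during failover. The lower the value, the higher the slave's priority for promotion to master. Note that a value of 0 means the slave never takes part in the election.

slave-announce-ip, slave-announce-port: typically used with port forwarding or NAT, to tell the master the IP and port at which the slave can actually be reached.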
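
The effect of slave-priority can be illustrated with a small selection function. This is a simplified sketch and not Sentinel's actual code: the Candidate type is hypothetical, and the tie-breaks (larger replication offset first, then a name standing in for the run ID) reflect my understanding of Sentinel's documented ordering and should be treated as an assumption. Sentinel also applies other filters, such as how long a slave has been disconnected, which are ignored here.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str          # hypothetical label, stands in for the run ID
    priority: int      # value of slave-priority
    repl_offset: int   # how much of the master's stream was processed


def pick_candidate(slaves: List[Candidate]) -> Optional[Candidate]:
    """Pick the slave to promote: lowest non-zero priority, then highest offset."""
    eligible = [s for s in slaves if s.priority != 0]   # 0 means never promote
    if not eligible:
        return None
    return min(eligible, key=lambda s: (s.priority, -s.repl_offset, s.name))
```

A slave with priority 0 is never returned, even if it holds the most recent data.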

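The synchronization process

1. The slave sends the PSYNC command to the master.

2. On receiving PSYNC, the master runs BGSAVE to generate an RDB file in the background, and uses a buffer to record every write command executed from that point on.

3. When BGSAVE finishes, the master sends the resulting RDB file to the slave. The slave receives and loads it, bringing its database to the state the master's database was in when BGSAVE ran.

4. The master sends all the write commands recorded in the buffer to the slave. The slave executes them, bringing its database to the master's current state.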
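
The four steps above can be mimicked with plain dictionaries. This is only a toy model, assuming a database is a dict and a write is a (key, value) pair; full_sync is a hypothetical name, and nothing here touches a real RDB file.

```python
from typing import Dict, List, Tuple


def full_sync(master_db: Dict[str, str],
              writes_during_bgsave: List[Tuple[str, str]]) -> Dict[str, str]:
    """Simulate steps 2 to 4 and return the slave's resulting database."""
    snapshot = dict(master_db)               # step 2: BGSAVE captures this moment
    buffered = []
    for key, value in writes_during_bgsave:
        master_db[key] = value               # the master keeps serving writes
        buffered.append((key, value))        # and records them in a buffer
    slave_db = dict(snapshot)                # step 3: the slave loads the snapshot
    for key, value in buffered:              # step 4: the slave replays the buffer
        slave_db[key] = value
    return slave_db
```

After the call the slave's dictionary equals the master's, which is the point of replaying the buffered writes on top of the snapshot.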

Note that the buffer mentioned in step 2 has a size limit, set by client-output-buffer-limit slave 256mb 64mb 60. The syntax and explanation of this parameter are as follows:

# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>
#
# A client is immediately disconnected once the hard limit is reached, or if
# the soft limit is reached and remains reached for the specified number of
# seconds (continuously).

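This means that the master disconnects the slave immediately if the buffer grows beyond 256 MB, or if it stays above 64 MB for 60 seconds. After being disconnected, the slave notices 60 seconds later (repl-timeout) that it has received no data from the master, and it restarts replication.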
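
The hard and soft limit rule can be expressed as a small check over a series of buffer size samples. This is a sketch of the rule as described in the comment above, not the Redis implementation. The function name and the sample format are hypothetical, and the exact boundary behavior (at the limit versus strictly above it) is an assumption.

```python
from typing import List, Tuple

MB = 1024 * 1024


def should_disconnect(samples: List[Tuple[float, int]],
                      hard_limit: int, soft_limit: int,
                      soft_seconds: float) -> bool:
    """samples: (timestamp in seconds, buffer size in bytes), in time order."""
    soft_since = None                       # when the soft limit was first reached
    for ts, size in samples:
        if hard_limit and size >= hard_limit:
            return True                     # hard limit: disconnect at once
        if soft_limit and size >= soft_limit:
            if soft_since is None:
                soft_since = ts
            elif ts - soft_since > soft_seconds:
                return True                 # over the soft limit continuously
        else:
            soft_since = None               # dropped below: the clock restarts
    return False
```

With the values from this article, should_disconnect(samples, 256 * MB, 64 * MB, 60) returns True for a buffer that sits at 100 MB for more than a minute, and False if it dips below 64 MB in between.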

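Before Redis 2.8, if replication between master and slave was interrupted by a network problem, the slave still ran the SYNC command and performed a full resynchronization once the connection was re-established, which was inefficient. Starting with Redis 2.8, the PSYNC command replaces SYNC for the synchronization step of replication.

PSYNC has two modes: full resynchronization and partial resynchronization.

Log of a full resynchronization:

master: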

19544:M 05 Oct 20:44:04.713 * Slave 127.0.0.1:6380 asks for synchronization
19544:M 05 Oct 20:44:04.713 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for 'dc419fe03ddc9ba30cf2a2cf1894872513f1ef96', my replication IDs are 'f8a035fdbb7cfe435652b3445c2141f98a65e437' and '0000000000000000000000000000000000000000')
19544:M 05 Oct 20:44:04.713 * Starting BGSAVE for SYNC with target: disk
19544:M 05 Oct 20:44:04.713 * Background saving started by pid 20585
20585:C 05 Oct 20:44:04.723 * DB saved on disk
20585:C 05 Oct 20:44:04.723 * RDB: 0 MB of memory used by copy-on-write
19544:M 05 Oct 20:44:04.813 * Background saving terminated with success
19544:M 05 Oct 20:44:04.814 * Synchronization with slave 127.0.0.1:6380 succeeded

slave:

19746:S 05 Oct 20:44:04.288 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
19746:S 05 Oct 20:44:04.288 * SLAVE OF 127.0.0.1:6379 enabled (user request from 'id=3 addr=127.0.0.1:37128 fd=8 name= age=929 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=slaveof')
19746:S 05 Oct 20:44:04.712 * Connecting to MASTER 127.0.0.1:6379
19746:S 05 Oct 20:44:04.712 * MASTER <-> SLAVE sync started
19746:S 05 Oct 20:44:04.712 * Non blocking connect for SYNC fired the event.
19746:S 05 Oct 20:44:04.713 * Master replied to PING, replication can continue...
19746:S 05 Oct 20:44:04.713 * Trying a partial resynchronization (request dc419fe03ddc9ba30cf2a2cf1894872513f1ef96:1191).
19746:S 05 Oct 20:44:04.713 * Full resync from master: f8a035fdbb7cfe435652b3445c2141f98a65e437:1190
19746:S 05 Oct 20:44:04.713 * Discarding previously cached master state.
19746:S 05 Oct 20:44:04.814 * MASTER <-> SLAVE sync: receiving 224566 bytes from master
19746:S 05 Oct 20:44:04.814 * MASTER <-> SLAVE sync: Flushing old data
19746:S 05 Oct 20:44:04.815 * MASTER <-> SLAVE sync: Loading DB in memory
19746:S 05 Oct 20:44:04.817 * MASTER <-> SLAVE sync: Finished with success

Log of a partial resynchronization:

master:

19544:M 05 Oct 20:42:06.423 # Connection with slave 127.0.0.1:6380 lost.
19544:M 05 Oct 20:42:06.753 * Slave 127.0.0.1:6380 asks for synchronization
19544:M 05 Oct 20:42:06.753 * Partial resynchronization request from 127.0.0.1:6380 accepted. Sending 0 bytes of backlog starting from offset 1037.

slave:

19746:S 05 Oct 20:42:06.423 # Connection with master lost.
19746:S 05 Oct 20:42:06.423 * Caching the disconnected master state.
19746:S 05 Oct 20:42:06.752 * Connecting to MASTER 127.0.0.1:6379
19746:S 05 Oct 20:42:06.752 * MASTER <-> SLAVE sync started
19746:S 05 Oct 20:42:06.752 * Non blocking connect for SYNC fired the event.
19746:S 05 Oct 20:42:06.753 * Master replied to PING, replication can continue...
19746:S 05 Oct 20:42:06.753 * Trying a partial resynchronization (request f8a035fdbb7cfe435652b3445c2141f98a65e437:1037).
19746:S 05 Oct 20:42:06.753 * Successful partial resynchronization with master.
19746:S 05 Oct 20:42:06.753 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

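In Redis 4.0, master_replid and offset are stored in the RDB file. When a slave is shut down gracefully and restarted, Redis can reload master_replid and offset from the RDB file, which makes a partial resynchronization possible.

Partial resynchronization relies on the following three things:

1. The replication offsets of the master and the slave.

2. The master's replication backlog.

3. The node's run ID (replaced by the replication ID, master_replid, from Redis 4.0 onward).

When a slave is promoted to master, the other slaves have to resynchronize with the new master. Before Redis 4.0 this was a full resynchronization, because master_replid had changed. From Redis 4.0 onward, the new master remembers the old master's master_replid and offset, so it can accept partial resynchronization requests from the other slaves even though the master_replid in those requests differs from its own. Under the hood, when slaveof no one is executed, master_replid and master_repl_offset+1 are copied to master_replid2 and second_repl_offset.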
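
The acceptance rule for a partial resynchronization can be sketched roughly as follows. This is a simplified model rather than the Redis source: MasterState, can_partial_resync and promote are hypothetical names, and the field names simply follow the INFO output. A request is accepted when its replication ID matches the current ID, or matches the secondary ID with an offset no greater than second_repl_offset, and when the requested offset is still covered by the backlog.

```python
from dataclasses import dataclass


@dataclass
class MasterState:
    replid: str                      # master_replid
    replid2: str                     # master_replid2
    second_repl_offset: int
    backlog_first_byte_offset: int   # repl_backlog_first_byte_offset
    backlog_histlen: int             # repl_backlog_histlen


def can_partial_resync(m: MasterState, req_replid: str, req_offset: int) -> bool:
    """Decide whether a PSYNC <replid> <offset> request can be served partially."""
    id_ok = req_replid == m.replid or (
        req_replid == m.replid2 and req_offset <= m.second_repl_offset)
    if not id_ok:
        return False                 # unknown history: full resynchronization
    start = m.backlog_first_byte_offset
    return start <= req_offset <= start + m.backlog_histlen


def promote(m: MasterState, new_replid: str, master_repl_offset: int) -> MasterState:
    """Model slaveof no one: keep the old ID as replid2 and switch to a new ID."""
    return MasterState(new_replid, m.replid, master_repl_offset + 1,
                       m.backlog_first_byte_offset, m.backlog_histlen)
```

Applied to the logs above, a request for dc419f...:1191 is rejected because the ID matches neither of the master's IDs, while a request for f8a035...:1037 is accepted, provided offset 1037 is still within the backlog.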

Replication-related variables

# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6380,state=online,offset=5698,lag=0
slave1:ip=127.0.0.1,port=6381,state=online,offset=5698,lag=0
master_replid:e071f49c8d9d6719d88c56fa632435fba83e145d
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:5698
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:5698

# Replication
role:slave
master_host:127.0.0.1
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:126
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:15715bc0bd37a71cae3d08b9566f001ccbc739de
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:126
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:126

Where:

role: Value is "master" if the instance is replica of no one, or "slave" if the instance is a replica of some master instance. Note that a replica can be master of another replica (chained replication).

master_replid: The replication ID of the Redis server. After startup, each Redis node is dynamically assigned a 40-character hexadecimal string as its ID. This field is the master's ID.

master_replid2: The secondary replication ID, used for PSYNC after a failover. When slaveof no one is executed, master_replid and master_repl_offset+1 are copied to master_replid2 and second_repl_offset.

master_repl_offset: The server's current replication offset.

second_repl_offset: The offset up to which replication IDs are accepted.

repl_backlog_active: Flag indicating replication backlog is active.

repl_backlog_size: Total size in bytes of the replication backlog buffer, as set by repl-backlog-size.

repl_backlog_first_byte_offset: The master offset of the replication backlog buffer, that is, the earliest master offset still held in the backlog.

repl_backlog_histlen: Size in bytes of the data in the replication backlog buffer.

If the instance is a replica, these additional fields are provided:

master_host: Host or IP address of the master.

master_port: Master listening TCP port.

master_link_status: Status of the link (up/down) between master and slave.

master_last_io_seconds_ago: Number of seconds since the last interaction with master. By default the master sends a PING to each slave every 10 seconds (repl-ping-slave-period) to check that the slave is alive and connected. This variable shows how long ago master and slave last exchanged a heartbeat.

master_sync_in_progress: Indicate the master is syncing to the replica. In my view, this should refer to a full or partial resynchronization.

slave_repl_offset: The replication offset of the replica instance.

slave_priority: The priority of the instance as a candidate for failover.

slave_read_only: Flag indicating if the replica is read-only.

If a SYNC operation is on-going, these additional fields are provided:

master_sync_left_bytes: Number of bytes left before syncing is complete.

master_sync_last_io_seconds_ago: Number of seconds since last transfer I/O during a SYNC operation.

If the link between master and replica is down, an additional field is provided:

master_link_down_since_seconds: Number of seconds since the link is down.

The following field is always provided:

connected_slaves: Number of connected replicas.

If the server is configured with the min-slaves-to-write (or starting with Redis 5 with the min-replicas-to-write) directive, an additional field is provided:

min_slaves_good_slaves: Number of replicas currently considered good.

For each replica, the following line is added:

slaveXXX: id, IP address, port, state, offset, lag. This line describes the slave's state.

slave0:ip=127.0.0.1,port=6381,state=online,offset=1288,lag=1

How to monitor master-slave lag

# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6381,state=online,offset=560,lag=0
slave1:ip=127.0.0.1,port=6380,state=online,offset=560,lag=0
master_replid:15715bc0bd37a71cae3d08b9566f001ccbc739de
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:560

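Here master_repl_offset is the master's replication offset, and the offset in each slaveX line is that slave's replication offset. The difference between the two is the replication lag, in bytes.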
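
A minimal sketch of this calculation, assuming the text of info Replication has already been fetched from the master as a string. The helper names parse_info and slave_lag_bytes are hypothetical.

```python
from typing import Dict


def parse_info(text: str) -> Dict[str, str]:
    """Turn 'key:value' lines into a dict, skipping blanks and '# ...' headers."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def slave_lag_bytes(info: Dict[str, str]) -> Dict[str, int]:
    """Return master_repl_offset minus each slave's offset, keyed by slaveN."""
    master_offset = int(info["master_repl_offset"])
    lags = {}
    for key, value in info.items():
        # Match slave0, slave1, ... but not fields such as slave_repl_offset.
        if key.startswith("slave") and key[5:].isdigit():
            fields = dict(item.split("=", 1) for item in value.split(","))
            lags[key] = master_offset - int(fields["offset"])
    return lags
```

For the output shown above, both slaves report offset 560 against a master_repl_offset of 560, so the lag for each is 0 bytes.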

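How to estimate the backlog size

t * (master_repl_offset2 - master_repl_offset1) / (t2 - t1)

Here master_repl_offset1 and master_repl_offset2 are the values of master_repl_offset sampled at times t1 and t2 (in seconds), and t is how long the disconnections may last in seconds.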
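
The formula translates directly into a small helper. The function name and the sample numbers in the usage note are illustrative only.

```python
def estimate_backlog_bytes(offset1: int, t1: float,
                           offset2: int, t2: float,
                           max_disconnect_seconds: float) -> int:
    """Estimate repl-backlog-size from two samples of master_repl_offset."""
    if t2 <= t1:
        raise ValueError("t2 must be later than t1")
    write_rate = (offset2 - offset1) / (t2 - t1)    # bytes written per second
    return int(max_disconnect_seconds * write_rate)
```

For example, if master_repl_offset grows from 1000 to 601000 over 60 seconds, the write rate is 10000 bytes per second, and tolerating a 300-second disconnection calls for a backlog of about 3000000 bytes.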
