Production Zookeeper Issue 3: connection reset

Symptom:

The connection count on one machine suddenly dropped to single digits and did not recover.

Cause:

Based on previous troubleshooting experience, we checked the epoch metric first; it had not changed, so a leader election could be ruled out.

The zxid metric had not changed either, ruling out a zxid-related issue.

The packetsReceived metric, however, had dropped to 0, which indicates the node shut down its own ZooKeeperServer.

The machine's log shows the following:

 
2023-09-01 04:02:01,662 [myid:13] - WARN [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):Follower@128] - Exception when following the leader
java.net.SocketException: Connection reset
    at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:323)
    at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:350)
    at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:803)
    at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
    at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:478)
    at java.base/sun.security.ssl.SSLSocketInputRecord.readFully(SSLSocketInputRecord.java:461)
    at java.base/sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:243)
    at java.base/sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:181)
    at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:111)
    at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1503)
    at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1474)
    at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1059)
    at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244)
    at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:263)
    at java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
    at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:228)
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:124)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1522)
2023-09-01 04:02:01,662 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):Follower@142] - Disconnected from leader (with address: /10.14.6.56:2888). Was connected for 316796607ms. Sync state: true

Clearly, the trigger was a "Connection reset", after which the connection was dropped. The second log line shows that it was the connection to the leader that failed. So what happens when this error occurs?

First, from the code we can see that only a Follower or an Observer can log the "Disconnected from leader" message. Taking Follower as an example:

org.apache.zookeeper.server.quorum.Follower#followLeader:
void followLeader() throws InterruptedException {
    self.end_fle = Time.currentElapsedTime();
    long electionTimeTaken = self.end_fle - self.start_fle;
    self.setElectionTimeTaken(electionTimeTaken);
    ServerMetrics.getMetrics().ELECTION_TIME.add(electionTimeTaken);
    LOG.info("FOLLOWING - LEADER ELECTION TOOK - {} {}", electionTimeTaken, QuorumPeer.FLE_TIME_UNIT);
    // take on the Follower role

    long connectionTime = 0;
    boolean completedSync = false;

    try {
        QuorumServer leaderServer = findLeader();
        try {
            connectToLeader(leaderServer.addr, leaderServer.hostname);
            connectionTime = System.currentTimeMillis();
            // non-core code omitted
            QuorumPacket qp = new QuorumPacket();
            while (this.isRunning()) {
                // keep reading and processing packets from the leader
                readPacket(qp);
                processPacket(qp);
            }
        } catch (Exception e) {
            LOG.warn("Exception when following the leader", e);
            closeSocket(); // close the broken socket
        }
    } finally {
        // some code omitted

        if (connectionTime != 0) {
            long connectionDuration = System.currentTimeMillis() - connectionTime;
            LOG.info(
                "Disconnected from leader (with address: {}). Was connected for {}ms. Sync state: {}",
                leaderAddr,
                connectionDuration,
                completedSync);
        }
    }
}

From the code above we can see that the follower keeps processing requests until an error occurs. When an error occurs, it closes the connection and exits this processing loop, and control returns to the caller, which is QuorumPeer#run:

case FOLLOWING:
    try {
        LOG.info("FOLLOWING");
        setFollower(makeFollower(logFactory));
        follower.followLeader();
    } catch (Exception e) {
        LOG.warn("Unexpected exception", e);
    } finally {
        follower.shutdown();
        setFollower(null);
        updateServerState(); 
    }
    break;

As we can see, follower.followLeader() normally never returns; but once it does, the Follower is shut down, the state is updated to LOOKING, and a new election takes place:


2023-09-01 04:02:01,971 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):QuorumPeer@916] - Peer state changed: looking
2023-09-01 04:02:01,971 [myid:13] - WARN [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):QuorumPeer@1568] - PeerState set to LOOKING
2023-09-01 04:02:01,971 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):QuorumPeer@1438] - LOOKING
2023-09-01 04:02:01,971 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):FastLeaderElection@945] - New election. My id = 13, proposed zxid=0x5c10c6626f
2023-09-01 04:02:01,971 [myid:13] - INFO [WorkerReceiver[myid=13]:FastLeaderElection$Messenger$WorkerReceiver@390] - Notification: my state:LOOKING; n.sid:13, n.state:LOOKING, n.leader:13, n.round:0x5de, n.peerEpoch:0x5c, n.zxid:0x5c10c6626f, message format version:0x2, n.config version:0x4e8eae578d
2023-09-01 04:02:01,972 [myid:13] - INFO [WorkerReceiver[myid=13]:FastLeaderElection$Messenger$WorkerReceiver@390] - Notification: my state:LOOKING; n.sid:14, n.state:FOLLOWING, n.leader:37, n.round:0x5dd, n.peerEpoch:0x5c, n.zxid:0x5bfffffffe, message format version:0x2, n.config version:0x4e8eae578d
2023-09-01 04:02:02,020 [myid:13] - INFO [WorkerReceiver[myid=13]:FastLeaderElection$Messenger$WorkerReceiver@390] - Notification: my state:LOOKING; n.sid:24, n.state:FOLLOWING, n.leader:37, n.round:0x5dd, n.peerEpoch:0x5c, n.zxid:0x5bfffffffe, message format version:0x2, n.config version:0x4e8eae578d
2023-09-01 04:02:02,021 [myid:13] - INFO [WorkerReceiver[myid=13]:FastLeaderElection$Messenger$WorkerReceiver@390] - Notification: my state:LOOKING; n.sid:23, n.state:FOLLOWING, n.leader:37, n.round:0x5dd, n.peerEpoch:0x5c, n.zxid:0x5bfffffffe, message format version:0x2, n.config version:0x4e8eae578d
2023-09-01 04:02:02,049 [myid:13] - INFO [WorkerReceiver[myid=13]:FastLeaderElection$Messenger$WorkerReceiver@390] - Notification: my state:LOOKING; n.sid:36, n.state:FOLLOWING, n.leader:37, n.round:0x5dc, n.peerEpoch:0x5c, n.zxid:0x5bfffffffe, message format version:0x2, n.config version:0x4e8eae578d
2023-09-01 04:02:02,070 [myid:13] - INFO [nioEventLoopGroup-7-8:NettyServerCnxn@294] - Processing conf command from /170.114.0.148:64690
2023-09-01 04:02:02,074 [myid:13] - INFO [WorkerReceiver[myid=13]:FastLeaderElection$Messenger$WorkerReceiver@390] - Notification: my state:LOOKING; n.sid:37, n.state:LEADING, n.leader:37, n.round:0x5dd, n.peerEpoch:0x5c, n.zxid:0x5bfffffffe, message format version:0x2, n.config version:0x4e8eae578d
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):QuorumPeer@902] - Peer state changed: following
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):QuorumPeer@1520] - FOLLOWING
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):ZooKeeperServer@1306] - minSessionTimeout set to 4000
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):ZooKeeperServer@1315] - maxSessionTimeout set to 40000
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):ResponseCache@45] - getData response cache size is initialized with value 400.
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):ResponseCache@45] - getChildren response cache size is initialized with value 400.
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):RequestPathMetricsCollector@109] - zookeeper.pathStats.slotCapacity = 60
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):RequestPathMetricsCollector@110] - zookeeper.pathStats.slotDuration = 15
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):RequestPathMetricsCollector@111] - zookeeper.pathStats.maxDepth = 6
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):RequestPathMetricsCollector@112] - zookeeper.pathStats.initialDelay = 5
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):RequestPathMetricsCollector@113] - zookeeper.pathStats.delay = 5
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):RequestPathMetricsCollector@114] - zookeeper.pathStats.enabled = false
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):ZooKeeperServer@1542] - The max bytes for all large requests are set to 104857600
2023-09-01 04:02:02,074 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):ZooKeeperServer@1556] - The large request threshold is set to -1
2023-09-01 04:02:02,075 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):AuthenticationHelper@66] - zookeeper.enforce.auth.enabled = false
2023-09-01 04:02:02,075 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):AuthenticationHelper@67] - zookeeper.enforce.auth.schemes = []
2023-09-01 04:02:02,075 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):ZooKeeperServer@361] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 clientPortListenBacklog -1 datadir /data/zookeeper/version-2 snapdir /data/zookeeper/version-2
2023-09-01 04:02:02,075 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):Follower@77] - FOLLOWING - LEADER ELECTION TOOK - 104 MS
2023-09-01 04:02:02,075 [myid:13] - INFO [QuorumPeer[myid=13](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):QuorumPeer@916] - Peer state changed: following - discovery

2023-09-01 04:02:02,215 [myid:13] - INFO [WorkerReceiver[myid=13]:FastLeaderElection$Messenger$WorkerReceiver@390] - Notification: my state:FOLLOWING; n.sid:35, n.state:FOLLOWING, n.leader:37, n.round:0x5dc, n.peerEpoch:0x5c, n.zxid:0x5bfffffffe, message format version:0x2, n.config version:0x4e8eae578d

From "LEADER ELECTION TOOK - 104 MS" in the log, we can see that it took 104 ms for the node to become a Follower again. Given how short that is, the impact should be minimal.

Going back to the original trigger of this problem: why did the connection reset happen? The leader's log provides a clue:


2023-09-01 04:02:01,248 [myid:37] - WARN [QuorumPeer[myid=37](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):LearnerHandler@1077] - Closing connection to peer due to transaction timeout.
2023-09-01 04:02:01,248 [myid:37] - INFO [QuorumPeer[myid=37](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):LearnerHandler@1158] - Synchronously closing socket to learner 13.

The log above shows the cause was a "transaction timeout". What does this error mean, and when does it occur? Locating it in the code, it lives in the LearnerHandler#ping method:

public void ping() {
    long id;
    if (syncLimitCheck.check(System.nanoTime())) {
        // healthy: queue a PING carrying the last proposed zxid
        id = learnerMaster.getLastProposed();
        QuorumPacket ping = new QuorumPacket(Leader.PING, id, null, null);
        queuePacket(ping);
    } else {
        // unhealthy: an outstanding proposal has not been ACKed within the sync timeout
        LOG.warn("Closing connection to peer due to transaction timeout.");
        shutdown();
    }
}


As the name suggests, this is a health check. So when does the ping happen? After the Leader starts, it runs the Leader#lead method, which keeps checking on the followers/observers to decide whether to disconnect them.

void lead() throws IOException, InterruptedException {
    // ... discovery and sync phases omitted ...

    startZkServer();

    while (true) {
        synchronized (this) {
            long start = Time.currentElapsedTime();
            long cur = start;
            long end = start + self.tickTime / 2;
            // sleep for half a tick between rounds of checks
            while (cur < end) {
                wait(end - cur);
                cur = Time.currentElapsedTime();
            }
            // ... quorum checks omitted ...
        }

        for (LearnerHandler f : getLearners()) {
            f.ping();  // ping each follower/observer in turn
        }
    }
}

Reading the code above carefully, we can also see that the check runs at a fixed interval, namely "self.tickTime / 2" in the code above. tickTime is configurable, with the following default:

public static final int DEFAULT_TICK_TIME = 3000;
protected int tickTime = DEFAULT_TICK_TIME;

So when the health check runs, what exactly does it evaluate? See LearnerHandler.SyncLimitCheck; the key code is:

private class SyncLimitCheck {

    private boolean started = false;
    private long currentZxid = 0;
    private long currentTime = 0;
    private long nextZxid = 0;
    private long nextTime = 0;

    public synchronized void start() {
        started = true;
    }

    public synchronized void updateProposal(long zxid, long time) {
        if (!started) {
            return;
        }
        if (currentTime == 0) {
            currentTime = time;
            currentZxid = zxid;
        } else {
            nextTime = time;
            nextZxid = zxid;
        }
    }

    public synchronized void updateAck(long zxid) {
        if (currentZxid == zxid) {
            currentTime = nextTime;
            currentZxid = nextZxid;
            nextTime = 0;
            nextZxid = 0;
        } else if (nextZxid == zxid) {
            LOG.warn(
                "ACK for 0x{} received before ACK for 0x{}",
                Long.toHexString(zxid),
                Long.toHexString(currentZxid));
            nextTime = 0;
            nextZxid = 0;
        }
    }

    public synchronized boolean check(long time) {
        if (currentTime == 0) {
            return true;
        } else {
            long msDelay = (time - currentTime) / 1000000;
            return (msDelay < learnerMaster.syncTimeout());
        }
    }

}

Reading the code above together with the earlier code, the picture is roughly this: when the health check runs, if there is an in-flight proposal whose ACK has not arrived within the allowed time, the peer is considered unhealthy; if the timeout has not been reached yet, a ping packet is queued to keep the check going; and if there is no in-flight proposal at all, a ping is sent as well. A small simulation of this rule follows below. The timeout itself is also a configuration item, and a mandatory one; the fields involved are shown right after the sketch.
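To make the rule concrete, here is a hypothetical mini-simulation (a standalone sketch, not ZooKeeper source; the class and method names are invented, and the 10-second timeout is borrowed from the example configuration further below):

public class SyncCheckSketch {

    static final long SYNC_TIMEOUT_MS = 10_000;   // assumed: tickTime 2000 * syncLimit 5
    static long pendingSinceNanos = 0;            // 0 means no proposal is awaiting an ACK

    static void onProposal(long nowNanos) { pendingSinceNanos = nowNanos; }

    static void onAck() { pendingSinceNanos = 0; }

    static boolean check(long nowNanos) {
        if (pendingSinceNanos == 0) {
            return true;                                       // nothing pending: healthy
        }
        long msDelay = (nowNanos - pendingSinceNanos) / 1_000_000;
        return msDelay < SYNC_TIMEOUT_MS;                      // pending too long: unhealthy
    }

    public static void main(String[] args) {
        onProposal(System.nanoTime());
        System.out.println(check(System.nanoTime()));          // true: well within 10s
        // If onAck() is never called, check() flips to false after 10s, and the
        // leader-side handler would log "transaction timeout" and close the socket.
    }
}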

/**
 * The number of ticks that can pass between sending a request and getting
 * an acknowledgment
 */
protected volatile int syncLimit;

/**
 * The number of milliseconds of each tick
 */
protected int tickTime;

@Override
public int syncTimeout() {
    return self.tickTime * self.syncLimit;
}

Here is an example configuration:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial 
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5

From this configuration we can tell that the Leader performs a health check on each Follower/Observer every 1 second (tickTime/2), and that the maximum time to wait for an acknowledgment of a proposal or ping is 10 seconds (tickTime × syncLimit). If that is exceeded, the leader disconnects them, and they see a connection reset and rejoin.
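As a quick sanity check on this arithmetic, here is a minimal standalone sketch (the class name is made up; the values are simply copied from the example zoo.cfg above):

public class SyncTimeoutMath {
    public static void main(String[] args) {
        int tickTime = 2000;    // ms, from tickTime=2000
        int syncLimit = 5;      // from syncLimit=5

        long pingIntervalMs = tickTime / 2;                 // leader pings every half tick
        long syncTimeoutMs  = (long) tickTime * syncLimit;  // max wait for an ACK

        System.out.println("ping interval: " + pingIntervalMs + " ms");  // 1000 ms
        System.out.println("sync timeout:  " + syncTimeoutMs + " ms");   // 10000 ms
    }
}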

Solution:

In fact, the chain of questions above has already told us that the real issue is why a single request could go more than 10 seconds without a response. There are many possible causes, such as excessive host resource usage or too large a request volume preventing requests from being handled in time (these factors have been ruled out for this application). Beyond those, some causes are outside our control, such as network problems causing temporary unavailability. So the plan for this problem is further observation: besides checking host load, request volume, and similar factors, record the frequency, timing, and regional pattern of occurrences to narrow it down further.

Production Zookeeper Issue 2: zxid lower 32 bits have rolled over

Symptom:

At a certain point in time the application was unavailable for about 5 minutes, causing a large number of business ZooKeeper operations to fail.

Cause analysis:

First, look at the metrics:

Checking the epoch metric tells us whether an election occurred: the epoch changed from 91 to 92.

Checking the zxid metric, we can see that a rollover occurred.

Based on these two metrics, we can basically pin this down to a zxid rollover triggering a new leader election; during the election the cluster is unavailable, which the ZooKeeper server logs confirm.

2023-08-27 17:28:55,806 [myid:37] - INFO [nioEventLoopGroup-7-10:NettyServerCnxn@525] - Closing connection to /10.24.15.252:47146 because the server is not ready
2023-08-27 17:28:55,834 [myid:37] - INFO [nioEventLoopGroup-7-13:NettyServerCnxn@525] - Closing connection to /10.14.12.141:44252 because the server is not ready

So what is the zxid, why does it overflow, and what happens when it does? Let's analyze this briefly. Before the analysis, logging into the server confirmed that it was indeed a zxid rollover:

 Shutdown called. For the reason zxid lower 32 bits have rolled over, forcing re-election, and therefore new epoch start 

With the problem confirmed, let's unpack what the zxid is for:

To guarantee the ordering of transactions, ZooKeeper uses a monotonically increasing transaction id (the zxid) to identify them; every operation (proposal) is tagged with a zxid when it is proposed. The zxid is a 64-bit number. The high 32 bits are the epoch, which indicates whether leadership has changed: every time a new leader is elected there is a new epoch, identifying the current "generation" of leadership. If the high bits are 92, for example, you can loosely think of it as the 92nd "emperor". As mentioned earlier, the JMX epoch metric can be used to tell whether an election happened. The low 32 bits can be understood as the transactions that occur within that generation. A transaction here is something that changes ZooKeeper state, so read operations are not transactions.
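As a quick illustration, here is a minimal standalone sketch (not ZooKeeper source; the class name is made up) that splits the zxid seen in the logs above into these two parts:

public class ZxidParts {
    public static void main(String[] args) {
        long zxid = 0x5c10c6626fL;           // a zxid taken from the election logs above

        long epoch   = zxid >>> 32;          // high 32 bits: leader generation (epoch)
        long counter = zxid & 0xffffffffL;   // low 32 bits: transaction counter in this epoch

        System.out.printf("epoch=0x%x counter=0x%x%n", epoch, counter);
        // prints epoch=0x5c counter=0x10c6626f, matching n.peerEpoch:0x5c in the logs
    }
}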

When a state-changing request arrives, ZooKeeper has the Leader handle it, and the first thing the Leader does is turn the request into a "transaction". A crucial part of this conversion is generating an id and attaching it to the request by setting its TxnHeader (note that non-leaders such as Followers do not generate zxids). This is what guarantees the ordering of subsequent transaction processing. The following code shows how a request's zxid is assigned:

org.apache.zookeeper.server.PrepRequestProcessor#pRequestHelper:
private void pRequestHelper(Request request) {
    try {
        switch (request.type) {
        case OpCode.createContainer:
        case OpCode.create:
        case OpCode.create2:
            CreateRequest create2Request = request.readRequestRecord(CreateRequest::new);
            pRequest2Txn(request.type, zks.getNextZxid(), request, create2Request);
            break;

Here zks.getNextZxid() fetches the monotonically increasing id:
org.apache.zookeeper.server.ZooKeeperServer#getNextZxid:
long getNextZxid() {
    return hzxid.incrementAndGet();
}

Since this id keeps increasing, it is bound to overflow eventually. When it does, the leader hits an exception while processing the request that would roll it over:

org.apache.zookeeper.server.quorum.Leader#propose:
public Proposal propose(Request request) throws XidRolloverException {
 
    if ((request.zxid & 0xffffffffL) == 0xffffffffL) {
        String msg = "zxid lower 32 bits have rolled over, forcing re-election, and therefore new epoch start";
        shutdown(msg);
        throw new XidRolloverException(msg);
    }

Once this exception is thrown, the code ultimately shuts down the leader and triggers a new election, so a fresh round of elections begins. Until the election succeeds, the service is unavailable; on success, a log like the following is printed:


2023-08-27 17:34:23,137 [myid:37] - INFO [QuorumPeer[myid=37](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):ZooKeeperServer@361] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 clientPortListenBacklog -1 datadir /data/zookeeper/version-2 snapdir /data/zookeeper/version-2
2023-08-27 17:34:23,138 [myid:37] - INFO [QuorumPeer[myid=37](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):Leader@584] - LEADING - LEADER ELECTION TOOK - 342465 MS

This log shows how long the election took: about 6 minutes (342465 ms) in the example above, which also means the service was unavailable for roughly 6 minutes.

Solution:

From the analysis above, a zxid rollover in ZooKeeper is bound to happen sooner or later. When facing this problem, we can first evaluate two things:

(1) How long each occurrence lasts. As shown above, this can be read directly from the logs; for this application it is about 6 minutes.

(2) How often it occurs. From the mechanism it is easy to see that this depends on the volume of write requests. For this application it can be inferred from the timestamps of the two most recent occurrences in the logs; if the logs are not retained long enough, it can also be estimated from the growth rate of the zxid metric.


For example, take one week of data (ignoring monthly/quarterly business variation):

time: 2023-09-01 03:00:00 zxid {host: ip-10-0-4-123}: 280105949.2
time: 2023-09-08 00:00:00 zxid {host: ip-10-0-4-123}: 692368820.8
From the data above: roughly 410 million per week, or about 60 million per day.

Dividing the number of ids the low 32 bits of a zxid can hold by the daily volume gives a rough rollover interval: 4294967296 ÷ 60000000 ≈ 71 days.
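The same estimate as a tiny standalone sketch (the class name is made up; the numbers are the samples quoted above):

public class RolloverEstimate {
    public static void main(String[] args) {
        double zxidStart = 280_105_949.2;   // 2023-09-01 03:00:00
        double zxidEnd   = 692_368_820.8;   // 2023-09-08 00:00:00

        double perDay   = (zxidEnd - zxidStart) / 7;  // ~59 million transactions per day
        double capacity = 4_294_967_296.0;            // 2^32 ids in the low 32 bits

        // prints roughly 73, in the same ballpark as the ~71-day estimate above
        System.out.printf("days until rollover ~ %.0f%n", capacity / perDay);
    }
}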

Putting this together: it will happen roughly every 2-3 months, and each occurrence causes about 5-6 minutes of impact.

So how can we solve, or at least mitigate, this problem? The options boil down to the following:

(1) Postpone the rollover, for example by cutting unnecessary write traffic. Some people have also gone further and modified the ZooKeeper source and repackaged it, specifically giving the counter more than the lower 32 bits of the 64-bit id; the rationale is that elections are rare events and do not need a full 32 bits for the epoch. Personally, though, I think this kind of modification is quite unfriendly to future upgrades and maintenance.

(2) Reduce the duration of the impact, i.e. shorten the election itself. This application runs 7 nodes, all in the Follower (voting) role; could a few of them be started as Observers instead? With fewer nodes participating in the election, the election should naturally be faster. This still needs to be verified; a hypothetical configuration fragment is sketched below.
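For reference, a hypothetical zoo.cfg fragment for running one node as an Observer (the server id and host are made up; the roster must match the real ensemble and should be validated before use):

# on the node that should become an Observer
peerType=observer

# in the server list on every node, mark that server as an observer
server.5=zk5.example.com:2888:3888:observer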

Of course, besides these direct approaches there are indirect ones, such as reducing the dependency on ZooKeeper or moving to a different store; we will not expand on those here.

Aside:

Let me also share a detour taken while locating this problem. When the issue first appeared, it was actually noticed through a change in traffic (the JMX metric backed by org.apache.zookeeper.server.ServerStats#packetsReceived), as in the chart below:

At the time the dashboards did not show the two key metrics described above, so the first instinct was that the server had restarted: this counter is the cumulative number of packets ZooKeeper has received and normally only increases, so a sudden drop to 0 strongly suggests a restart. However, the Linux last command showed that nobody had logged in around the time of the incident, and the system logs showed no sign of a restart either:

// Output of the last command:

[root@ip-10-0-0-247 user]# last |more
user pts/1 38.99.100.2 Mon Sep 11 01:06 still logged in
user pts/1 38.99.100.2 Fri Sep 8 06:15 - 10:25 (04:09)
stroller. pts/3 38.99.100.2 Thu Sep 7 02:24 - 06:40 (04:16)

Later, after digging into the source code, it turned out that besides a ZooKeeper restart, there are two other situations that reset this metric to 0, and neither of them restarts ZooKeeper:

(1) Execution of the four-letter command srst:

public static class StatResetCommand extends GetCommand {

    public StatResetCommand() {
        super(Arrays.asList("stat_reset", "srst"));
    }

    @Override
    public CommandResponse runGet(ZooKeeperServer zkServer, Map<String, String> kwargs) {
        CommandResponse response = initializeResponse();
        zkServer.serverStats().reset();
        return response;
    }

}

(2) An error such as a zxid rollover causing a shutdown and a new election. In this case (as in ours), the leader is effectively shut down first, and the node then takes on a new role to serve requests (LeaderZooKeeperServer, FollowerZooKeeperServer, and so on). The metrics store of this new server instance (the parent-class field ZooKeeperServer#serverStats) starts from a fresh, initialized state. In short, this amounts to a "restart" performed without stopping the Java process.

This can also be verified from the two log lines shown earlier: first the shutdown, then a new zk server being created:

// the new server:
2023-08-27 17:34:23,137 [myid:37] - INFO [QuorumPeer[myid=37](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):ZooKeeperServer@361] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 clientPortListenBacklog -1 datadir /data/zookeeper/version-2 snapdir /data/zookeeper/version-2

Another point: ServerStats#packetsReceived is easy to misread as the number of network packets, which leads to misunderstandings such as "if one request is split into two packets, packetsReceived becomes 2". In fact, packetsReceived is better understood as the number of requests. See NettyServerCnxn#receiveMessage:

Key logic in receiveMessage:
if (bb.remaining() == 0) { // the request body has been read completely (len bytes)
    bb.flip();
    // update the metrics
    packetReceived(4 + bb.remaining());
    // turn the buffer into a request and hand it off for further processing
    RequestHeader h = new RequestHeader();
    ByteBufferInputStream.byteBuffer2Record(bb, h);
    RequestRecord request = RequestRecord.fromBytes(bb.slice());
    zks.processPacket(this, h, request);
}

// the metrics update method
protected void packetReceived(long bytes) {
    ServerStats serverStats = serverStats();
    if (serverStats != null) {
        // calls packetsReceived.incrementAndGet(), backing the packetsReceived counter
        serverStats().incrementPacketsReceived();
    }
    ServerMetrics.getMetrics().BYTES_RECEIVED_COUNT.add(bytes); // backs bytes_received_count
}

 

So although the investigation took a detour, it was also a chance to learn more, and it is a big help for making sense of the various metrics the next time a problem shows up.