http protocol (1) – different types of timeout

In our project we set various timeouts on HttpClient 4.x, yet some requests still took far longer than expected. Part of them turned out to be caused by the virtual machine or GC pauses, and another part of the long-running requests came down to slow DNS resolution. It is therefore worth walking through every timeout involved in using HttpClient, so that we have a correct and clear picture of the maximum time a single request can take.

Looking at HttpClient, a single HTTP request roughly follows one of two paths:

(1) Get a connection from the pool -> send the HTTP request -> receive the HTTP response

(2) Try to get a connection from the pool, none is available but a new one may be created -> DNS resolution -> establish the connection (if the hostname resolves to multiple addresses and the first one is unreachable, each address is tried in turn) -> TLS handshake (assuming HTTPS) -> send the HTTP request -> receive the HTTP response.

Breaking down these two paths, at least four timeouts (socket timeout, connect timeout, connection request timeout, DNS resolution timeout) determine how long a request can take.

(1) Maximum wait for a connection from the pool
Reusing pooled connections avoids constantly opening and closing connections and improves efficiency. When the pool is exhausted, a bounded wait can be configured for obtaining a connection instead of waiting forever.

1.1 Error stack:

  org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
! at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) 
! at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) 

1.2 Source code analysis:

   @Override
    public ConnectionRequest requestConnection(
            final HttpRoute route,
            final Object state) {
        Args.notNull(route, "HTTP route");
        if (this.log.isDebugEnabled()) {
            this.log.debug("Connection request: " + format(route, state) + formatStats(route));
        }
        final Future<CPoolEntry> future = this.pool.lease(route, state, null);  // lease a connection from the pool
        return new ConnectionRequest() {

            @Override
            public boolean cancel() {
                return future.cancel(true);
            }

            @Override
            public HttpClientConnection get(
                    final long timeout,
                    final TimeUnit tunit) throws InterruptedException, ExecutionException, ConnectionPoolTimeoutException {
                return leaseConnection(future, timeout, tunit);
            }

        };

    }
    protected HttpClientConnection leaseConnection(
            final Future<CPoolEntry> future,
            final long timeout,
            final TimeUnit tunit) throws InterruptedException, ExecutionException, ConnectionPoolTimeoutException {
        final CPoolEntry entry;
        try {
            entry = future.get(timeout, tunit);
            if (entry == null || future.isCancelled()) {
                throw new InterruptedException();
            }
            Asserts.check(entry.getConnection() != null, "Pool entry with no connection");
            if (this.log.isDebugEnabled()) {
                this.log.debug("Connection leased: " + format(entry) + formatStats(entry.getRoute()));
            }
            return CPoolProxy.newProxy(entry);
        } catch (final TimeoutException ex) {
            throw new ConnectionPoolTimeoutException("Timeout waiting for connection from pool");
        }
    }

1.3 How to set it:

	RequestConfig requestConfig = RequestConfig.custom()
			.setConnectionRequestTimeout(2 * 1000)
			.build();
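For context, a minimal end-to-end sketch of where this timeout fits (the class name, pool sizes and the 2-second value are illustrative assumptions, not recommendations): the connection request timeout only matters once the pool is exhausted, i.e. when maxTotal / defaultMaxPerRoute connections are already leased.

    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

    public class PoolTimeoutExample {
        public static CloseableHttpClient build() {
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            cm.setMaxTotal(20);            // hypothetical pool size
            cm.setDefaultMaxPerRoute(10);  // hypothetical per-route limit

            RequestConfig requestConfig = RequestConfig.custom()
                    // max time to wait for a free pooled connection, in milliseconds
                    .setConnectionRequestTimeout(2 * 1000)
                    .build();

            return HttpClients.custom()
                    .setConnectionManager(cm)
                    .setDefaultRequestConfig(requestConfig)
                    .build();
        }
    }

If no connection becomes free within those 2 seconds, leaseConnection() above throws the ConnectionPoolTimeoutException shown in 1.1.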

(2) DNS resolution time

With a hostname in hand, DNS resolution has to happen before the connection can be established. DNS results are usually cached for a while (for example 5 minutes), but never forever, since that would pin traffic to a single host, so resolution cannot be avoided. HttpClient, however, exposes no timeout parameter for this step; the only way to bound it is to make the resolution itself asynchronous.

2.1 Error stack:

Caused by: java.net.UnknownHostException: nebulaik.webex.com.cn: Temporary failure in name resolution
        at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
        at java.net.InetAddress.getAllByName(InetAddress.java:1192)
        at java.net.InetAddress.getAllByName(InetAddress.java:1126)
        at com.webex.dsagent.client.http.DSADnsResolver.resolve(DSADnsResolver.java:24)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:111)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)

2.2 Analysis:

Most requests that hit this error take 15 seconds, which comes straight from the system configuration (/etc/resolv.conf): with timeout:5 and retries:3, three attempts of 5 seconds each add up to 15 seconds, as the dig examples below confirm.

options rotate timeout:5 retries:3
nameserver 10.224.91.88
nameserver 10.224.91.99

The dig command can be used to diagnose this.

dig provides two relevant options:

+time=T
Sets the timeout for a query to T seconds. The default timeout is 5 seconds. An attempt to set T to less than 1 will result in a query timeout of 1 second being applied.
+tries=T
Sets the number of times to try UDP queries to server to T instead of the default, 3. If T is less than or equal to zero, the number of tries is silently rounded up to 1.

Example:

[root@centos~]# time dig www.baidu.com +time=1

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.4 <<>> www.baidu.com +time=1
;; global options: +cmd
;; connection timed out; no servers could be reached

real  0m3.010s
user 0m0.003s
sys  0m0.007s
[root@centos~]# time dig www.baidu.com +time=5

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.4 <<>> www.baidu.com +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached

real  0m15.013s
user 0m0.006s
sys  0m0.005s

Another way to trace the problem, with a real case: resolving aaa.com.cn was timing out.

dig +trace aaa.com.cn   // example

dig +trace narrowed it down to:

aaa.com.cn. 600 IN CNAME bbb.com.cn.  // alias
bbb.com.cn. 120 IN NS ns1.dns.com.cn.
bbb.com.cn. 120 IN NS ns2.dns.com.cn.

ns1.dns.com.cn and ns2.dns.com.cn are the name servers responsible for this domain, but after an environment migration they had become unreachable. Once those two records were corrected, the timeouts disappeared.

2.3 How to set it:

Either tune the system configuration (options rotate timeout:5 retries:3) or make DNS resolution asynchronous.

A reference for the various DNS errors:
https://zhuanlan.zhihu.com/p/40659713
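As noted above, HttpClient exposes no DNS timeout parameter, but it does accept a custom org.apache.http.conn.DnsResolver (the stack trace in 2.1 shows exactly such a custom resolver, DSADnsResolver, in use). A minimal sketch of pushing the lookup onto a worker thread so the caller can give up after a bound; the class name, the 2-second limit and the executor choice are illustrative assumptions:

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import org.apache.http.conn.DnsResolver;

    public class TimeLimitedDnsResolver implements DnsResolver {
        private final ExecutorService executor = Executors.newCachedThreadPool();
        private final long timeoutMillis = 2 * 1000; // hypothetical upper bound per lookup

        @Override
        public InetAddress[] resolve(final String host) throws UnknownHostException {
            Future<InetAddress[]> future = executor.submit(() -> InetAddress.getAllByName(host));
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                // the caller is unblocked, but the underlying lookup thread may still be stuck
                future.cancel(true);
                throw new UnknownHostException("DNS lookup for " + host
                        + " did not finish within " + timeoutMillis + " ms");
            }
        }
    }

Such a resolver can be wired in through HttpClientBuilder.setDnsResolver(...) or the PoolingHttpClientConnectionManager constructors that accept a DnsResolver.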

Side note: if a hostname resolves to multiple addresses and the first one is unreachable, they are tried one by one until one succeeds or the last one fails. This is another reason some requests appear to take extremely long, because the connect timeout applies per address. Source:

org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(ManagedHttpClientConnection, HttpHost, InetSocketAddress, int, SocketConfig, HttpContext)

   @Override
    public void connect(
            final ManagedHttpClientConnection conn,
            final HttpHost host,
            final InetSocketAddress localAddress,
            final int connectTimeout,
            final SocketConfig socketConfig,
            final HttpContext context) throws IOException {
        final Lookup<ConnectionSocketFactory> registry = getSocketFactoryRegistry(context);
        final ConnectionSocketFactory sf = registry.lookup(host.getSchemeName());
        if (sf == null) {
            throw new UnsupportedSchemeException(host.getSchemeName() +
                    " protocol is not supported");
        }
        final InetAddress[] addresses = host.getAddress() != null ?
                new InetAddress[] { host.getAddress() } : this.dnsResolver.resolve(host.getHostName());
        final int port = this.schemePortResolver.resolve(host);
        for (int i = 0; i < addresses.length; i++) {
            final InetAddress address = addresses[i];
            final boolean last = i == addresses.length - 1;

            Socket sock = sf.createSocket(context);
            sock.setSoTimeout(socketConfig.getSoTimeout());
            sock.setReuseAddress(socketConfig.isSoReuseAddress());
            sock.setTcpNoDelay(socketConfig.isTcpNoDelay());
            sock.setKeepAlive(socketConfig.isSoKeepAlive());
            final int linger = socketConfig.getSoLinger();
            if (linger >= 0) {
                sock.setSoLinger(true, linger);
            }
            conn.bind(sock);

            final InetSocketAddress remoteAddress = new InetSocketAddress(address, port);
            if (this.log.isDebugEnabled()) {
                this.log.debug("Connecting to " + remoteAddress);
            }
            try {
                sock = sf.connectSocket(
                        connectTimeout, sock, host, remoteAddress, localAddress, context);
                conn.bind(sock);
                if (this.log.isDebugEnabled()) {
                    this.log.debug("Connection established " + conn);
                }
                return;
            } catch (final SocketTimeoutException ex) {
                if (last) {
                    throw new ConnectTimeoutException(ex, host, addresses);
                }
            } catch (final ConnectException ex) {
                if (last) {
                    final String msg = ex.getMessage();
                    if ("Connection timed out".equals(msg)) {
                        throw new ConnectTimeoutException(ex, host, addresses);
                    } else {
                        throw new HttpHostConnectException(ex, host, addresses);
                    }
                }
            } catch (final NoRouteToHostException ex) {
                if (last) {
                    throw ex;
                }
            }
            if (this.log.isDebugEnabled()) {
                this.log.debug("Connect to " + remoteAddress + " timed out. " +
                        "Connection will be retried using another IP address");
            }
        }
    }

(3) Connection establishment time

3.1 Error stack:

Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to demosite.yy.zz:443 failed: connect timed out
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:150)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.jboss.resteasy.client.jaxrs.engines.ApacheHttpClient4Engine.invoke(ApacheHttpClient4Engine.java:312)
        ... 69 more
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:337)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141)
        ... 79 more

3.2 Source code analysis:

    void socketConnect(InetAddress address, int port, int timeout)
        throws IOException {
        int nativefd = checkAndReturnNativeFD();
 
        int connectResult;
        if (timeout <= 0) {  // no timeout set: block forever
            connectResult = connect0(nativefd, address, port);
        } else {  // timeout set: wait at most this long
            configureBlocking(nativefd, false);  // switch the fd to non-blocking mode first
            try {
                connectResult = connect0(nativefd, address, port); // initiate the connection
                if (connectResult == WOULDBLOCK) {
                    waitForConnect(nativefd, timeout); // block for at most the timeout
                }
            } finally {
                configureBlocking(nativefd, true);  // restore blocking mode
            }
        }
 
    }
 

3.3 How to set it:

	RequestConfig requestConfig = RequestConfig.custom()
			.setConnectTimeout(2 * 1000)
			.build();

(4) Data processing (read) time:

After the request has been sent, it takes some time for the response to arrive; this is the data processing time. In addition, the TLS handshake that follows connection establishment also consists of several request/response exchanges, and this timeout applies to that interaction as well.

4.1 Error stack:

Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
        at sun.security.ssl.InputRecord.read(InputRecord.java:503)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
        at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
        at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
        at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
        at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:155)
        at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:284)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
        at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)

4.2 Source code analysis:

After sending the request, HttpClient immediately blocks to read the response:

    public HttpResponse execute(
            final HttpRequest request,
            final HttpClientConnection conn,
            final HttpContext context) throws IOException, HttpException {
        try {
            HttpResponse response = doSendRequest(request, conn, context);
            if (response == null) {
                response = doReceiveResponse(request, conn, context); // blocking read of the response
            }
            return response;
        }
    /**
     *  Enable/disable {@link SocketOptions#SO_TIMEOUT SO_TIMEOUT}
     *  with the specified timeout, in milliseconds. With this option set
     *  to a non-zero timeout, a read() call on the InputStream associated with
     *  this Socket will block for only this amount of time.  If the timeout
     *  expires, a <B>java.net.SocketTimeoutException</B> is raised, though the
     *  Socket is still valid. The option <B>must</B> be enabled
     *  prior to entering the blocking operation to have effect. The
     *  timeout must be {@code > 0}.
     *  A timeout of zero is interpreted as an infinite timeout.
     *
     * @param timeout the specified timeout, in milliseconds.
     * @exception SocketException if there is an error
     * in the underlying protocol, such as a TCP error.
     * @since   JDK 1.1
     * @see #getSoTimeout()
     */
    public synchronized void setSoTimeout(int timeout) throws SocketException {  // only affects reads
        if (isClosed())
            throw new SocketException("Socket is closed");
        if (timeout < 0)
          throw new IllegalArgumentException("timeout can't be negative");

        getImpl().setOption(SocketOptions.SO_TIMEOUT, new Integer(timeout));
    }
    /**
     * Reads into an array of bytes at the specified offset using
     * the received socket primitive.
     * @param fd the FileDescriptor
     * @param b the buffer into which the data is read
     * @param off the start offset of the data
     * @param len the maximum number of bytes read
     * @param timeout the read timeout in ms
     * @return the actual number of bytes read, -1 is
     *          returned when the end of the stream is reached.
     * @exception IOException If an I/O error has occurred.
     */
    private native int socketRead0(FileDescriptor fd,
                                   byte b[], int off, int len,
                                   int timeout) // blocking read bounded by the timeout

4.3 How to set it:

	RequestConfig requestConfig = RequestConfig.custom()
			.setSocketTimeout(requestTimeoutInSecond * 1000)
			.build();

If so_timeout is not set while the connect timeout from section (3) is set, then during connection setup so_timeout = connect timeout:

org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(int, Socket, HttpHost, InetSocketAddress, InetSocketAddress, HttpContext)
            if (connectTimeout > 0 && sock.getSoTimeout() == 0) {
                sock.setSoTimeout(connectTimeout);
            }
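Besides the per-request RequestConfig shown in 4.3, the connect() code quoted in section (2) also applies socketConfig.getSoTimeout() when it creates the socket, so a read timeout can alternatively be supplied as a connection-manager-wide default. A minimal sketch (the class name and the 5-second value are illustrative assumptions):

    import org.apache.http.config.SocketConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

    public class DefaultSocketTimeoutExample {
        public static CloseableHttpClient build() {
            // default SO_TIMEOUT for every socket this manager creates, in milliseconds
            SocketConfig socketConfig = SocketConfig.custom()
                    .setSoTimeout(5 * 1000)
                    .build();
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            cm.setDefaultSocketConfig(socketConfig);
            return HttpClients.custom()
                    .setConnectionManager(cm)
                    .build();
        }
    }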

Summary:
Besides the timeout parameters above, which directly bound how long a request takes, retry mechanisms are another factor, and they essentially multiply the cost; they are not covered further here. From this angle, the time one request takes is mainly determined by:
(1) GC stop-the-world pauses
(2) The effects of virtualization
(3) The various application-level and system-level timeout parameters
(4) Application-level and system-level retry counts

As the sections above show, unless the whole request pipeline is made fully asynchronous, you have to keep all of these parameters in mind to state with confidence how long a request can take at most. Asynchrony, however, brings its own costs (management complexity, loss of back-pressure, and so on), so it is a trade-off.
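Putting the pieces together, a closing sketch with all three client-side timeouts bounded (the class name and values are illustrative assumptions). Under these settings, a rough upper bound for a single attempt is connectionRequestTimeout + DNS resolution time + connectTimeout × number of resolved addresses + socketTimeout per blocking read, and any retry policy multiplies that further:

    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class BoundedHttpClientFactory {
        public static CloseableHttpClient create() {
            RequestConfig config = RequestConfig.custom()
                    .setConnectionRequestTimeout(2 * 1000) // (1) wait for a pooled connection
                    .setConnectTimeout(2 * 1000)           // (3) TCP connect, applied per resolved address
                    .setSocketTimeout(5 * 1000)            // (4) max wait for each blocking read
                    .build();
            return HttpClients.custom()
                    .setDefaultRequestConfig(config)
                    .disableAutomaticRetries() // optional: removes the retry multiplier mentioned above
                    .build();
        }
    }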

metric driven (6) – common arch solutions

TIG:

telegraf

1 install

1.1 create /etc/yum.repos.d/influxdb.repo:

[influxdb]
name = InfluxDB Repository - RHEL $releasever
baseurl = https://repos.influxdata.com/rhel/$releasever/$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key

1.2 sudo yum install telegraf

1.3 startup

sudo service telegraf start
Or if your operating system is using systemd (CentOS 7+, RHEL 7+):
sudo systemctl start telegraf

2 Config:

The default config file is /etc/telegraf/telegraf.conf (see also https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf). Telegraf is organized around input, processor, and output plugins.

So with the stock configuration, untouched, Telegraf collects the following inputs:

inputs.disk inputs.diskio inputs.kernel inputs.mem inputs.processes inputs.swap inputs.system inputs.cpu

and the output plugin is influxdb. This can be seen in the startup log:

2018/09/17 01:31:19 I! Using config file: /etc/telegraf/telegraf.conf
2018-09-17T01:31:19Z W! [outputs.influxdb] when writing to [http://localhost:8086]: database "telegraf" creation failed: Post http://localhost:8086/query: dial tcp 127.0.0.1:8086: connect: connection refused
2018-09-17T01:31:19Z I! Starting Telegraf v1.7.4
2018-09-17T01:31:19Z I! Loaded inputs: inputs.disk inputs.diskio inputs.kernel inputs.mem inputs.processes inputs.swap inputs.system inputs.cpu
2018-09-17T01:31:19Z I! Loaded aggregators:
2018-09-17T01:31:19Z I! Loaded processors:
2018-09-17T01:31:19Z I! Loaded outputs: influxdb
2018-09-17T01:31:19Z I! Tags enabled: host=appOne
2018-09-17T01:31:19Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"telegraf", Flush Interval:10s
2018-09-17T01:31:30Z E! [outputs.influxdb]: when writing to [http://localhost:8086]: Post http://localhost:8086/write?db=telegraf: dial tcp 127.0.0.1:8086

So anything can be changed or customized directly in /etc/telegraf/telegraf.conf, but the default file contains a lot of plugin sections you would have to comment out, so Telegraf also offers a concise way to generate a config file:

#telegraf --input-filter redis:cpu:mem:net:swap --output-filter influxdb:kafka config   // collect multiple metrics
#telegraf --input-filter redis --output-filter influxdb config   // collect a single metric

For example, to generate a redis.conf configuration:

#telegraf -sample-config -input-filter redis:mem -output-filter influxdb > redis.conf

The generated configuration looks like this:

###############################################################################
# INPUT PLUGINS #
###############################################################################

# Read metrics about memory usage
[[inputs.mem]]
# no configuration

[[inputs.redis]]
## specify servers via a url matching:
## [protocol://][:password]@address[:port]
## e.g.
## tcp://localhost:6379
## tcp://:password@192.168.99.100
## unix:///var/run/redis.sock
##
## If no servers are specified, then localhost is used as the host.
## If no port is specified, 6379 is used
servers = ["tcp://localhost:6379"]

###############################################################################
# OUTPUT PLUGINS #
###############################################################################

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
## The full HTTP or UDP URL for your InfluxDB instance.
##
## Multiple URLs can be specified for a single cluster, only ONE of the
## urls will be written to each interval.
# urls = ["unix:///var/run/influxdb.sock"]
# urls = ["udp://127.0.0.1:8089"]
# urls = ["http://127.0.0.1:8086"]

## The target database for metrics; will be created as needed.
# database = "telegraf"
# username = "telegraf"
# password = "metricsmetricsmetricsmetrics"

Then start Telegraf with this file as its config:

#telegraf --config /etc/telegraf/redis.conf

[root@telegraf ~]# telegraf --config /etc/telegraf/redis.conf
2018-09-17T02:43:08Z I! Starting Telegraf v1.7.4
2018-09-17T02:43:08Z I! Loaded inputs: inputs.redis inputs.mem
2018-09-17T02:43:08Z I! Loaded aggregators:
2018-09-17T02:43:08Z I! Loaded processors:
2018-09-17T02:43:08Z I! Loaded outputs: influxdb
2018-09-17T02:43:08Z I! Tags enabled: host=telegraf
2018-09-17T02:43:08Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"telegraf ", Flush Interval:10s

At this point InfluxDB starts receiving requests:

2018-09-17T02:43:08.060799Z info Executing query {"log_id": "0AaMBDO0000", "service": "query", "query": "CREATE DATABASE telegraf"}
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:08 +0000] "POST /query HTTP/1.1" 200 57 "-" "telegraf" 68dafe05-ba23-11e8-8001-000000000000 108642
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:20 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 7026fecd-ba23-11e8-8002-000000000000 595855
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:30 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 761ceb12-ba23-11e8-8003-000000000000 149522
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:40 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 7c12cd50-ba23-11e8-8004-000000000000 326783
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:50 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 820892ba-ba23-11e8-8005-000000000000 101009
[httpd] 127.0.0.1 - - [17/Sep/2018:02:44:00 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 87fe77d9-ba23-11e8-8006-000000000000 86017
[httpd] 127.0.0.1 - - [17/Sep/2018:02:44:10 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 8df464b0-ba23-11e8-8007-000000000000 85689

The collected data can then be queried with the influx client, which is very straightforward:

[root@influx ~]# influx
Connected to http://localhost:8086 version 1.6.2
InfluxDB shell version: 1.6.2
> show databases
name: databases
name
----
_internal
telegraf
> use telegraf
Using database telegraf
>
> show measurements
name: measurements
name
----
mem
redis

> select * from redis limit 1;
name: redis
time aof_current_rewrite_time_sec aof_enabled aof_last_bgrewrite_status aof_last_rewrite_time_sec aof_last_write_status aof_rewrite_in_progress aof_rewrite_scheduled blocked_clients client_biggest_input_buf client_longest_output_list clients cluster_enabled connected_slaves evicted_keys expired_keys host instantaneous_input_kbps instantaneous_ops_per_sec instantaneous_output_kbps keyspace_hitrate keyspace_hits keyspace_misses latest_fork_usec loading lru_clock master_repl_offset maxmemory maxmemory_policy mem_fragmentation_ratio migrate_cached_sockets port pubsub_channels pubsub_patterns rdb_bgsave_in_progress rdb_changes_since_last_save rdb_current_bgsave_time_sec rdb_last_bgsave_status rdb_last_bgsave_time_sec rdb_last_save_time rdb_last_save_time_elapsed redis_version rejected_connections repl_backlog_active repl_backlog_first_byte_offset repl_backlog_histlen repl_backlog_size replication_role server slave0 sync_full sync_partial_err sync_partial_ok total_commands_processed total_connections_received total_net_input_bytes total_net_output_bytes total_system_memory uptime used_cpu_sys used_cpu_sys_children used_cpu_user used_cpu_user_children used_memory used_memory_lua used_memory_peak used_memory_rss
—- —————————- ———– ————————- ————————- ——————— ———————– ——————— ————— ———————— ————————– ——- ————— —————- ———— ———— —- ———————— ————————- ————————- —————- ————- ————— —————- ——- ——— —————— ——— —————- ———————– ———————- —- ————— ————— ———————- ————————— ————————— ———————- ———————— —————— ————————– ————- ——————– ——————- —————————— ——————– —————– —————- —— —— ——— —————- ————— ———————— ————————– ——————— ———————- ——————- —— ———— ——————— ————- ———————- ———– ————— —————- —————
1537152190000000000 -1 0 ok -1 ok 0 0 0 0 0 41 1 1 0 778 telegraf 0.09 2 0.01 1 188 0 379 0 10425533 16473380 8000000000 allkeys-lru 1.17 0 7001 0 0 0 856 -1 ok 1 1530088772 7063418 3.2.8 0 1 15424805 1048576 1048576 master 10.224.91.231 ip=10.224.91.234,port=7001,state=online,offset=16473380,lag=1 2 0 0 19620365 1239692 500589135 885305642 33670017024 11549541 15528.8 0 8857.04 0 4476504 37888 5601248 5259264
>

select * from mem limit 1;
name: mem
time active available available_percent buffered cached free host inactive slab total used used_percent wired
—- —— ——— —————– ——– —— —- —- ——– —- —– —- ———— —–
1537152190000000000 771219456 7859949568 93.83099562612006 422666240 890130432 6547152896 telegraf 860303360 142872576 8376709120 516759552 6.169004373879942 0
>
>

grafana

1 install

Note that Grafana requires a 64-bit machine:

a. Create the Grafana package repository /etc/yum.repos.d/grafana.repo:

[grafana]
name=grafana
baseurl=https://packagecloud.io/grafana/stable/el/7/$basearch
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packagecloud.io/gpg.key https://grafanarel.s3.amazonaws.com/RPM-GPG-KEY-grafana
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

2. Install and start

$ sudo yum install grafana
$ sudo service grafana-server start

 

After startup, the default HTTP port is 3000 and the default user and group are admin.

To have it run at boot:

$ sudo /sbin/chkconfig --add grafana-server

3. Usage

a. Create a data source: many types are supported, e.g. the common ones such as InfluxDB, Elasticsearch and MySQL.

b. Create a dashboard: the key point is to select the data source created in step a and then build the various panels and graphs.

These two steps cover the basic workflow; alerts can then be defined on top of the charted data, which is not covered here.

ELKK