faults on coding (to be continued)

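Over time I have made a scattering of mistakes while coding, so I am keeping this running summary to record them and learn from them.

1 In a system that has been working well for a long time, do not casually fix an obvious error. For example: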

private int age;
public void setAge(int age){
 age = age;
}

Some features only work because one error cancels out another; correct just one of them and the overall result becomes wrong.
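To make the point concrete, here is a minimal sketch of why "fixing" the setter above is a behavior change. The class name AgeHolder and the two setter names are made up for illustration. The self-assignment leaves the field at its default value, so callers that have lived with that for years suddenly see a different value once it is corrected.

```java
// Hypothetical illustration: the parameter shadows the field,
// so the buggy setter never touches the field.
class AgeHolder {
    private int age;

    // Buggy: assigns the parameter to itself; the field keeps its old value.
    public void setAgeBuggy(int age) {
        age = age;
    }

    // Fixed: "this.age" refers to the field.
    public void setAgeFixed(int age) {
        this.age = age;
    }

    public int getAge() {
        return age;
    }

    public static void main(String[] args) {
        AgeHolder holder = new AgeHolder();
        holder.setAgeBuggy(30);
        System.out.println(holder.getAge()); // still the default value
        holder.setAgeFixed(30);
        System.out.println(holder.getAge()); // now the value that was passed in
    }
}
```

Before applying the fix in a long-running system, it is worth checking who reads the field and what they have come to expect.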

2 Taking things for granted.

What does the following print? Is it true?

	Properties properties = new Properties();
        properties.put("key",true); 
        
	String valueStr = properties.getProperty("key");
	System.out.println(valueStr);
	System.out.println(Boolean.TRUE.toString().equalsIgnoreCase(valueStr));

Source analysis:
The value stored is a Boolean, not a String, so getProperty returns null; the snippet prints null and then false:

	    public String getProperty(String key) {
	        Object oval = super.get(key);
	        String sval = (oval instanceof String) ? (String)oval : null;
	        return ((sval == null) && (defaults != null)) ? defaults.getProperty(key) : sval;
	    }
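A minimal sketch of the difference, using only java.util.Properties (the class and method names are made up for illustration): put stores whatever object it is given, while setProperty only accepts a String, which is what getProperty expects to find.

```java
import java.util.Properties;

class PropertiesDemo {

    // put() is inherited from Hashtable and stores a Boolean here.
    static String viaPut() {
        Properties properties = new Properties();
        properties.put("key", true);
        return properties.getProperty("key"); // value is not a String, so null
    }

    // setProperty() only accepts Strings, so getProperty() can read the value back.
    static String viaSetProperty() {
        Properties properties = new Properties();
        properties.setProperty("key", String.valueOf(true));
        return properties.getProperty("key");
    }

    public static void main(String[] args) {
        System.out.println(viaPut());
        System.out.println(viaSetProperty());
    }
}
```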

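3 Programming to an interface and calling the wrong method:

The code below tries to cache values and to load one automatically when a key has no value yet. What does it return?

The cache loader: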

  		CacheLoader<String, String> loader = new CacheLoader<String, String>() {

			@Override
			public String load(String key) throws Exception {
 				return key + "'s value";
			}
		};

Usage:

		Cache<String, String> cache = CacheBuilder.newBuilder().maximumSize(10_000).expireAfterWrite(15, TimeUnit.MINUTES).build(loader);
  		String value = cache.getIfPresent("key1");

The correct version:

  		LoadingCache<String,String> cache = CacheBuilder.newBuilder().maximumSize(10_000).expireAfterWrite(15, TimeUnit.MINUTES).build(loader);
   		String value = cache.get("key1");

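The lesson: read the API documentation carefully and do not assume. Also, programming to an interface does not have to mean programming to the topmost interface. If you assign to the top-level interface right away, fewer methods are available afterwards, and it is easy to pick one without thinking. Here getIfPresent never invokes the loader and returns null on a miss, while the single-argument get(key) that loads on a miss is declared on LoadingCache.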
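The same "look up only" versus "load on miss" distinction exists in the JDK, which makes it easy to try without Guava. The sketch below is a stand-in, not Guava's API: it uses ConcurrentHashMap, where get plays the role of getIfPresent and computeIfAbsent plays the role of LoadingCache.get. The class and method names are made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class LoadOnMissDemo {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Like getIfPresent: never loads, returns null when the key is absent.
    String lookupOnly(String key) {
        return cache.get(key);
    }

    // Like LoadingCache.get: computes and stores the value when the key is absent.
    String loadOnMiss(String key) {
        return cache.computeIfAbsent(key, k -> k + "'s value");
    }

    public static void main(String[] args) {
        LoadOnMissDemo demo = new LoadOnMissDemo();
        System.out.println(demo.lookupOnly("key1")); // nothing loaded yet
        System.out.println(demo.loadOnMiss("key1")); // loads and caches
        System.out.println(demo.lookupOnly("key1")); // now present
    }
}
```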

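4 Assuming without thinking that a type is an enum

When checking whether a response has a JSON body, I assumed MediaType.APPLICATION_JSON_TYPE had to be an enum constant, so I compared with == directly: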

 MediaType.APPLICATION_JSON_TYPE == response.getMediaType();

Fix: not everything that feels like it should be an enum is actually defined as one. And until you know for sure, equals is the safer choice over ==:

MediaType.APPLICATION_JSON_TYPE.equals(response.getMediaType()) 
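The effect can be reproduced with a small stand-in class. ContentKind below is hypothetical and is not the JAX-RS MediaType; it only mimics the relevant shape: a public constant plus a parser that creates new instances. An instance parsed from a header is a different object from the constant, so == fails even though the two are equal.

```java
import java.util.Objects;

// Hypothetical stand-in for a "constant-like" class that is not an enum.
final class ContentKind {
    static final ContentKind JSON = new ContentKind("application", "json");

    private final String type;
    private final String subtype;

    ContentKind(String type, String subtype) {
        this.type = type;
        this.subtype = subtype;
    }

    // Parses "type/subtype" into a new instance, as a header parser would.
    static ContentKind parse(String value) {
        String[] parts = value.split("/", 2);
        return new ContentKind(parts[0], parts[1]);
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof ContentKind)) {
            return false;
        }
        ContentKind that = (ContentKind) other;
        return type.equals(that.type) && subtype.equals(that.subtype);
    }

    @Override
    public int hashCode() {
        return Objects.hash(type, subtype);
    }

    public static void main(String[] args) {
        ContentKind parsed = ContentKind.parse("application/json");
        System.out.println(JSON == parsed);      // identity comparison
        System.out.println(JSON.equals(parsed)); // value comparison
    }
}
```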

5 What kind of list does Arrays.asList return?

The following code compiles, but is there a problem?

		List<String> asList = Arrays.asList("123", "456");
		asList.removeIf(string -> string.equalsIgnoreCase("123"));

Result:

Exception in thread "main" java.lang.UnsupportedOperationException
	at java.util.AbstractList.remove(AbstractList.java:161)
	at java.util.AbstractList$Itr.remove(AbstractList.java:374)
	at java.util.Collection.removeIf(Collection.java:415)
	at com.github.metrics.Metric.main(Metric.java:29)

Source analysis:

    @SuppressWarnings("varargs")
    public static <T> List<T> asList(T... a) {
        return new ArrayList<>(a); // this is the nested class java.util.Arrays.ArrayList<E>, not the regular java.util.ArrayList<E>
    }

This list type does not override some operations; it inherits defaults from AbstractList that simply throw:

    public E remove(int index) {
        throw new UnsupportedOperationException();
    }

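Takeaway: a method you can call is not necessarily one that works, and two types with the same simple name are not necessarily the same class.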
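One common way around this, sketched below with made-up class and method names, is to copy the fixed-size view into a regular java.util.ArrayList before modifying it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AsListDemo {

    // Copies the fixed-size view into a resizable list, then removes matches.
    static List<String> removeMatching(String unwanted, String... values) {
        List<String> copy = new ArrayList<>(Arrays.asList(values));
        copy.removeIf(value -> value.equalsIgnoreCase(unwanted));
        return copy;
    }

    // Shows that structural removal on the Arrays.asList view is rejected.
    static boolean fixedSizeListRejectsRemoval() {
        List<String> fixed = Arrays.asList("123", "456");
        try {
            fixed.removeIf(value -> value.equalsIgnoreCase("123"));
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(removeMatching("123", "123", "456"));
        System.out.println(fixedSizeListRejectsRemoval());
    }
}
```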

http protocol (1) – different types of timeout

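The project configures various timeouts on HttpClient 4.x, yet some requests still took far longer than expected. Investigation showed that some were caused by the virtual machine or by GC, and some long requests were caused by slow DNS resolution. It is therefore worth going through the timeout settings involved in using HttpClient, to get a correct and clear picture of the maximum time a request can take.

Looking at HttpClient, an HTTP request follows roughly one of two paths:

(1) Lease a connection from the pool -> send the HTTP request -> receive the HTTP response

(2) Try to lease a connection from the pool, find none available, but be allowed to create a new one -> DNS resolution -> establish the connection -> TLS handshake (assuming HTTPS) -> send the HTTP request -> receive the HTTP response.

Breaking down these two paths, at least four kinds of timeout (socket timeout, connect timeout, connection request timeout, DNS resolve timeout) affect how long an HTTP request takes.

(1) Maximum wait for a connection from the pool
Reusing pooled connections avoids constantly opening and closing connections and improves efficiency. When the pool is exhausted, a request can wait a bounded time for a connection instead of waiting forever.

1.1 Error stack: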

  org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
! at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) 
! at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) 

1.2 Source analysis:

   @Override
    public ConnectionRequest requestConnection(
            final HttpRoute route,
            final Object state) {
        Args.notNull(route, "HTTP route");
        if (this.log.isDebugEnabled()) {
            this.log.debug("Connection request: " + format(route, state) + formatStats(route));
        }
        final Future<CPoolEntry> future = this.pool.lease(route, state, null);  // lease a connection from the pool
        return new ConnectionRequest() {

            @Override
            public boolean cancel() {
                return future.cancel(true);
            }

            @Override
            public HttpClientConnection get(
                    final long timeout,
                    final TimeUnit tunit) throws InterruptedException, ExecutionException, ConnectionPoolTimeoutException {
                return leaseConnection(future, timeout, tunit);
            }

        };

    }
    protected HttpClientConnection leaseConnection(
            final Future<CPoolEntry> future,
            final long timeout,
            final TimeUnit tunit) throws InterruptedException, ExecutionException, ConnectionPoolTimeoutException {
        final CPoolEntry entry;
        try {
            entry = future.get(timeout, tunit);
            if (entry == null || future.isCancelled()) {
                throw new InterruptedException();
            }
            Asserts.check(entry.getConnection() != null, "Pool entry with no connection");
            if (this.log.isDebugEnabled()) {
                this.log.debug("Connection leased: " + format(entry) + formatStats(entry.getRoute()));
            }
            return CPoolProxy.newProxy(entry);
        } catch (final TimeoutException ex) {
            throw new ConnectionPoolTimeoutException("Timeout waiting for connection from pool");
        }
    }

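1.3 How to set it:

	RequestConfig requestConfig = RequestConfig.custom()
			.setConnectionRequestTimeout(2 * 1000) // max wait for a pooled connection, in milliseconds
			// ... other settings
			.build();

(2) DNS resolution time

Before a connection to a host name can be established, the name has to be resolved. DNS results are usually cached for a while, for example 5 minutes, but never forever, to avoid being pinned to a single host, so DNS resolution cannot be avoided. However, there is no timeout parameter to set for DNS resolution; the only way to control its duration is to make the process asynchronous yourself.

2.1 Error stack: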

Caused by: java.net.UnknownHostException: nebulaik.webex.com.cn: Temporary failure in name resolution
        at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
        at java.net.InetAddress.getAllByName(InetAddress.java:1192)
        at java.net.InetAddress.getAllByName(InetAddress.java:1126)
        at com.webex.dsagent.client.http.DSADnsResolver.resolve(DSADnsResolver.java:24)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:111)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)

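2.2 Analysis:

Most requests that hit this error took 15 seconds (a 5-second timeout, tried 3 times). The cause is the system configuration (/etc/resolv.conf):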

options rotate timeout:5 retries:3
nameserver 10.224.91.88
nameserver 10.224.91.99

To diagnose, use the dig command, which provides two relevant options:

+time=T
Sets the timeout for a query to T seconds. The default timeout is 5 seconds. An attempt to set T to less than 1 will result in a query timeout of 1 second being applied.
+tries=T
Sets the number of times to try UDP queries to server to T instead of the default, 3. If T is less than or equal to zero, the number of tries is silently rounded up to 1.

Example:

[root@centos~]# time dig www.baidu.com +time=1

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.4 <<>> www.baidu.com +time=1
;; global options: +cmd
;; connection timed out; no servers could be reached

real  0m3.010s
user 0m0.003s
sys  0m0.007s
[root@centos~]# time dig www.baidu.com +time=5

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.4 <<>> www.baidu.com +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached

real  0m15.013s
user 0m0.006s
sys  0m0.005s

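Another way to trace the problem, with a real case: resolving aaa.com.cn timed out.

dig +trace aaa.com.cn //example

dig +trace narrowed it down to:

aaa.com.cn. 600 IN CNAME bbb.com.cn.  //alias
bbb.com.cn. 120 IN NS ns1.dns.com.cn.
bbb.com.cn. 120 IN NS ns2.dns.com.cn.

ns1.dns.com.cn and ns2.dns.com.cn are responsible for resolving this name, but they became unreachable after an environment migration. Once these two records were updated, the timeouts stopped.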

2.3 How to set it:

Either change the configuration file (options rotate timeout:5 retries:3) or make DNS resolution asynchronous.
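As a rough sketch of the asynchronous option, the lookup can run on a separate thread while the caller waits only for a bounded time. The class and method names below are made up for illustration; this is not how HttpClient resolves names and not the DSADnsResolver from the stack trace above. Note that a timed-out native lookup may keep running in the background; the caller merely stops waiting for it.

```java
import java.net.InetAddress;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedDnsResolver {
    // Daemon threads so that a stuck lookup cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "dns-resolver");
        thread.setDaemon(true);
        return thread;
    });

    // Resolves host, giving up after timeoutMillis; empty means "timed out".
    static Optional<InetAddress[]> resolve(String host, long timeoutMillis) {
        return callWithTimeout(() -> InetAddress.getAllByName(host), timeoutMillis);
    }

    // Runs any lookup on the pool and waits at most timeoutMillis for its result.
    static <T> Optional<T> callWithTimeout(Callable<T> lookup, long timeoutMillis) {
        Future<T> future = POOL.submit(lookup);
        try {
            return Optional.of(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // best effort; a native lookup may still finish later
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while resolving", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("lookup failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("127.0.0.1", 1000).isPresent());
    }
}
```

The price is an extra thread per in-flight lookup and the need to decide what the caller does on timeout (fail fast, use a cached address, and so on).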

A list of common DNS errors:
https://zhuanlan.zhihu.com/p/40659713

(3) Connection establishment time

3.1 Error stack:

Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to demosite.yy.zz:443 failed: connect timed out
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:150)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.jboss.resteasy.client.jaxrs.engines.ApacheHttpClient4Engine.invoke(ApacheHttpClient4Engine.java:312)
        ... 69 more
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:337)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141)
        ... 79 more

3.2 Source analysis:

    void socketConnect(InetAddress address, int port, int timeout)
        throws IOException {
        int nativefd = checkAndReturnNativeFD();
 
        int connectResult;
        if (timeout <= 0) {  // no timeout set: block without a time limit
            connectResult = connect0(nativefd, address, port);
        } else {  // timeout set: wait at most timeout milliseconds
            configureBlocking(nativefd, false);  // switch to non-blocking mode first
            try {
                connectResult = connect0(nativefd, address, port); // start the connect
                if (connectResult == WOULDBLOCK) { 
                    waitForConnect(nativefd, timeout); // block for at most timeout milliseconds
                }
            } finally {
                configureBlocking(nativefd, true);  // switch back to blocking mode
            }
            }
        }
 
    }
 

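3.3 How to set it:

  		RequestConfig requestConfig = RequestConfig.custom()
 				.setConnectTimeout(2 * 1000) // max time to establish the connection, in milliseconds
 				// ... other settings
 				.build();

(4) Data processing time:

After the request has been sent, waiting for the response takes some time; this is the data processing time. In addition, the TLS handshake that follows connection establishment also involves several request/response round trips, and this timeout applies to those exchanges as well.

4.1 Error stack: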

Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
        at sun.security.ssl.InputRecord.read(InputRecord.java:503)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
        at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
        at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
        at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
        at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:155)
        at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:284)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
        at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)

4.2 Source analysis:

Right after sending the request, HttpClient blocks reading the response:

    public HttpResponse execute(
            final HttpRequest request,
            final HttpClientConnection conn,
            final HttpContext context) throws IOException, HttpException {
        try {
            HttpResponse response = doSendRequest(request, conn, context);
            if (response == null) {
                response = doReceiveResponse(request, conn, context); // blocks reading the response
            }
            return response;
        }
    /**
     *  Enable/disable {@link SocketOptions#SO_TIMEOUT SO_TIMEOUT}
     *  with the specified timeout, in milliseconds. With this option set
     *  to a non-zero timeout, a read() call on the InputStream associated with
     *  this Socket will block for only this amount of time.  If the timeout
     *  expires, a <B>java.net.SocketTimeoutException</B> is raised, though the
     *  Socket is still valid. The option <B>must</B> be enabled
     *  prior to entering the blocking operation to have effect. The
     *  timeout must be {@code > 0}.
     *  A timeout of zero is interpreted as an infinite timeout.
     *
     * @param timeout the specified timeout, in milliseconds.
     * @exception SocketException if there is an error
     * in the underlying protocol, such as a TCP error.
     * @since   JDK 1.1
     * @see #getSoTimeout()
     */
    public synchronized void setSoTimeout(int timeout) throws SocketException {  // applies to reads only
        if (isClosed())
            throw new SocketException("Socket is closed");
        if (timeout < 0)
          throw new IllegalArgumentException("timeout can't be negative");

        getImpl().setOption(SocketOptions.SO_TIMEOUT, new Integer(timeout));
    }
    /**
     * Reads into an array of bytes at the specified offset using
     * the received socket primitive.
     * @param fd the FileDescriptor
     * @param b the buffer into which the data is read
     * @param off the start offset of the data
     * @param len the maximum number of bytes read
     * @param timeout the read timeout in ms
     * @return the actual number of bytes read, -1 is
     *          returned when the end of the stream is reached.
     * @exception IOException If an I/O error has occurred.
     */
    private native int socketRead0(FileDescriptor fd,
                                   byte b[], int off, int len,
                                   int timeout) // blocks reading for at most timeout milliseconds

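4.3 How to set it:

  	RequestConfig requestConfig = RequestConfig.custom()
 			.setSocketTimeout(requestTimeoutInSecond * 1000) // max wait for data on the socket, in milliseconds
 			// ... other settings
 			.build();

If so_timeout is not set but the connect timeout from (3) above is, then so_timeout = connect timeout: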

org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(int, Socket, HttpHost, InetSocketAddress, InetSocketAddress, HttpContext)
            if (connectTimeout > 0 && sock.getSoTimeout() == 0) {
                sock.setSoTimeout(connectTimeout);
            }
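The two socket-level timeouts from (3) and (4) can be observed with plain java.net.Socket, without HttpClient. The sketch below is only an illustration with made-up names: it connects to a local server socket that never sends anything, so the connect succeeds within the connect timeout and the subsequent read fails with SocketTimeoutException after the SO_TIMEOUT elapses.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SocketTimeoutDemo {

    // Returns true if reading from a silent local server times out as expected.
    static boolean readTimesOut(int connectTimeoutMillis, int soTimeoutMillis) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            // Connect timeout: bounds only the TCP connection establishment.
            client.connect(new InetSocketAddress(loopback, silentServer.getLocalPort()),
                    connectTimeoutMillis);
            // SO_TIMEOUT: bounds each blocking read on the socket.
            client.setSoTimeout(soTimeoutMillis);
            try {
                client.getInputStream().read(); // the server never writes
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readTimesOut(1000, 200));
    }
}
```

Because SO_TIMEOUT applies to each read rather than to the whole response, a server that trickles data slowly can keep a request alive much longer than the configured value.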

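Summary:
Besides the timeout parameters above, which directly affect processing time, retry mechanisms are another factor, and each retry essentially multiplies the time; they are not covered in detail here. From this angle, the duration of a request is mainly affected by:
(1) Stop-the-world (STW) pauses from GC
(2) The virtual machine
(3) Timeout parameters at the application and system layers
(4) Retry counts at the application and system layers

As shown above, unless a request is made fully asynchronous, you need to know all of these parameters well to state with confidence how long a request can take at most. Asynchrony has its own drawbacks, such as management complexity and the lack of negative feedback, so the trade-offs need to be weighed.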
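As a back-of-the-envelope illustration of how the pieces add up, the sketch below sums the per-attempt timeouts and multiplies by the number of attempts. It is a deliberately simplified model with made-up names: it ignores GC pauses and virtual machine stalls, and it treats the socket timeout as a bound on the whole response, although it only bounds each individual read. The 10-second socket timeout and the single retry are assumed values; the other numbers come from the examples above.

```java
class RequestTimeBudget {

    // Rough upper bound in milliseconds for one request including retries.
    static long worstCaseMillis(long poolWaitMillis, long dnsMillis, long connectMillis,
                                long socketTimeoutMillis, int retries) {
        long perAttempt = poolWaitMillis + dnsMillis + connectMillis + socketTimeoutMillis;
        return perAttempt * (1L + retries);
    }

    public static void main(String[] args) {
        // 2 s pool wait, 15 s DNS (5 s x 3), 2 s connect, 10 s socket timeout (assumed), 1 retry (assumed)
        System.out.println(worstCaseMillis(2_000, 15_000, 2_000, 10_000, 1) + " ms");
    }
}
```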

metric driven (6) – common arch solutions

TIG:

telegraf

1 install

1.1 create /etc/yum.repos.d/influxdb.repo:

[influxdb]
name = InfluxDB Repository - RHEL $releasever
baseurl = https://repos.influxdata.com/rhel/$releasever/$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key

1.2 sudo yum install telegraf

1.3 startup

sudo service telegraf start
Or if your operating system is using systemd (CentOS 7+, RHEL 7+):
sudo systemctl start telegraf

2 Config:

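The default configuration file is /etc/telegraf/telegraf.conf; a reference copy is available at https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf. Telegraf is organized around input, processing, and output plugins.

With an unmodified default configuration, telegraf collects the following:

inputs.disk inputs.diskio inputs.kernel inputs.mem inputs.processes inputs.swap inputs.system inputs.cpu

The output goes to influxdb. This can be seen in the startup log: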

2018/09/17 01:31:19 I! Using config file: /etc/telegraf/telegraf.conf
2018-09-17T01:31:19Z W! [outputs.influxdb] when writing to [http://localhost:8086]: database "telegraf" creation failed: Post http://localhost:8086/query: dial tcp 127.0.0.1:8086: connect: connection refused
2018-09-17T01:31:19Z I! Starting Telegraf v1.7.4
2018-09-17T01:31:19Z I! Loaded inputs: inputs.disk inputs.diskio inputs.kernel inputs.mem inputs.processes inputs.swap inputs.system inputs.cpu
2018-09-17T01:31:19Z I! Loaded aggregators:
2018-09-17T01:31:19Z I! Loaded processors:
2018-09-17T01:31:19Z I! Loaded outputs: influxdb
2018-09-17T01:31:19Z I! Tags enabled: host=appOne
2018-09-17T01:31:19Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"telegraf", Flush Interval:10s
2018-09-17T01:31:30Z E! [outputs.influxdb]: when writing to [http://localhost:8086]: Post http://localhost:8086/write?db=telegraf: dial tcp 127.0.0.1:8086

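So to modify or customize the behavior you can edit /etc/telegraf/telegraf.conf directly. However, the default file carries many redundant plugin entries that you would have to comment out, so telegraf provides a concise way to generate a configuration file.

#telegraf --input-filter redis:cpu:mem:net:swap --output-filter influxdb:kafka config //collect several inputs
#telegraf --input-filter redis --output-filter influxdb config //collect a single input

For example, to generate a redis.conf configuration:

#telegraf -sample-config -input-filter redis:mem -output-filter influxdb > redis.conf

The generated configuration looks like this: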

###############################################################################
# INPUT PLUGINS #
###############################################################################

# Read metrics about memory usage
[[inputs.mem]]
# no configuration

[[inputs.redis]]
## specify servers via a url matching:
## [protocol://][:password]@address[:port]
## e.g.
## tcp://localhost:6379
## tcp://:password@192.168.99.100
## unix:///var/run/redis.sock
##
## If no servers are specified, then localhost is used as the host.
## If no port is specified, 6379 is used
servers = ["tcp://localhost:6379"]

###############################################################################
# OUTPUT PLUGINS #
###############################################################################

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
## The full HTTP or UDP URL for your InfluxDB instance.
##
## Multiple URLs can be specified for a single cluster, only ONE of the
## urls will be written to each interval.
# urls = ["unix:///var/run/influxdb.sock"]
# urls = ["udp://127.0.0.1:8089"]
# urls = ["http://127.0.0.1:8086"]

## The target database for metrics; will be created as needed.
# database = "telegraf"
# username = "telegraf"
# password = "metricsmetricsmetricsmetrics"

Then start telegraf with this file as its configuration:

#telegraf --config /etc/telegraf/redis.conf

[root@telegraf ~]# telegraf --config /etc/telegraf/redis.conf
2018-09-17T02:43:08Z I! Starting Telegraf v1.7.4
2018-09-17T02:43:08Z I! Loaded inputs: inputs.redis inputs.mem
2018-09-17T02:43:08Z I! Loaded aggregators:
2018-09-17T02:43:08Z I! Loaded processors:
2018-09-17T02:43:08Z I! Loaded outputs: influxdb
2018-09-17T02:43:08Z I! Tags enabled: host=telegraf
2018-09-17T02:43:08Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"telegraf", Flush Interval:10s

At this point influxdb receives the requests:

2018-09-17T02:43:08.060799Z info Executing query {"log_id": "0AaMBDO0000", "service": "query", "query": "CREATE DATABASE telegraf"}
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:08 +0000] "POST /query HTTP/1.1" 200 57 "-" "telegraf" 68dafe05-ba23-11e8-8001-000000000000 108642
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:20 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 7026fecd-ba23-11e8-8002-000000000000 595855
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:30 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 761ceb12-ba23-11e8-8003-000000000000 149522
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:40 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 7c12cd50-ba23-11e8-8004-000000000000 326783
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:50 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 820892ba-ba23-11e8-8005-000000000000 101009
[httpd] 127.0.0.1 - - [17/Sep/2018:02:44:00 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 87fe77d9-ba23-11e8-8006-000000000000 86017
[httpd] 127.0.0.1 - - [17/Sep/2018:02:44:10 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 8df464b0-ba23-11e8-8007-000000000000 85689
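The body of each POST /write?db=telegraf request is InfluxDB line protocol: a measurement name, optional comma-separated tags, a space, one or more fields, another space, and a timestamp in nanoseconds. A minimal sketch of formatting one point follows. The class and method names are made up, only float fields are handled, and the escaping of spaces and commas that real data requires is left out.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

class LineProtocol {

    // Builds "measurement,tag=value field=value timestamp"; fields must not be empty.
    static String point(String measurement, Map<String, String> tags,
                        Map<String, Double> fields, long timestampNanos) {
        StringBuilder line = new StringBuilder(measurement);
        // Sorted tags give a stable, reproducible output.
        for (Map.Entry<String, String> tag : new TreeMap<>(tags).entrySet()) {
            line.append(',').append(tag.getKey()).append('=').append(tag.getValue());
        }
        StringJoiner fieldSet = new StringJoiner(",");
        for (Map.Entry<String, Double> field : new TreeMap<>(fields).entrySet()) {
            fieldSet.add(field.getKey() + "=" + field.getValue());
        }
        return line.append(' ').append(fieldSet).append(' ').append(timestampNanos).toString();
    }

    public static void main(String[] args) {
        Map<String, String> tags = new TreeMap<>();
        tags.put("host", "telegraf");
        Map<String, Double> fields = new TreeMap<>();
        fields.put("used_percent", 6.5);
        System.out.println(point("mem", tags, fields, 1537152190000000000L));
    }
}
```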

The collected data can then be queried with the influxdb client, which is simple and convenient:

[root@influx ~]# influx
Connected to http://localhost:8086 version 1.6.2
InfluxDB shell version: 1.6.2
> show databases
name: databases
name
----
_internal
telegraf
> use telegraf
Using database telegraf
>
> show measurements
name: measurements
name
----
mem
redis

> select * from redis limit 1;
name: redis
time aof_current_rewrite_time_sec aof_enabled aof_last_bgrewrite_status aof_last_rewrite_time_sec aof_last_write_status aof_rewrite_in_progress aof_rewrite_scheduled blocked_clients client_biggest_input_buf client_longest_output_list clients cluster_enabled connected_slaves evicted_keys expired_keys host instantaneous_input_kbps instantaneous_ops_per_sec instantaneous_output_kbps keyspace_hitrate keyspace_hits keyspace_misses latest_fork_usec loading lru_clock master_repl_offset maxmemory maxmemory_policy mem_fragmentation_ratio migrate_cached_sockets port pubsub_channels pubsub_patterns rdb_bgsave_in_progress rdb_changes_since_last_save rdb_current_bgsave_time_sec rdb_last_bgsave_status rdb_last_bgsave_time_sec rdb_last_save_time rdb_last_save_time_elapsed redis_version rejected_connections repl_backlog_active repl_backlog_first_byte_offset repl_backlog_histlen repl_backlog_size replication_role server slave0 sync_full sync_partial_err sync_partial_ok total_commands_processed total_connections_received total_net_input_bytes total_net_output_bytes total_system_memory uptime used_cpu_sys used_cpu_sys_children used_cpu_user used_cpu_user_children used_memory used_memory_lua used_memory_peak used_memory_rss
—- —————————- ———– ————————- ————————- ——————— ———————– ——————— ————— ———————— ————————– ——- ————— —————- ———— ———— —- ———————— ————————- ————————- —————- ————- ————— —————- ——- ——— —————— ——— —————- ———————– ———————- —- ————— ————— ———————- ————————— ————————— ———————- ———————— —————— ————————– ————- ——————– ——————- —————————— ——————– —————– —————- —— —— ——— —————- ————— ———————— ————————– ——————— ———————- ——————- —— ———— ——————— ————- ———————- ———– ————— —————- —————
1537152190000000000 -1 0 ok -1 ok 0 0 0 0 0 41 1 1 0 778 telegraf 0.09 2 0.01 1 188 0 379 0 10425533 16473380 8000000000 allkeys-lru 1.17 0 7001 0 0 0 856 -1 ok 1 1530088772 7063418 3.2.8 0 1 15424805 1048576 1048576 master 10.224.91.231 ip=10.224.91.234,port=7001,state=online,offset=16473380,lag=1 2 0 0 19620365 1239692 500589135 885305642 33670017024 11549541 15528.8 0 8857.04 0 4476504 37888 5601248 5259264
>

select * from mem limit 1;
name: mem
time active available available_percent buffered cached free host inactive slab total used used_percent wired
---- ------ --------- ----------------- -------- ------ ---- ---- -------- ---- ----- ---- ------------ -----
1537152190000000000 771219456 7859949568 93.83099562612006 422666240 890130432 6547152896 telegraf 860303360 142872576 8376709120 516759552 6.169004373879942 0
>
>

grafana

1 install

Note that installation requires a 64-bit machine:

a. Create the grafana repository file /etc/yum.repos.d/grafana.repo

[grafana]
name=grafana
baseurl=https://packagecloud.io/grafana/stable/el/7/$basearch
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packagecloud.io/gpg.key https://grafanarel.s3.amazonaws.com/RPM-GPG-KEY-grafana
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

2. Install and start

$ sudo yum install grafana
$ sudo service grafana-server start

 

After startup, the default HTTP port is 3000, and the default user and group are admin.

To add it to the services started at boot:

$ sudo /sbin/chkconfig --add grafana-server

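3. Usage

a Create a data source: many kinds are supported, for example the common influxdb, Elasticsearch, and MySQL.

b Create a dashboard: the key point is to select the data source created in step a and then draw the graphs you need.

These two steps cover the basics. Alerts can then be created on top of the plotted data, which is not covered here.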

ELKK

common performance tool and solution

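For a long time I wanted to extract the common part of the performance tests I keep running (the load control part) into a shared component (a jar). On the one hand this lets developers and testers concentrate on writing the performance test cases themselves; on the other hand, using one standard and one implementation helps standardization within the team.

In practice most people now use jmeter directly to control load, and that works too. But how jmeter actually controls the load is hard to understand without reading its code closely, and when a case needs to call Java code you also have to learn BeanShell and the like. In short, it is not flexible enough and not transparent enough. After trying it for a while I decided I would rather build my own: more freedom, more general, and more controllable. The result is:

https://github.com/jiafu1115/performance-test-tool

Straight to how to use it (basic usage):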

compile exec:java -Dexec.mainClass="com.test.performance.PerfTool" -Dexec.args="-t com.test.performance.demo.DemoTestCaseImpl -duration 20 -thread 5 -tps 30"

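(1) Three control parameters: 1) how long to run: -duration 20; 2) how many threads to use: -thread 5; 3) the target TPS: -tps 30. In practice you can specify only the thread count and let each thread send in a loop, or specify only the TPS without a thread count and let the tool try to reach it.
(2) Two plug-in points: 1) the test case implementation class: -t com.test.performance.demo.DemoTestCaseImpl; 2) the result collector class: -r com.test.performance.result.impl.InfluxdbCollectMethodImpl, or one you provide yourself.
(3) Three pieces of run information: 1) -program MyProgramName; 2) -testname TestWebService; 3) -runid ThisRunId.
(4) Four test case hooks: 1) before test; 2) after test; 3) prepare environment; 4) destroy environment.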
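To give a feel for what "controlling the load" means, here is a minimal single-threaded pacing sketch. It is not the implementation in performance-test-tool; the class name, the method signature, and the use of ScheduledExecutorService are all assumptions made for illustration. It calls a test case at a fixed rate derived from the target TPS until the call budget (TPS times duration) is used up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class FixedRateLoad {

    // Calls testCase about tps times per second for durationSeconds seconds
    // and returns how many calls finished without throwing.
    static int run(Runnable testCase, int tps, int durationSeconds) {
        if (tps <= 0 || durationSeconds <= 0) {
            throw new IllegalArgumentException("tps and durationSeconds must be positive");
        }
        int totalCalls = tps * durationSeconds;
        long periodNanos = TimeUnit.SECONDS.toNanos(1) / tps;
        AtomicInteger started = new AtomicInteger();
        AtomicInteger succeeded = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(totalCalls);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        scheduler.scheduleAtFixedRate(() -> {
            if (started.incrementAndGet() > totalCalls) {
                return; // call budget used up; ignore ticks that fire before shutdown
            }
            try {
                testCase.run();
                succeeded.incrementAndGet();
            } catch (RuntimeException e) {
                // a failed call still counts towards the budget
            } finally {
                done.countDown();
            }
        }, 0, periodNanos, TimeUnit.NANOSECONDS);

        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return succeeded.get();
    }

    public static void main(String[] args) {
        int ok = run(() -> { /* call the system under test here */ }, 30, 2);
        System.out.println("successful calls: " + ok);
    }
}
```

With a single scheduler thread, a test case slower than the period delays the following calls, so the achieved TPS drops below the target; that is the situation in which a thread count such as -thread 5 becomes necessary.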

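That completes load control on a single machine. The tool ships with influxdb-based result collection and log output to choose from, so users only need to focus on implementing test cases and collecting results.

On top of single-machine load control, three more things are needed:

(1) Concurrency control: a jenkins multi-configuration project can drive several machines concurrently.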

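(2) Result collection: influxdb or similar can collect the results. The performance of the machine under test should be collected as well: deploy collectd on it and send the data to influxdb, so that the results contain two parts, performance test data and system performance.
(3) Result analysis: grafana can display the results directly, and for server-side data collection collectd + grafana works.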

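Results should cover at least three dimensions:
(1) Performance data from the test side: TPS, response time (distribution), success rate
(2) System performance of the machine under test: CPU, memory, IO, etc.
(3) Performance data of the application under test: TPS, response time (distribution), success rate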
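The three numbers named in (1) and (3) can be computed from raw samples as sketched below. The class name is made up, and the nearest-rank percentile is just one common definition of "response time distribution"; other tools may interpolate differently.

```java
import java.util.Arrays;

class LatencyStats {

    // Transactions per second over the measured window.
    static double tps(int requests, double durationSeconds) {
        return requests / durationSeconds;
    }

    // Fraction of requests that succeeded, between 0.0 and 1.0.
    static double successRate(int succeeded, int total) {
        return total == 0 ? 0.0 : (double) succeeded / total;
    }

    // Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it.
    static long percentile(long[] latenciesMillis, double p) {
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] samples = {12, 15, 11, 200, 14, 13, 16, 18, 17, 19};
        System.out.println("TPS: " + tps(300, 20.0));
        System.out.println("success rate: " + successRate(9, 10));
        System.out.println("p50: " + percentile(samples, 50) + " ms");
        System.out.println("p90: " + percentile(samples, 90) + " ms");
    }
}
```

Reporting percentiles alongside the average matters because a single slow outlier (the 200 ms sample here) barely moves the median but dominates the maximum.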

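Summary: by factoring out what varies, the common problem is solved once, and performance testers only need to care about their own test cases and about collecting and presenting the results, which makes things much easier.

Appendix:

1 Installing the components:

1.1 Installing influxdb: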

wget https://dl.influxdata.com/influxdb/releases/influxdb-1.5.0.x86_64.rpm
sudo yum localinstall influxdb-1.5.0.x86_64.rpm
service influxdb start

Versions after 1.1 no longer have a web UI, so do not bother looking for it.

1.2 Installing grafana

wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-5.0.3-1.x86_64.rpm
sudo yum localinstall grafana-5.0.3-1.x86_64.rpm
service grafana-server start

1.3 Installing collectd

wget http://mirrors.163.com/.help/CentOS6-Base-163.repo
yum install epel-release
yum install collectd
service collectd start

2. Configuring the components:

2.1 Configuration needed for influxdb + collectd to collect system information:

influxdb configuration:

Enable collectd data collection:

[[collectd]]
    enabled = true
    bind-address = ":25826"
    database = "collectd"

Startup then fails with an error about /usr/share/collectd/types.db.
Installing collectd on the influxdb host as well fixes this.

2.2 collectd configuration: point the server at influxdb

Hostname "10.224.82.92"
 
Interval 2
ReadThreads 5

LoadPlugin cpu
LoadPlugin load
LoadPlugin memory
LoadPlugin swap
LoadPlugin battery

LoadPlugin network
<Plugin "network">
Server "10.224.2.147" "25826"
</Plugin>