about Cassandra timeout issues

Over the past six months, the application's Cassandra timeout problems have become more pronounced: more types of errors, and more occurrences. Although on average they happen only once every three or four weeks, it is time to get to the bottom of them (and satisfy some curiosity). A database is generally treated as a black box, and since it belongs to the infrastructure layer its log output is deliberately kept sparse by design; both of these make troubleshooting much harder. So, organized by the types of errors we saw, the following records the investigation process and techniques. (All discussion below assumes the most common configuration: replication factor 3, read/write consistency level LOCAL_QUORUM.)

The errors fall into three types:

(1) Error returned by Cassandra: write timeout

Cassandra timeout during write query at consistency LOCAL_QUORUM (2 replica were required but only 1 acknowledged the write)

(2) Error returned by Cassandra: read timeout

Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica responded)

(3) Error returned by the client: timed out without receiving Cassandra's result.

java.util.concurrent.TimeoutException: Timeout waiting for task.

First, problem (1). The timeout means that acknowledgments were required from at least two of the three replicas written, but after waiting for the timeout (default 2000 ms) only one had replied. Why did the others not reply in time? Handling any task has two phases: waiting and execution. Setting aside genuinely slow execution (say, a CQL write carrying a very large payload), the likely cause is that tasks waited in the queue so long that they were dropped. The two factors also feed each other: one slow execution lengthens everyone else's wait. For Cassandra, executing a write is essentially writing the memtable plus appending the commit log. To confirm whether tasks were dropped at the time, grep the logs for the keyword shown after the source excerpt below (this excerpt is from org.apache.cassandra.service.AbstractWriteResponseHandler in the 3.x source, where the write timeout is thrown):


    public void get() throws WriteTimeoutException, WriteFailureException
    {
        long timeout = currentTimeout();

        boolean success;
        try
        {
            success = condition.await(timeout, TimeUnit.NANOSECONDS);
        }
        catch (InterruptedException ex)
        {
            throw new AssertionError(ex);
        }

        if (!success)
        {
            int blockedFor = totalBlockFor();
            int acks = ackCount();
            // It's pretty unlikely, but we can race between exiting await above and here, so
            // that we could now have enough acks. In that case, we "lie" on the acks count to
            // avoid sending confusing info to the user (see CASSANDRA-6491).
            if (acks >= blockedFor)
                acks = blockedFor - 1;
            throw new WriteTimeoutException(writeType, consistencyLevel, acks, blockedFor);
        }

        if (totalBlockFor() + failures > totalEndpoints())
        {
            throw new WriteFailureException(consistencyLevel, ackCount(), totalBlockFor(), writeType, failureReasonByEndpoint);
        }
    }

    public final long currentTimeout()
    {
        long requestTimeout = writeType == WriteType.COUNTER
                              ? DatabaseDescriptor.getCounterWriteRpcTimeout()
                              : DatabaseDescriptor.getWriteRpcTimeout();
        return TimeUnit.MILLISECONDS.toNanos(requestTimeout) - (System.nanoTime() - queryStartNanoTime);
    }

[root@cassandra001 logs]# zgrep -i 'messages were dropped'  system.log* |grep -i MUTATION

system.log.11.zip:INFO [ScheduledTasks:1] 2019-01-24 06:02:52,243  MessagingService.java:929 - MUTATION messages were dropped in last 5000 ms: 9641 for internal timeout and 0 for cross node timeout

Source analysis:

   org.apache.cassandra.net.MessageDeliveryTask:
   public void run()
    {
        MessagingService.Verb verb = message.verb;
 
        long timeTaken = message.getLifetimeInMS();  // ApproximateTime.currentTimeMillis() - constructionTime, i.e. how long the message waited: processing start time minus construction time
        if (MessagingService.DROPPABLE_VERBS.contains(verb) && timeTaken > getTimeout()) // different operation types have different timeouts; see MessagingService.Verb
        {
            MessagingService.instance().incrementDroppedMessages(message, timeTaken);
            return;
        }

Timeouts for the various operation types: org.apache.cassandra.net.MessagingService.Verb

   public enum Verb
    {
        MUTATION
        {
            public long getTimeout()
            {
                return DatabaseDescriptor.getWriteRpcTimeout(); //  public volatile long write_request_timeout_in_ms = 2000L;
            }
        },
        READ_REPAIR
        {
            public long getTimeout()
            {
                return DatabaseDescriptor.getWriteRpcTimeout(); //  public volatile long write_request_timeout_in_ms = 2000L;
            }
        },
        READ
        {
            public long getTimeout()
            {
                return DatabaseDescriptor.getReadRpcTimeout(); //  public volatile long read_request_timeout_in_ms = 5000L;

            }
        },

Investigation techniques:

1 Check GC impact
The first suspect for this kind of problem is GC. Besides what JMX already exposes, there are two more ways to check:

(1) nodetool gcstats: the "Max GC Elapsed (ms)" column shows the longest pause observed.

[root@cassandra001 ~]# nodetool gcstats
Interval (ms)  Max GC Elapsed (ms)  Total GC Elapsed (ms)  Stdev GC Elapsed (ms)  GC Reclaimed (MB)  Collections
9497938881     632                  11415528               28                     545351003897880    125586

(2) The GC log: it shows whether a GC happened around the time in question, and how long it took.


INFO  [Service Thread] 2019-01-16 06:03:31,977  GCInspector.java:258 - G1 Young Generation GC in 230ms.  G1 Eden Space: 9705619456 -> 0; G1 Old Gen: 3363831808 -> 3489807376; G1 Survivor Space: 595591168 -> 1291845632;

2 Check per-table latency statistics
Run:

 nodetool cfhistograms <keyspace> <table>
 e.g.: nodetool cfhistograms keyspace_xxx table_yyy

Result:

keyspace_xxx/table_yyy histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             0.00            372.00              0.00               642                12
75%             0.00           2759.00              0.00              2759                50
95%             0.00          20501.00              0.00             35425               642
98%             0.00          42510.00              0.00            105778              2299
99%             0.00          73457.00              0.00            263210              5722
Min             0.00              6.00              0.00               536                11
Max             0.00         545791.00              0.00          30130992            654949

3 Check disk performance

Refer to http://www.strolling.cn/2019/03/performance-turning-disk/ . (The iostat -x output below is shown without its header row; the columns are Device, rrqm/s, wrqm/s, r/s, w/s, rkB/s, wkB/s, avgrq-sz, avgqu-sz, await, r_await, w_await, svctm, %util.)

sdc               0.00    40.23    0.99    2.12   144.29   338.80   155.52     0.02    7.68    0.36   11.09   0.21   0.06
sdc1              0.00    40.23    0.99    2.12   144.29   338.80   155.52     0.02    7.68    0.36   11.09   0.21   0.06
sdb               0.00     0.00    0.00    0.00     0.00     0.00    24.89     0.00    0.14    0.14    0.00   0.14   0.00
sde               0.00    80.05    0.20    2.97    16.89   664.23   214.30     0.02    7.20    0.39    7.66   0.34   0.11
sde1              0.00    19.64    0.20    1.03    16.88   165.34   147.84     0.01    9.42    0.38   11.21   0.22   0.03
sde2              0.00    60.42    0.00    1.95     0.00   498.89   256.40     0.01    5.79    3.80    5.79   0.42   0.08
sdd               0.00    38.47    1.80    2.01   237.10   323.88   147.09     0.02    5.95    0.35   10.95   0.20   0.07
sdd1              0.00    38.47    1.80    2.01   237.10   323.88   147.09     0.02    5.95    0.35   10.95   0.20   0.07

4 Check the logs

Besides the dropped-message log line itself (shown above), whenever messages are dropped (MessagingService.logDroppedMessages) or a notable GC occurs (GCInspector.handleNotification(Notification, Object)), Cassandra logs a status snapshot of its thread pools, including the sizes of the various queues. In the snapshot below, note MutationStage with 32 active and 6686 pending tasks: a clear write backlog.

INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,276  StatusLogger.java:51 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,276  StatusLogger.java:66 - MutationStage                    32      6686     1613266420         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,276  StatusLogger.java:66 - ReadStage                         0         0     1261991952         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,276  StatusLogger.java:66 - RequestResponseStage              0         0     2755306099         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,276  StatusLogger.java:66 - ReadRepairStage                   0         0        3046997         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,276  StatusLogger.java:66 - CounterMutationStage              0         0              0         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,276  StatusLogger.java:66 - MiscStage                         0         0              0         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,277  StatusLogger.java:66 - HintedHandoff                     0         0           1268         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,277  StatusLogger.java:66 - GossipStage                       0         0       22851491         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,277  StatusLogger.java:66 - CacheCleanupExecutor              0         0              0         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,277  StatusLogger.java:66 - InternalResponseStage             0         0              0         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,277  StatusLogger.java:66 - CommitLogArchiver                 0         0          65246         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,277  StatusLogger.java:66 - ValidationExecutor                0         0         286603         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,277  StatusLogger.java:66 - MigrationStage                    0         0             70         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,277  StatusLogger.java:66 - CompactionExecutor                1         1       57541810         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,278  StatusLogger.java:66 - AntiEntropyStage                  1         1         873618         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,278  StatusLogger.java:66 - PendingRangeCalculator            0         0             83         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,278  StatusLogger.java:66 - Sampler                           0         0              0         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,278  StatusLogger.java:66 - MemtableFlushWriter               2         2         381132         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,278  StatusLogger.java:66 - MemtablePostFlush                 1         3         774796         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,278  StatusLogger.java:66 - MemtableReclaimMemory             0         0         486188         0                 0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,279  StatusLogger.java:75 - CompactionManager                 1         1
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,279  StatusLogger.java:87 - MessagingService                n/a       3/0
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,279  StatusLogger.java:97 - Cache Type                     Size                 Capacity               KeysToSave
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,279  StatusLogger.java:99 - KeyCache                  104857407                104857600                      all
INFO  [ScheduledTasks:1] 2019-02-07 06:03:30,279  StatusLogger.java:105 - RowCache                          0                        0                      all

Alternatively, run

 nodetool <options> tpstats 

for a similar view of the thread pools and their queues.

5 Check the network

Traffic: install iftop

wget  http://download-ib01.fedoraproject.org/pub/epel/6/x86_64/Packages/i/iftop-1.0-0.14.pre4.el6.x86_64.rpm

Interface details (here a bonded interface):

 cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 38:90:a5:14:9b:74
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 38:90:a5:14:9b:75
Slave queue ID: 0

6 Compare broadly

Especially when only some machines misbehave, OpsCenter makes side-by-side comparison easy; without it, you are left comparing node by node with the commands above.

Some puzzling cases explained:
1 Read requests have a 5 s timeout, so why do some of them time out after 2 s?

Because a read can trigger a read repair, especially when writes were already failing frequently; and because of the bug https://issues.apache.org/jira/browse/CASSANDRA-10726
the read then blocks waiting for the repair mutations, and that wait is bounded by the write timeout (the READ_REPAIR verb), i.e. 2 s.

2 Timeout waiting for task.
This error is rare; when it did occur, the root cause was traced to the network.

3 What is Cassandra's core write operation, i.e. the most likely source of performance bottlenecks?
Writing the memtable plus appending the commit log (the latter can be disabled); a toy sketch follows.
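
To make that concrete, here is a toy sketch (my own illustration, not Cassandra's actual code): durability comes from a sequential commit-log append, after which the "write" is only an in-memory sorted-map update, flushed to SSTables later. Anything that slows either step backs up the MutationStage queue shown earlier.

    import java.io.DataOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Toy illustration of the two steps on a Cassandra-style write path.
    class ToyWritePath {
        private final ConcurrentSkipListMap<String, String> memtable = new ConcurrentSkipListMap<>();
        private final DataOutputStream commitLog;

        ToyWritePath(File logFile) throws IOException {
            commitLog = new DataOutputStream(new FileOutputStream(logFile, true)); // append mode
        }

        void write(String key, String value) throws IOException {
            // step 1, durability: sequential append to the commit log (Cassandra allows disabling this)
            commitLog.writeUTF(key);
            commitLog.writeUTF(value);
            commitLog.flush();
            // step 2, the actual write: an in-memory update, flushed to disk as an SSTable later
            memtable.put(key, value);
        }
    }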

Summary:

Troubleshooting these problems is scattered and tedious; the essentials are to look for patterns, compare broadly, and check everything. Space does not allow more detail here.

performance tuning – linux

1 lsblk

[root@localhost]# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0   1.1T  0 disk 
├─sda1   8:1    0     1G  0 part /boot
├─sda2   8:2    0   100G  0 part /
├─sda3   8:3    0   450G  0 part /backup
├─sda4   8:4    0     1K  0 part 
├─sda5   8:5    0   100G  0 part /u00
├─sda6   8:6    0   100G  0 part /var
├─sda7   8:7    0    50G  0 part /tmp
└─sda8   8:8    0 315.7G  0 part /logs
sdc      8:32   0 893.1G  0 disk 
└─sdc1   8:33   0 893.1G  0 part /data1
sdb      8:16   0 893.1G  0 disk 
sde      8:64   0 893.1G  0 disk 
├─sde1   8:65   0   450G  0 part /data3
└─sde2   8:66   0 443.1G  0 part /commitlog
sdd      8:48   0 893.1G  0 disk 
└─sdd1   8:49   0 893.1G  0 part /data2

2 smartctl

[root@localhost ~]# smartctl -a /dev/sda
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.9.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               Cisco   
Product:              UCSC-MRAID12G   
Revision:             4.62
User Capacity:        1,198,999,470,080 bytes [1.19 TB]
Logical block size:   512 bytes
Logical Unit id:      0x6f80bcbeac41d2a023eec6c514c1ceb3
Serial number:        00b3cec114c5c6ee23a0d241acbebc80
Device type:          disk
Local Time is:        Fri Mar  1 02:47:43 2019 GMT
Device does not support SMART

3 iostat -x -p -t
-t prints a timestamp.
-p reports partitions as well (device names like dev0-8 are shown as sda1). (Output below is again shown without its header row.)

sdc               0.00    40.23    0.99    2.12   144.29   338.80   155.52     0.02    7.68    0.36   11.09   0.21   0.06
sdc1              0.00    40.23    0.99    2.12   144.29   338.80   155.52     0.02    7.68    0.36   11.09   0.21   0.06
sdb               0.00     0.00    0.00    0.00     0.00     0.00    24.89     0.00    0.14    0.14    0.00   0.14   0.00
sde               0.00    80.05    0.20    2.97    16.89   664.23   214.30     0.02    7.20    0.39    7.66   0.34   0.11
sde1              0.00    19.64    0.20    1.03    16.88   165.34   147.84     0.01    9.42    0.38   11.21   0.22   0.03
sde2              0.00    60.42    0.00    1.95     0.00   498.89   256.40     0.01    5.79    3.80    5.79   0.42   0.08
sdd               0.00    38.47    1.80    2.01   237.10   323.88   147.09     0.02    5.95    0.35   10.95   0.20   0.07
sdd1              0.00    38.47    1.80    2.01   237.10   323.88   147.09     0.02    5.95    0.35   10.95   0.20   0.07

4 Check filesystem types and mount options

[root@host ~]# cat /etc/fstab

UUID=74d82177-9ffd-4564-94d6-fc97cfdda575 / ext4 defaults 1 1
UUID=b6d58e19-7800-4284-96ae-615744a40f9d /backup ext4 defaults 1 2
UUID=465299be-d033-4e52-9359-125face600fc /boot ext2 defaults 1 2
UUID=cf442cbb-f33a-444d-be16-ae82a54a1802 /commitlog ext4 noatime 1 2
UUID=26c93b16-ce00-4f75-a17f-b56f7dcbd8ba /data1 ext4 noatime 1 2
UUID=65166b86-d9ac-41a8-be07-95c1e1d5e8d0 /data2 ext4 noatime 1 2
UUID=f1533b88-c494-41eb-8407-3951b5999486 /data3 ext4 noatime 1 2
UUID=c897ac33-7170-42d6-85c9-2d599e22dda0 /logs ext4 defaults 1 2
UUID=4cfed02e-31ea-4d09-970c-f906e3cc1c8a /tmp ext4 defaults 1 2
UUID=fbb944ee-6b21-4571-b1d8-2cc37e47538c /u00 ext4 defaults 1 2
UUID=fa2c170d-71b1-4707-a692-8a12c4d83706 /var ext4 defaults 1 2

faults on coding (to be continued)

I have made a scattered assortment of mistakes while coding, so I am writing this post to record them as they accumulate, in the hope of actually learning from them.

1 In a system that has long worked well, don't rush to fix an obvious bug. For example:

private int age;
public void setAge(int age){
 age = age;
}

Some features in fact work only because two mistakes cancel each other out; correct one of them and the behavior becomes wrong.
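
For reference, the intended fix is just the this-qualified assignment below; the point of this item is that in a long-running system even such a one-line fix deserves caution, because other code may have adapted to the broken behavior.

    private int age;
    public void setAge(int age) {
        this.age = age; // "this." is what the original code forgot: the parameter shadows the field
    }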

2 Thinking by habit, taking things for granted.

What does the following print? Is the last line true?

	Properties properties = new Properties();
	properties.put("key", true);

	String valueStr = properties.getProperty("key");
	System.out.println(valueStr);
	System.out.println(Boolean.TRUE.toString().equalsIgnoreCase(valueStr));

Source analysis:
What was stored is a Boolean, not a String, and getProperty only returns String values, so it returns null (and the comparison prints false):

	    public String getProperty(String key) {
	        Object oval = super.get(key);
	        String sval = (oval instanceof String) ? (String)oval : null;
	        return ((sval == null) && (defaults != null)) ? defaults.getProperty(key) : sval;
	    }
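
The fix, for illustration: store a String (setProperty enforces this at compile time), after which getProperty behaves as expected.

	Properties properties = new Properties();
	properties.setProperty("key", Boolean.TRUE.toString()); // stores the String "true"

	String valueStr = properties.getProperty("key");        // "true"
	System.out.println(Boolean.TRUE.toString().equalsIgnoreCase(valueStr)); // true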

3 Programming to an interface, and misusing methods:

The code below tries to cache values and load one automatically when a key is missing. What does it return?

The cache loader:

  		CacheLoader<String, String> loader = new CacheLoader<String, String>() {

			@Override
			public String load(String key) throws Exception {
 				return key + "'s value";
			}
		};

Usage:

		Cache<String, String> cache = CacheBuilder.newBuilder().maximumSize(10_000).expireAfterWrite(15, TimeUnit.MINUTES).build(loader);
  		String value = cache.getIfPresent("key1");

The correct version:

  		LoadingCache<String,String> cache = CacheBuilder.newBuilder().maximumSize(10_000).expireAfterWrite(15, TimeUnit.MINUTES).build(loader);
   		String value = cache.get("key1");

The problem: getIfPresent never invokes the loader, so it returns null for an absent key. The lesson is to read the API documentation carefully rather than assuming. Also, "program to an interface" does not necessarily mean the topmost interface: if you assign to the top-level Cache type right away, the set of methods offered is smaller (no loading get), and it is easy to pick one without thinking.
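
Incidentally, even the plain Cache interface can load on a miss if you hand get an explicit loader; a sketch (Java 8+; note this overload declares ExecutionException):

		String value = cache.get("key1", () -> "key1" + "'s value"); // computed and cached on a miss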

4 Assuming without thinking that a type is an enum

When checking whether a response carried a JSON body, I assumed MediaType.APPLICATION_JSON_TYPE must be an enum constant and compared with ==:

 MediaType.APPLICATION_JSON_TYPE == response.getMediaType();

The fix: not everything that feels like it should be an enum is actually defined as one, and until you are sure, equals is safer than ==:

MediaType.APPLICATION_JSON_TYPE.equals(response.getMediaType()) 

5 What kind of list does Arrays.asList return?

The code below compiles, but is there a problem?

		List<String> asList = Arrays.asList("123", "456");
		asList.removeIf(string -> string.equalsIgnoreCase("123"));

Result:

Exception in thread "main" java.lang.UnsupportedOperationException
	at java.util.AbstractList.remove(AbstractList.java:161)
	at java.util.AbstractList$Itr.remove(AbstractList.java:374)
	at java.util.Collection.removeIf(Collection.java:415)
	at com.github.metrics.Metric.main(Metric.java:29)

Source analysis:

    @SuppressWarnings("varargs")
    public static <T> List<T> asList(T... a) {
        return new ArrayList<>(a); // this returns java.util.Arrays.ArrayList<E>, a private nested class, not the ordinary java.util.ArrayList<E>
    }

and that class does not override the mutating methods, which fall through to AbstractList's defaults and throw:

    public E remove(int index) {
        throw new UnsupportedOperationException();
    }

Takeaway: a method you can call is not guaranteed to work, and a return type whose name looks familiar is not necessarily the class you assume.
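
The usual fix, for illustration: copy the fixed-size view into a real java.util.ArrayList before mutating it.

		List<String> list = new ArrayList<>(Arrays.asList("123", "456"));
		list.removeIf(s -> s.equalsIgnoreCase("123")); // fine: list is now [456]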

http protocol (1) – different types of timeout

We set all sorts of timeouts on http client 4.x in our project, yet still saw some requests take far longer than expected. Some of the slow requests were caused by the virtual machine or by GC; another group came down to slow DNS resolution. So it is worth going through every timeout involved in using http client, to get a correct, clear picture of the maximum time a request can take:

Within http client, a request roughly follows one of two main paths:

(1) obtain a connection from the pool -> send the HTTP request -> receive the HTTP response

(2) ask the pool for a connection, none is available but a new one may be created -> DNS resolution -> establish the connection (if the name resolves to several addresses and the first is unreachable, each is tried in turn) -> TLS handshake (assuming https) -> send the HTTP request -> receive the HTTP response

Breaking these two paths down, at least four timeouts (socket timeout, connect timeout, connection request timeout, DNS resolution timeout) bound the completion time of a single HTTP request.

(1) Maximum wait for a connection from the pool
Reusing pooled connections avoids the cost of constantly opening and closing them; when the pool is exhausted, you can bound how long to wait for a connection rather than waiting forever.

1.1 Error stack:

  org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
! at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) 
! at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) 

1.2 Source analysis:

   @Override
    public ConnectionRequest requestConnection(
            final HttpRoute route,
            final Object state) {
        Args.notNull(route, "HTTP route");
        if (this.log.isDebugEnabled()) {
            this.log.debug("Connection request: " + format(route, state) + formatStats(route));
        }
        final Future<CPoolEntry> future = this.pool.lease(route, state, null);  // lease a connection from the pool
        return new ConnectionRequest() {

            @Override
            public boolean cancel() {
                return future.cancel(true);
            }

            @Override
            public HttpClientConnection get(
                    final long timeout,
                    final TimeUnit tunit) throws InterruptedException, ExecutionException, ConnectionPoolTimeoutException {
                return leaseConnection(future, timeout, tunit);
            }

        };

    }
    protected HttpClientConnection leaseConnection(
            final Future<CPoolEntry> future,
            final long timeout,
            final TimeUnit tunit) throws InterruptedException, ExecutionException, ConnectionPoolTimeoutException {
        final CPoolEntry entry;
        try {
            entry = future.get(timeout, tunit);
            if (entry == null || future.isCancelled()) {
                throw new InterruptedException();
            }
            Asserts.check(entry.getConnection() != null, "Pool entry with no connection");
            if (this.log.isDebugEnabled()) {
                this.log.debug("Connection leased: " + format(entry) + formatStats(entry.getRoute()));
            }
            return CPoolProxy.newProxy(entry);
        } catch (final TimeoutException ex) {
            throw new ConnectionPoolTimeoutException("Timeout waiting for connection from pool");
        }
    }

1.3 How to set it:

	RequestConfig requestConfig = RequestConfig.custom()
			.setConnectionRequestTimeout(2 * 1000)
			.build();

(2) DNS resolution time

Connecting to a host name requires DNS resolution first. Resolution results are normally cached for a while (say five minutes), but never forever, so as not to stay pinned to one host; resolution therefore cannot be avoided. Yet apart from making the whole step asynchronous, http client provides no parameter at all to bound the resolution time.

2.1 Error stack:

Caused by: java.net.UnknownHostException: nebulaik.webex.com.cn: Temporary failure in name resolution
        at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
        at java.net.InetAddress.getAllByName(InetAddress.java:1192)
        at java.net.InetAddress.getAllByName(InetAddress.java:1126)
        at com.webex.dsagent.client.http.DSADnsResolver.resolve(DSADnsResolver.java:24)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:111)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)

2.2 Source analysis:

Most requests hitting this error took 15 seconds, which follows directly from the system resolver configuration (/etc/resolv.conf): timeout:5 × retries:3 = 15 s.

options rotate timeout:5 retries:3
nameserver 10.224.91.88
nameserver 10.224.91.99

You can diagnose this with the dig command, which provides two relevant options:

+time=T
Sets the timeout for a query to T seconds. The default timeout is 5 seconds. An attempt to set T to less than 1 will result in a query timeout of 1 second being applied.
+tries=T
Sets the number of times to try UDP queries to server to T instead of the default, 3. If T is less than or equal to zero, the number of tries is silently rounded up to 1.

Example:

[root@centos~]# time dig www.baidu.com +time=1

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.4 <<>> www.baidu.com +time=1
;; global options: +cmd
;; connection timed out; no servers could be reached

real  0m3.010s
user 0m0.003s
sys  0m0.007s
[root@centos~]# time dig www.baidu.com +time=5

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.4 <<>> www.baidu.com +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached

real  0m15.013s
user 0m0.006s
sys  0m0.005s

Another way to trace it, with a real case: resolving aaa.com.cn timed out.

dig +trace aaa.com.cn // example

dig +trace narrowed it down to:

aaa.com.cn. 600 IN CNAME bbb.com.cn.  // alias
bbb.com.cn. 120 IN NS ns1.dns.com.cn.
bbb.com.cn. 120 IN NS ns2.dns.com.cn.

ns1.dns.com.cn and ns2.dns.com.cn are the name servers for this domain, but an environment migration had left them unreachable. Once those two records were fixed, the timeouts stopped.

2.3 How to set it:

Either adjust the resolver configuration (options rotate timeout:5 retries:3), or make DNS resolution asynchronous, as sketched below.
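
A minimal sketch of the asynchronous option, assuming HttpClient 4.3+: implement org.apache.http.conn.DnsResolver, run the lookup in a Future, and bound the wait. The class name TimeBoundedDnsResolver and the timeout values are mine, for illustration only:

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import org.apache.http.conn.DnsResolver;

    public class TimeBoundedDnsResolver implements DnsResolver {
        private final ExecutorService executor = Executors.newCachedThreadPool();
        private final long timeoutMillis;

        public TimeBoundedDnsResolver(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
        }

        @Override
        public InetAddress[] resolve(final String host) throws UnknownHostException {
            Future<InetAddress[]> future = executor.submit(() -> InetAddress.getAllByName(host));
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // best effort: the native lookup may keep running in the background
                throw new UnknownHostException("DNS lookup for " + host + " exceeded " + timeoutMillis + " ms");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new UnknownHostException("DNS lookup for " + host + " interrupted");
            } catch (ExecutionException e) {
                throw new UnknownHostException("DNS lookup for " + host + " failed: " + e.getCause());
            }
        }
    }

Plug it in with HttpClients.custom().setDnsResolver(new TimeBoundedDnsResolver(2 * 1000)); note this caps only how long the caller waits, not the lookup itself.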

Appendix: a catalogue of DNS errors:
https://zhuanlan.zhihu.com/p/40659713

Aside: if a name resolves to several addresses and the first is unreachable, they are tried one by one until one succeeds or the last one fails, which is another reason some requests take strikingly long. Source analysis:

org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(ManagedHttpClientConnection, HttpHost, InetSocketAddress, int, SocketConfig, HttpContext)

   @Override
    public void connect(
            final ManagedHttpClientConnection conn,
            final HttpHost host,
            final InetSocketAddress localAddress,
            final int connectTimeout,
            final SocketConfig socketConfig,
            final HttpContext context) throws IOException {
        final Lookup<ConnectionSocketFactory> registry = getSocketFactoryRegistry(context);
        final ConnectionSocketFactory sf = registry.lookup(host.getSchemeName());
        if (sf == null) {
            throw new UnsupportedSchemeException(host.getSchemeName() +
                    " protocol is not supported");
        }
        final InetAddress[] addresses = host.getAddress() != null ?
                new InetAddress[] { host.getAddress() } : this.dnsResolver.resolve(host.getHostName());
        final int port = this.schemePortResolver.resolve(host);
        for (int i = 0; i < addresses.length; i++) {
            final InetAddress address = addresses[i];
            final boolean last = i == addresses.length - 1;

            Socket sock = sf.createSocket(context);
            sock.setSoTimeout(socketConfig.getSoTimeout());
            sock.setReuseAddress(socketConfig.isSoReuseAddress());
            sock.setTcpNoDelay(socketConfig.isTcpNoDelay());
            sock.setKeepAlive(socketConfig.isSoKeepAlive());
            final int linger = socketConfig.getSoLinger();
            if (linger >= 0) {
                sock.setSoLinger(true, linger);
            }
            conn.bind(sock);

            final InetSocketAddress remoteAddress = new InetSocketAddress(address, port);
            if (this.log.isDebugEnabled()) {
                this.log.debug("Connecting to " + remoteAddress);
            }
            try {
                sock = sf.connectSocket(
                        connectTimeout, sock, host, remoteAddress, localAddress, context);
                conn.bind(sock);
                if (this.log.isDebugEnabled()) {
                    this.log.debug("Connection established " + conn);
                }
                return;
            } catch (final SocketTimeoutException ex) {
                if (last) {
                    throw new ConnectTimeoutException(ex, host, addresses);
                }
            } catch (final ConnectException ex) {
                if (last) {
                    final String msg = ex.getMessage();
                    if ("Connection timed out".equals(msg)) {
                        throw new ConnectTimeoutException(ex, host, addresses);
                    } else {
                        throw new HttpHostConnectException(ex, host, addresses);
                    }
                }
            } catch (final NoRouteToHostException ex) {
                if (last) {
                    throw ex;
                }
            }
            if (this.log.isDebugEnabled()) {
                this.log.debug("Connect to " + remoteAddress + " timed out. " +
                        "Connection will be retried using another IP address");
            }
        }
    }

(3) Connection establishment time

3.1 Error stack:

Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to demosite.yy.zz:443 failed: connect timed out
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:150)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.jboss.resteasy.client.jaxrs.engines.ApacheHttpClient4Engine.invoke(ApacheHttpClient4Engine.java:312)
        ... 69 more
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:337)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141)
        ... 79 more

3.2 Source analysis:

    void socketConnect(InetAddress address, int port, int timeout)
        throws IOException {
        int nativefd = checkAndReturnNativeFD();

        int connectResult;
        if (timeout <= 0) {  // no timeout set: block forever
            connectResult = connect0(nativefd, address, port);
        } else {  // timeout set: wait at most timeout
            configureBlocking(nativefd, false);  // switch to non-blocking mode first
            try {
                connectResult = connect0(nativefd, address, port); // initiate the connect
                if (connectResult == WOULDBLOCK) {
                    waitForConnect(nativefd, timeout); // wait at most timeout for the connect to complete
                }
            } finally {
                configureBlocking(nativefd, true);  // restore blocking mode
            }
        }
    }

3.3 How to set it:

  		RequestConfig requestConfig = RequestConfig.custom()
  				.setConnectTimeout(2 * 1000)
  				.build();

(4) Data processing time

After the request is sent, waiting for the response takes time; that is the data processing time. Note that the TLS handshake performed once the connection is established also involves several request/response exchanges, and this timeout applies to those as well.

4.1 Error stack:

Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
        at sun.security.ssl.InputRecord.read(InputRecord.java:503)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
        at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
        at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
        at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
        at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:155)
        at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:284)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
        at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:84)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)

4.2 Source analysis:

After sending the request, HttpClient immediately blocks reading the response:

    public HttpResponse execute(
            final HttpRequest request,
            final HttpClientConnection conn,
            final HttpContext context) throws IOException, HttpException {
        try {
            HttpResponse response = doSendRequest(request, conn, context);
            if (response == null) {
                response = doReceiveResponse(request, conn, context); // block reading the response
            }
            return response;
        }
        ...
    }
    /**
     *  Enable/disable {@link SocketOptions#SO_TIMEOUT SO_TIMEOUT}
     *  with the specified timeout, in milliseconds. With this option set
     *  to a non-zero timeout, a read() call on the InputStream associated with
     *  this Socket will block for only this amount of time.  If the timeout
     *  expires, a <B>java.net.SocketTimeoutException</B> is raised, though the
     *  Socket is still valid. The option <B>must</B> be enabled
     *  prior to entering the blocking operation to have effect. The
     *  timeout must be {@code > 0}.
     *  A timeout of zero is interpreted as an infinite timeout.
     *
     * @param timeout the specified timeout, in milliseconds.
     * @exception SocketException if there is an error
     * in the underlying protocol, such as a TCP error.
     * @since   JDK 1.1
     * @see #getSoTimeout()
     */
    public synchronized void setSoTimeout(int timeout) throws SocketException {  // only affects reads
        if (isClosed())
            throw new SocketException("Socket is closed");
        if (timeout < 0)
          throw new IllegalArgumentException("timeout can't be negative");

        getImpl().setOption(SocketOptions.SO_TIMEOUT, new Integer(timeout));
    }
    /**
     * Reads into an array of bytes at the specified offset using
     * the received socket primitive.
     * @param fd the FileDescriptor
     * @param b the buffer into which the data is read
     * @param off the start offset of the data
     * @param len the maximum number of bytes read
     * @param timeout the read timeout in ms
     * @return the actual number of bytes read, -1 is
     *          returned when the end of the stream is reached.
     * @exception IOException If an I/O error has occurred.
     */
    private native int socketRead0(FileDescriptor fd,
                                   byte b[], int off, int len,
                                   int timeout) // a blocking read that waits up to timeout ms

4.3 How to set it:

  	RequestConfig requestConfig = RequestConfig.custom()
  			.setSocketTimeout(requestTimeoutInSecond * 1000)
  			.build();

If so_timeout is not set while the connect timeout from (3) above is, then so_timeout is set to the connect timeout for the connection setup phase:

org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(int, Socket, HttpHost, InetSocketAddress, InetSocketAddress, HttpContext)
            if (connectTimeout > 0 && sock.getSoTimeout() == 0) {
                sock.setSoTimeout(connectTimeout);
            }

Summary:
Besides the timeout parameters above, which directly bound processing time, retry mechanisms are the other major factor, and retries roughly multiply the total; I won't belabor them here. Seen from this angle, the time a request takes is governed by:
(1) GC stop-the-world pauses
(2) virtual machine effects
(3) application- and system-level timeout parameters
(4) application- and system-level retry counts

As the above shows, unless the request pipeline is made fully asynchronous, you have to keep all of these parameters in mind to say with confidence how long a request can take at most. The downside of going asynchronous is management complexity and the loss of back-pressure, among other things, so weigh the trade-offs.
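
Putting the pieces together, a consolidated sketch for HttpClient 4.x (the numeric values, and the TimeBoundedDnsResolver from 2.3 above, are illustrative):

	RequestConfig requestConfig = RequestConfig.custom()
			.setConnectionRequestTimeout(2 * 1000) // (1) max wait for a pooled connection
			.setConnectTimeout(2 * 1000)           // (3) max wait for the TCP connect
			.setSocketTimeout(5 * 1000)            // (4) max wait per read, incl. TLS handshake steps
			.build();

	CloseableHttpClient client = HttpClients.custom()
			.setDefaultRequestConfig(requestConfig)
			.setDnsResolver(new TimeBoundedDnsResolver(2 * 1000)) // (2) DNS, see 2.3
			.build();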

metric driven (6) – common arch solutions

TIG (Telegraf + InfluxDB + Grafana):

telegraf

1 install

1.1 create /etc/yum.repos.d/influxdb.repo:

[influxdb]
name = InfluxDB Repository - RHEL \$releasever
baseurl = https://repos.influxdata.com/rhel/\$releasever/\$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key

1.2 sudo yum install telegraf

1.3 startup

sudo service telegraf start
Or if your operating system is using systemd (CentOS 7+, RHEL 7+):
sudo systemctl start telegraf

2 Config:

The default config file is /etc/telegraf/telegraf.conf (see also https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf). telegraf is organized around input, processor/aggregator, and output plugins.

With no changes at all, telegraf therefore collects the following by default:

inputs.disk inputs.diskio inputs.kernel inputs.mem inputs.processes inputs.swap inputs.system inputs.cpu

and writes to the influxdb output. The startup log confirms this:

2018/09/17 01:31:19 I! Using config file: /etc/telegraf/telegraf.conf
2018-09-17T01:31:19Z W! [outputs.influxdb] when writing to [http://localhost:8086]: database "telegraf" creation failed: Post http://localhost:8086/query: dial tcp 127.0.0.1:8086: connect: connection refused
2018-09-17T01:31:19Z I! Starting Telegraf v1.7.4
2018-09-17T01:31:19Z I! Loaded inputs: inputs.disk inputs.diskio inputs.kernel inputs.mem inputs.processes inputs.swap inputs.system inputs.cpu
2018-09-17T01:31:19Z I! Loaded aggregators:
2018-09-17T01:31:19Z I! Loaded processors:
2018-09-17T01:31:19Z I! Loaded outputs: influxdb
2018-09-17T01:31:19Z I! Tags enabled: host=appOne
2018-09-17T01:31:19Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"telegraf", Flush Interval:10s
2018-09-17T01:31:30Z E! [outputs.influxdb]: when writing to [http://localhost:8086]: Post http://localhost:8086/write?db=telegraf: dial tcp 127.0.0.1:8086

To customize, you can of course edit /etc/telegraf/telegraf.conf directly, but the default file is cluttered with plugin sections you would have to comment out, so telegraf also provides a concise way to generate a config file:

#telegraf --input-filter redis:cpu:mem:net:swap --output-filter influxdb:kafka config // several inputs and outputs
#telegraf --input-filter redis --output-filter influxdb config // a single input

For example, to generate a redis.conf:

#telegraf -sample-config -input-filter redis:mem -output-filter influxdb > redis.conf

The generated config looks like this:

###############################################################################
# INPUT PLUGINS #
###############################################################################

# Read metrics about memory usage
[[inputs.mem]]
# no configuration

[[inputs.redis]]
## specify servers via a url matching:
## [protocol://][:password]@address[:port]
## e.g.
## tcp://localhost:6379
## tcp://:password@192.168.99.100
## unix:///var/run/redis.sock
##
## If no servers are specified, then localhost is used as the host.
## If no port is specified, 6379 is used
servers = ["tcp://localhost:6379"]

###############################################################################
# OUTPUT PLUGINS #
###############################################################################

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
## The full HTTP or UDP URL for your InfluxDB instance.
##
## Multiple URLs can be specified for a single cluster, only ONE of the
## urls will be written to each interval.
# urls = ["unix:///var/run/influxdb.sock"]
# urls = ["udp://127.0.0.1:8089"]
# urls = ["http://127.0.0.1:8086"]

## The target database for metrics; will be created as needed.
# database = "telegraf"
# username = "telegraf"
# password = "metricsmetricsmetricsmetrics"

Then start telegraf with this file as its config:

#telegraf --config /etc/telegraf/redis.conf

[root@telegraf ~]# telegraf --config /etc/telegraf/redis.conf
2018-09-17T02:43:08Z I! Starting Telegraf v1.7.4
2018-09-17T02:43:08Z I! Loaded inputs: inputs.redis inputs.mem
2018-09-17T02:43:08Z I! Loaded aggregators:
2018-09-17T02:43:08Z I! Loaded processors:
2018-09-17T02:43:08Z I! Loaded outputs: influxdb
2018-09-17T02:43:08Z I! Tags enabled: host=telegraf
2018-09-17T02:43:08Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"telegraf", Flush Interval:10s

At this point InfluxDB starts receiving requests:

2018-09-17T02:43:08.060799Z info Executing query {"log_id": "0AaMBDO0000", "service": "query", "query": "CREATE DATABASE telegraf"}
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:08 +0000] "POST /query HTTP/1.1" 200 57 "-" "telegraf" 68dafe05-ba23-11e8-8001-000000000000 108642
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:20 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 7026fecd-ba23-11e8-8002-000000000000 595855
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:30 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 761ceb12-ba23-11e8-8003-000000000000 149522
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:40 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 7c12cd50-ba23-11e8-8004-000000000000 326783
[httpd] 127.0.0.1 - - [17/Sep/2018:02:43:50 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 820892ba-ba23-11e8-8005-000000000000 101009
[httpd] 127.0.0.1 - - [17/Sep/2018:02:44:00 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 87fe77d9-ba23-11e8-8006-000000000000 86017
[httpd] 127.0.0.1 - - [17/Sep/2018:02:44:10 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 8df464b0-ba23-11e8-8007-000000000000 85689

The collected data can then be queried with the influx client; simple and convenient:

[root@influx ~]# influx
Connected to http://localhost:8086 version 1.6.2
InfluxDB shell version: 1.6.2
> show databases
name: databases
name
----
_internal
telegraf
> use telegraf
Using database telegraf
>
> show measurements
name: measurements
name
----
mem
redis

> select * from redis limit 1;
name: redis
(one row with roughly 70 columns: time, the aof_* and rdb_* persistence fields, client/connection counters, instantaneous_* throughput, keyspace hit/miss statistics, replication fields such as master_repl_offset and repl_backlog_*, used_memory*, used_cpu_*, uptime, etc.; the full wide row is elided here for readability)
>

select * from mem limit 1;
name: mem
time                active    available  available_percent buffered  cached    free       host     inactive  slab      total      used      used_percent      wired
----                ------    ---------  ----------------- --------  ------    ----       ----     --------  ----      -----      ----      ------------      -----
1537152190000000000 771219456 7859949568 93.83099562612006 422666240 890130432 6547152896 telegraf 860303360 142872576 8376709120 516759552 6.169004373879942 0
>
>

grafana

1 install

Note: a 64-bit machine is required.

a. Create the grafana repo file /etc/yum.repos.d/grafana.repo:

[grafana]
name=grafana
baseurl=https://packagecloud.io/grafana/stable/el/7/$basearch
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packagecloud.io/gpg.key https://grafanarel.s3.amazonaws.com/RPM-GPG-KEY-grafana
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

2. Install and start

$ sudo yum install grafana
$ sudo service grafana-server start

After startup, the default HTTP port is 3000, and the default user and group are admin.

Add it to the list of services started at boot:

$ sudo /sbin/chkconfig --add grafana-server

3. Usage

a. Create a data source: many kinds are supported, e.g. the common InfluxDB, Elasticsearch, MySQL, and so on.

b. Create a dashboard: the key is to select the data source created in step a, then plot the charts you need.

These two steps complete the basic setup; you can then define alerts on the plotted data, which I won't detail here.

ELKK