Troubleshooting a Clock Synchronization Failure on a Ceph Node: A Summary
Jave
Symptoms
A Ceph mon node reports a clock synchronization problem:
$ sudo /var/lib/ceph/bin/ceph -s
  cluster:
    id:     3fe6c651-2a0c-4f15-851b-7215536897eb
    health: HEALTH_WARN
            clock skew detected on mon.c
$ sudo /var/lib/ceph/bin/ceph health detail
HEALTH_WARN clock skew detected on mon.c
MON_CLOCK_SKEW clock skew detected on mon.c
    mon.c clock skew 0.274578s > max 0.15s (latency 0.000160198s)
The cluster is configured with a maximum allowed clock skew of 0.15s:
$ cat /var/lib/ceph/etc/ceph/ceph.conf | grep mon_clock_drift_allowed
mon_clock_drift_allowed = 0.15
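The comparison behind this warning can be reproduced from the command output itself. The following is a minimal sketch rather than part of the original troubleshooting: `skewed_mons` is a hypothetical helper name, and it assumes the detail line keeps the `mon.X clock skew Ns > max Ms` layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ceph health detail` output on stdin and prints
# one line per monitor whose reported skew exceeds the reported maximum.
skewed_mons() {
  awk '
    # Match detail lines such as:
    #   mon.c clock skew 0.274578s > max 0.15s (latency 0.000160198s)
    $2 == "clock" && $3 == "skew" && $5 == ">" && $6 == "max" {
      skew = $4; max = $7
      # Strip the trailing "s" unit so the values compare numerically.
      sub(/s$/, "", skew); sub(/s$/, "", max)
      if (skew + 0 > max + 0)
        printf "%s skew=%s max=%s\n", $1, skew, max
    }'
}
```

It could be fed directly from the command used above, for example `sudo /var/lib/ceph/bin/ceph health detail | skewed_mons`.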
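The date +%s.%N command shows the precise system time. Checking with it confirmed that the warning was not a false alarm: the clock really was out of sync.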
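Comparing two `date +%s.%N` readings by hand is error-prone, so the arithmetic can be sketched as follows. `clock_offset` is a hypothetical helper, and its default limit of 0.15 is simply the `mon_clock_drift_allowed` value from this cluster. Because awk uses double-precision floats, the result is only accurate to roughly a microsecond, which is ample for a 0.15s threshold.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clock_offset LOCAL REMOTE [MAX]
# LOCAL and REMOTE are `date +%s.%N` readings taken on two nodes.
# Prints REMOTE - LOCAL in seconds and returns 1 when the absolute
# difference exceeds MAX (default 0.15).
clock_offset() {
  awk -v a="$1" -v b="$2" -v max="${3:-0.15}" 'BEGIN {
    d = b - a
    printf "%.6f\n", d
    if (d < 0) d = -d
    if (d > max + 0) exit 1
    exit 0
  }'
}
```

How the two readings are collected is left open here. If they are taken over a remote shell, the round-trip delay is added to the apparent offset, so the result is only a rough confirmation.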
The cluster nodes synchronize their clocks with chronyd, and every node is configured to use the first mon node as its time source:
$ cat /etc/chrony.conf
driftfile /var/lib/chrony/drift
rtcsync
local stratum 10
#default
server 10.127.15.182 minpoll 0 maxpoll 0
logdir /var/log/chrony
log measurements statistics tracking
Diagnosis
1. Check the clock synchronization status
$ sudo chronyc sources -v
210 Number of sources = 1

  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| /   '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.      |          |  yyyy = measured offset,
||                                \     |          |  zzzz = estimated error.
||                                 |    |           \
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? 10.127.15.182                 0   0     0     -     +0ns[   +0ns] +/-    0ns
The source is marked ^?, i.e. unreachable: the time source cannot be reached.
2. Try to synchronize the time manually
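On a node with several sources, the state column is easy to misread by eye. The sketch below filters it mechanically. `unreachable_sources` is a hypothetical helper, and it assumes the plain `chronyc sources` table layout, where each source row begins with a mode character followed by a state character.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `chronyc sources` output on stdin and prints the
# name or address of every source whose state column is "?" (unreachable).
unreachable_sources() {
  awk '
    # Source rows start with a mode character (^, = or #) followed by the
    # state character; legend, header and separator rows do not match.
    /^[=#^][?] / { print $2 }
  '
}
```

With the output shown in this step, `sudo chronyc sources | unreachable_sources` would be expected to print only 10.127.15.182.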
$ sudo chronyc -a makestep
200 OK
$ sudo chronyc sources -v
210 Number of sources = 1

  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| /   '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.      |          |  yyyy = measured offset,
||                                \     |          |  zzzz = estimated error.
||                                 |    |           \
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? 10.127.15.182                 0   0     0     -     +0ns[   +0ns] +/-    0ns
No effect: the clock still fails to synchronize.
3. Check network connectivity
$ ping 10.127.15.182
PING 10.127.15.182 (10.127.15.182) 56(84) bytes of data.
64 bytes from 10.127.15.182: icmp_seq=1 ttl=64 time=0.011 ms
64 bytes from 10.127.15.182: icmp_seq=2 ttl=64 time=0.013 ms
64 bytes from 10.127.15.182: icmp_seq=3 ttl=64 time=0.017 ms
^C
--- 10.127.15.182 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.011/0.013/0.017/0.002 ms
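The ping succeeds, so network connectivity is provisionally judged to be normal (10.127.15.182 is the floating address of the time source).
4. Check the firewall configuration
The nodes are on the same subnet, so check the operating system firewall configuration: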
$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
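The configuration shows nothing abnormal.
5. Capture packets to narrow down the cause of the unreachable state
On the time source, try to capture NTP packets coming from the faulty node. None are captured, so the NTP synchronization packets are not reaching the time source: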
$ sudo tcpdump -nn -i bond0.3530 udp and host 10.127.15.156 and port 123
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0.3530, link-type EN10MB (Ethernet), capture size 262144 bytes
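Running the same capture on the faulty node also captures nothing. Combined with the successful ping in step 3, this is strange: the NTP packets never reach the node's own network interface. Next, start a continuous ping from the faulty node to the time source address 10.127.15.182 and capture on the time source: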
$ sudo tcpdump -nn -i bond0.3530 icmp and host 10.127.15.182
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0.3530, link-type EN10MB (Ethernet), capture size 262144 bytes
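Again nothing is captured. The ICMP packets are not reaching the time source, yet the ping succeeds, so the packets must be going to the wrong host.
6. Determine where packets from the faulty node to the time source actually go
The ARP table has no MAC address entry for the time source: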
$ ip neigh show | grep 10.127.15.182
# or
$ arp -n | grep 10.127.15.182
The routing table shows that the route resolves to the loopback interface lo:
$ ip route get 10.127.15.182
local 10.127.15.182 dev lo src 10.127.15.182 uid 1003
    cache
$ ip route show table local
xxxxxx
local 10.127.15.156 dev bond0.3530 proto kernel scope host src 10.127.15.156
local 10.127.15.182 dev bond0.3530 proto kernel scope host src 10.127.15.182
xxxxxx
$ ip a
xxxxxx
15: bond0.3530@bond0: mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 8c:2a:8e:57:5c:d5 brd ff:ff:ff:ff:ff:ff
    inet 10.127.15.156/26 brd 10.127.15.191 scope global noprefixroute bond0.3530
       valid_lft forever preferred_lft forever
    inet6 2409:8c00:7821:4000::a7f:f9c/122 scope global noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::b9ed:8ee:b44d:286e/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
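10.127.15.182 is not one of this host's addresses, yet the local routing table still holds an entry for it, so traffic to it is delivered to the loopback interface. This is the root cause.
Resolution
Delete the stale route entry: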
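The two observations that pin this down (the route resolves as local, but the address is not configured on any interface) can be checked mechanically. This is a minimal sketch under stated assumptions: `route_target` and `address_assigned` are hypothetical helper names, and they assume the `ip route get` and `ip a` output layouts shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ip route get ADDR` output on stdin and prints
# "local" when the kernel treats ADDR as one of this host's own addresses,
# otherwise the name of the outgoing device.
route_target() {
  awk '
    NR == 1 {
      if ($1 == "local") { print "local"; exit }
      for (i = 1; i < NF; i++)
        if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Hypothetical helper: address_assigned ADDR
# Reads `ip a` output on stdin and succeeds only if ADDR appears as an
# inet or inet6 address on some interface.
address_assigned() {
  awk -v addr="$1" '
    {
      for (i = 1; i < NF; i++)
        if ($i == "inet" || $i == "inet6") {
          # Drop the /prefix length before comparing.
          split($(i + 1), parts, "/")
          if (parts[1] == addr) found = 1
        }
    }
    END { exit (found ? 0 : 1) }'
}
```

An address for which `ip route get ADDR | route_target` prints `local` while `ip a | address_assigned ADDR` fails matches the inconsistent state found on this node.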
$ sudo ip route delete table local 10.127.15.182 dev bond0.3530 src 10.127.15.182
Or restart the network service:
$ sudo systemctl restart NetworkManager
