A Brief Analysis of Node Failures Caused by externalIP in k8s IPVS Mode
Background
In a k8s cluster, as soon as a Service's externalIP is set to the IP of any node in the cluster, components such as calico, kubelet, and kube-proxy can no longer communicate with the apiserver.
Environment
Hostname | IP |
---|---|
k8s-master-1(k8s-v1.20.10) | 192.168.0.10 |
k8s-node-1(k8s-v1.20.10) | 192.168.0.11 |
```
# Pod CIDR:     10.70.0.0/16
# Service CIDR: 10.0.0.0/16
```
Symptoms
```
# Test yaml
[root@k8s-master-1 externalip]# cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: httpd
  template:
    metadata:
      labels:
        app: httpd
    spec:
      containers:
      - name: busybox
        image: busybox:1.28
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh","-c","echo 'this is httpd-v1' > /var/www/index.html; httpd -f -h /var/www"]
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: busybox
spec:
  externalIPs:
  - 192.168.0.11
  type: ClusterIP
  ports:
  - port: 8888
    targetPort: 80
    protocol: TCP
  selector:
    app: httpd
```
```
# Check the cluster status
[root@k8s-master-1 yaml]# kubectl get pods -A
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-5855d94c7d-xz45j   1/1     Running   0          29s
kube-system   calico-node-4ftkk                          1/1     Running   0          28s
kube-system   calico-node-pcsw6                          1/1     Running   0          28s
kube-system   coredns-6f4c9cb7c5-2wsww                   1/1     Running   0          13s
```
```
# Deploy the Service with the externalIP
[root@k8s-master-1 externalip]# kubectl apply -f deployment.yaml

# Check the pods a moment later: calico-node can no longer become Ready
[root@k8s-master-1 yaml]# kubectl get pods -A
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
default       busybox-58984c55cc-44b6c                   0/1     Pending   0          3m21s
kube-system   calico-kube-controllers-5855d94c7d-xz45j   1/1     Running   0          5m13s
kube-system   calico-node-4ftkk                          0/1     Running   0          5m12s
kube-system   calico-node-pcsw6                          1/1     Running   0          5m12s
kube-system   coredns-6f4c9cb7c5-2wsww                   1/1     Running   0          4m57s

# Check the logs on k8s-node-1
==> kube-proxy.INFO <==
I0420 16:32:40.293230    4655 service.go:275] Service default/busybox updated: 1 ports
I0420 16:32:40.293615    4655 service.go:390] Adding new service port "default/busybox" at 10.0.237.220:8888/TCP
I0420 16:32:40.367114    4655 proxier.go:2243] Opened local port "externalIP for default/busybox" (192.168.0.11:8888/tcp)

==> kube-proxy.k8s-node-1.root.log.INFO.20220420-161329.4655 <==
I0420 16:32:40.293230    4655 service.go:275] Service default/busybox updated: 1 ports
I0420 16:32:40.293615    4655 service.go:390] Adding new service port "default/busybox" at 10.0.237.220:8888/TCP
I0420 16:32:40.367114    4655 proxier.go:2243] Opened local port "externalIP for default/busybox" (192.168.0.11:8888/tcp)

==> kubelet.ERROR <==
E0420 16:32:57.962067    4333 controller.go:187] failed to update lease, error: Put "https://192.168.0.10:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/k8s-node-1?timeout=10s": context deadline exceeded
E0420 16:32:59.810776    4333 kubelet_node_status.go:470] Error updating node status, will retry: error getting node "k8s-node-1": Get "https://192.168.0.10:6443/api/v1/nodes/k8s-node-1?resourceVersion=0&timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers), ReportingInstance:""}': 'Post "https://192.168.0.10:6443/api/v1/namespaces/default/events": dial tcp 192.168.0.10:6443: connect: connection refused'(may retry after sleeping)
E0420 13:09:58.810236    6420 kubelet.go:2263] node "k8s-node-1" not found
E0420 13:13:50.096947    8005 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: Get "https://192.168.0.10:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp 192.168.0.10:6443: connect: connection refused
E0420 13:14:47.641827    8005 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://192.168.0.10:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/k8s-node-1?timeout=10s": dial tcp 192.168.0.10:6443: connect: connection refused

# Test the apiserver port from k8s-node-1: it is no longer reachable either
[root@k8s-node-1 kubernetes]# telnet 192.168.0.10 6443
Trying 192.168.0.10...
^C
```
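For reference, removing the externalIP again should undo the damage, since kube-proxy resyncs and drops the address from kube-ipvs0 on every node. kubectl on k8s-master-1 still works in this state (its apiserver is local), so a patch along the following lines is one way out; this is a sketch, with the Service name taken from the manifest above:

```
[root@k8s-master-1 ~]# kubectl patch svc busybox -p '{"spec":{"externalIPs":null}}'
```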
Analysis
Service information
```
# Check the SVC (the externalIP has been changed to an address that belongs to neither node)
[root@k8s-master-1 ~]# kubectl get svc
NAME         TYPE        CLUSTER-IP    EXTERNAL-IP    PORT(S)    AGE
busybox      ClusterIP   10.0.238.86   192.168.0.15   8888/TCP   4h16m
kubernetes   ClusterIP   10.0.0.1      <none>         443/TCP    186d

# Check the pods
[root@k8s-master-1 ~]# kubectl get pods -A -o wide
NAMESPACE     NAME                                       READY   IP             NODE
default       busybox-58984c55cc-2jgmv                   1/1     10.70.2.65     k8s-master-1
kube-system   calico-kube-controllers-5855d94c7d-lzskg   1/1     192.168.0.10   k8s-master-1
kube-system   calico-node-djj49                          1/1     192.168.0.11   k8s-node-1
kube-system   calico-node-hr9vf                          1/1     192.168.0.10   k8s-master-1
kube-system   coredns-6f4c9cb7c5-vrbgw                   1/1     10.70.2.71     k8s-master-1
```
Network interface information
```
# k8s-master-1 interfaces
[root@k8s-master-1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:34:ce:c5 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.10/24 brd 192.168.0.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::624c:c1db:e3b4:9165/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
    inet 10.70.2.64/32 scope global tunl0
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:2c:39:4d:d5 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
5: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c6:d8:18:d4:90:5a brd ff:ff:ff:ff:ff:ff
6: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether 4a:f3:f8:f2:a6:aa brd ff:ff:ff:ff:ff:ff
    # The addresses on kube-ipvs0 are the cluster's Service IPs, and every node carries them
    inet 10.0.238.86/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 192.168.0.15/32 scope global kube-ipvs0   # the externalIP; setting it to one of the two node IPs breaks IPVS forwarding
       valid_lft forever preferred_lft forever
    inet 10.0.0.1/32 scope global kube-ipvs0       # traffic to this IP is forwarded to the apiserver
       valid_lft forever preferred_lft forever
    inet 10.0.0.10/32 scope global kube-ipvs0      # DNS
       valid_lft forever preferred_lft forever
9: calia2fcccbef15@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
10: califb8bd460169@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever

# k8s-node-1 interfaces
[root@k8s-node-1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:25:c5:0b brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.11/24 brd 192.168.0.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fe25:c50b/64 scope link
       valid_lft forever preferred_lft forever
3: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
    inet 10.70.2.0/32 scope global tunl0
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:fd:f2:6b:91 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
5: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 7e:07:bb:db:5f:a6 brd ff:ff:ff:ff:ff:ff
6: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether 5e:ea:10:20:21:f9 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.10/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.238.86/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 192.168.0.15/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.0.1/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
```
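To list only the Service addresses bound to the dummy interface rather than scrolling through the full `ip addr` output, a one-liner along these lines should work on either node (a convenience sketch, not part of the original capture):

```
# prints one Service/externalIP CIDR per line, matching the inet entries shown above
[root@k8s-master-1 ~]# ip -4 -o addr show dev kube-ipvs0 | awk '{print $4}'
```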
IPVS information
```
# k8s-master-1 IPVS rules
[root@k8s-master-1 ~]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.0.15:8888 rr         # 192.168.0.15:8888 is forwarded to the pod IP 10.70.2.65:80
  -> 10.70.2.65:80                Masq    1      0          0
TCP  10.0.0.1:443 rr
  -> 192.168.0.10:6443            Masq    1      2          0
TCP  10.0.0.10:53 rr
  -> 10.70.2.71:53                Masq    1      0          0
TCP  10.0.0.10:9153 rr
  -> 10.70.2.71:9153              Masq    1      0          0
TCP  10.0.238.86:8888 rr
  -> 10.70.2.65:80                Masq    1      0          0
UDP  10.0.0.10:53 rr
  -> 10.70.2.71:53                Masq    1      0          0

# k8s-node-1 IPVS rules
[root@k8s-node-1 ~]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.0.15:8888 rr
  -> 10.70.2.65:80                Masq    1      0          0
TCP  10.0.0.1:443 rr
  -> 192.168.0.10:6443            Masq    1      0          0
TCP  10.0.0.10:53 rr
  -> 10.70.2.71:53                Masq    1      0          0
TCP  10.0.0.10:9153 rr
  -> 10.70.2.71:9153              Masq    1      0          0
TCP  10.0.238.86:8888 rr
  -> 10.70.2.65:80                Masq    1      0          0
UDP  10.0.0.10:53 rr
  -> 10.70.2.71:53                Masq    1      0          0
```
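ipvsadm can also list a single virtual service instead of the whole table; to look at just the externalIP entry above, something like the following should do (a sketch, using the same externalIP and port):

```
# list only the 192.168.0.15:8888 virtual server and its real server (10.70.2.65:80)
[root@k8s-master-1 ~]# ipvsadm -ln -t 192.168.0.15:8888
```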
Reproducing the kube-ipvs0 behavior
```
# Check the interfaces
[root@boy ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:ec:1c:2d brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.10/24 brd 192.168.0.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::624c:c1db:e3b4:9165/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: ens36: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:ec:1c:37 brd ff:ff:ff:ff:ff:ff
    inet 10.70.2.199/24 brd 10.70.2.255 scope global noprefixroute ens36
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:feec:1c37/64 scope link
       valid_lft forever preferred_lft forever

# Bring ens36 down, then add an IP to it
[root@boy ~]# ip link set ens36 down
[root@boy ~]# ip addr add 192.168.0.11/32 dev ens36

# Check the interfaces again
[root@boy ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:ec:1c:2d brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.10/24 brd 192.168.0.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::624c:c1db:e3b4:9165/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: ens36: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 00:0c:29:ec:1c:37 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.11/32 scope global ens36
       valid_lft forever preferred_lft forever

# Test connectivity to 192.168.0.11 (ens36 now behaves much like kube-ipvs0)
[root@boy ~]# ping 192.168.0.11
PING 192.168.0.11 (192.168.0.11) 56(84) bytes of data.
64 bytes from 192.168.0.11: icmp_seq=1 ttl=64 time=0.039 ms
64 bytes from 192.168.0.11: icmp_seq=2 ttl=64 time=0.069 ms

# Capture on the loopback interface: because 192.168.0.11 is a local address (even though the
# interface is down), all traffic to 192.168.0.11 goes through lo
[root@boy ~]# tcpdump -i lo icmp -Nnvv
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
11:59:59.871594 IP (tos 0x0, ttl 64, id 37520, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.11 > 192.168.0.11: ICMP echo request, id 1752, seq 8, length 64
11:59:59.871622 IP (tos 0x0, ttl 64, id 37521, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.0.11 > 192.168.0.11: ICMP echo reply, id 1752, seq 8, length 64
12:00:00.871450 IP (tos 0x0, ttl 64, id 37555, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.0.11 > 192.168.0.11: ICMP echo request, id 1752, seq 9, length 64
12:00:00.871478 IP (tos 0x0, ttl 64, id 37556, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.0.11 > 192.168.0.11: ICMP echo reply, id 1752, seq 9, length 64
```
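The reason this traffic never leaves the box is the kernel's local routing table: every address assigned to any interface gets a `local` route, and locally routed traffic is delivered via lo. A quick check on the same test host should show something along these lines (the exact output format varies by iproute2 version):

```
# the local routing table claims the address...
[root@boy ~]# ip route show table local | grep 192.168.0.11
local 192.168.0.11 dev ens36 proto kernel scope host src 192.168.0.11

# ...so the routing decision for it resolves to lo
[root@boy ~]# ip route get 192.168.0.11
local 192.168.0.11 dev lo src 192.168.0.11
```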
Root cause
```
# Change the externalIP to the MASTER IP
[root@k8s-master-1 ~]# kubectl get svc
NAME         TYPE        CLUSTER-IP    EXTERNAL-IP    PORT(S)    AGE
busybox      ClusterIP   10.0.238.86   192.168.0.10   8888/TCP   5h30m
kubernetes   ClusterIP   10.0.0.1      <none>         443/TCP    186d

# k8s-master-1 pings k8s-node-1
[root@k8s-master-1 ~]# ping 192.168.0.11
PING 192.168.0.11 (192.168.0.11) 56(84) bytes of data.
From 192.168.0.10 icmp_seq=1 Destination Host Unreachable
From 192.168.0.10 icmp_seq=2 Destination Host Unreachable
From 192.168.0.10 icmp_seq=3 Destination Host Unreachable
From 192.168.0.10 icmp_seq=4 Destination Host Unreachable
From 192.168.0.10 icmp_seq=5 Destination Host Unreachable
From 192.168.0.10 icmp_seq=6 Destination Host Unreachable

# Capture on k8s-master-1
[root@k8s-master-1 ~]# tcpdump -i any arp -Nvvn
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
23:23:59.687475 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 28
23:24:00.711622 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 28
23:24:01.736505 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 28
23:24:02.758823 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 28
23:24:03.783078 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 28
23:24:04.806981 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 28
23:24:05.831077 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 28
23:24:06.855043 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 28
23:24:07.878912 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 28
23:24:08.903272 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 28

# Capture on k8s-node-1
[root@k8s-node-1 ~]# tcpdump -i any arp -Nvn
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
23:24:02.732899 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 46
23:24:03.756971 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 46
23:24:04.780764 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 46
23:24:05.804609 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 46
23:24:06.828534 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 46
23:24:07.852242 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.11 tell 192.168.0.10, length 46
```
The captures above show that k8s-master-1 sends the ARP Request and that k8s-node-1 does receive it. However, because the externalIP is an address that k8s-node-1 itself also owns, k8s-node-1 hands k8s-master-1's ARP Request over to lo; lo passes the packet up the protocol stack, which delivers it to applications (in reality no program on k8s-node-1 wants this packet). The net effect is that k8s-master-1 never receives an ARP reply (the response is consumed by k8s-node-1 itself), so it cannot learn k8s-node-1's MAC address, and the cluster becomes abnormal.
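The failed resolution is also visible from the master's side in the neighbor table; a quick check (state strings vary by kernel) would look something like this:

```
# the entry for k8s-node-1 never resolves to a MAC address
[root@k8s-master-1 ~]# ip neigh show 192.168.0.11
192.168.0.11 dev ens33  FAILED
```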
- Every Service IP in k8s is bound to the kube-ipvs0 interface on every node in the cluster. On any node, ARP, ping, and other traffic destined for such an IP is delivered to the local lo interface; only traffic to the IP's specific Service port is forwarded by IPVS to a port on a backend Pod.
- Why the components on k8s-node-1 fail: 192.168.0.10 now also exists on k8s-node-1 (bound to kube-ipvs0), so when kubelet and kube-proxy on k8s-node-1 try to talk to the apiserver on k8s-master-1, the traffic is steered to port 6443 on the local machine instead; since IPVS has no rule forwarding that address and port, k8s-node-1's components can no longer reach k8s-master-1. (See the verification sketch below.)
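Both points can be checked directly on k8s-node-1 with the routing table and a plain connection attempt; a sketch under the same setup (outputs are approximate):

```
# 192.168.0.10 is now a local address on k8s-node-1, so the route to the apiserver
# resolves into the local table and the traffic never leaves the node
[root@k8s-node-1 ~]# ip route get 192.168.0.10
local 192.168.0.10 dev lo src 192.168.0.10

# nothing on k8s-node-1 listens on 6443, hence the "connection refused" seen in the kubelet logs
[root@k8s-node-1 ~]# curl -k https://192.168.0.10:6443/healthz
curl: (7) Failed to connect to 192.168.0.10 port 6443: Connection refused
```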