Troubleshooting a Kubernetes cluster installed with Rancher RKE on multi-NIC Huawei Cloud servers


Error:

[[network] Host [192.168.0.213] is not able to connect to the following ports: [192.168.0.213:2379]. Please check network policies and firewall rules]

Problem:

    root@hwy-isms-210-66:~# gotelnet 172.17.210.66 2379
    map[2379:failed]
    root@hwy-isms-210-66:~# gotelnet 127.0.0.1 2379
    map[2379:success]
    root@hwy-isms-210-66:~# docker ps
    CONTAINER ID   IMAGE                          COMMAND                  CREATED       STATUS       PORTS                                                    NAMES
    b6f75ff566d5   rancher/rke-tools:v0.1.96      "/docker-entrypoint.…"   6 hours ago   Up 6 hours   80/tcp, 0.0.0.0:10250->1337/tcp                          rke-worker-port-listener
    ac3e20c949df   rancher/rke-tools:v0.1.96      "/docker-entrypoint.…"   6 hours ago   Up 6 hours   80/tcp, 0.0.0.0:6443->1337/tcp                           rke-cp-port-listener
    e106814143a3   rancher/rke-tools:v0.1.96      "/docker-entrypoint.…"   6 hours ago   Up 6 hours   80/tcp, 0.0.0.0:2379->1337/tcp, 0.0.0.0:2380->1337/tcp   rke-etcd-port-listener
    6a866546f8bb   rancher/rancher-agent:v2.8.5   "run.sh --server htt…"   6 hours ago   Up 6 hours                                                            peaceful_albattani
    9bbffd35d9a4   rancher/rancher-agent:v2.8.5   "run.sh --server htt…"   6 hours ago   Up 6 hours                                                            confident_fermi
    root@hwy-isms-210-66:~# ifconfig
    docker0: flags=4163 mtu 1500
            inet 172.18.0.1 netmask 255.255.0.0 broadcast 172.18.255.255
            ether a6:c3:99:d0:cf:03 txqueuelen 0 (Ethernet)
            RX packets 3547 bytes 100789 (98.4 KiB)
            RX errors 0 dropped 0 overruns 0 frame 0
            TX packets 86 bytes 5196 (5.0 KiB)
            TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
    eth0: flags=4163 mtu 1500
            inet 172.17.210.66 netmask 255.255.255.0 broadcast 172.17.210.255
            ether fa:16:3e:40:01:71 txqueuelen 1000 (Ethernet)
            RX packets 122941811 bytes 23935288095 (22.2 GiB)
            RX errors 0 dropped 0 overruns 0 frame 0
            TX packets 127262310 bytes 14351697946 (13.3 GiB)
            TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
    eth1: flags=4163 mtu 1500
            inet 172.17.210.67 netmask 255.255.255.0 broadcast 172.17.210.255
            ether fa:16:3e:40:01:72 txqueuelen 1000 (Ethernet)
            RX packets 207177 bytes 17420004 (16.6 MiB)
            RX errors 0 dropped 0 overruns 0 frame 0
            TX packets 202098 bytes 20182560 (19.2 MiB)
            TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
    eth2: flags=4163 mtu 1500
            inet 172.17.210.68 netmask 255.255.255.0 broadcast 172.17.210.255
            ether fa:16:3e:40:01:73 txqueuelen 1000 (Ethernet)
            RX packets 180108 bytes 15241156 (14.5 MiB)
            RX errors 0 dropped 0 overruns 0 frame 0
            TX packets 248119 bytes 22751922 (21.6 MiB)
            TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
    lo: flags=73 mtu 65536
            inet 127.0.0.1 netmask 255.0.0.0
            loop txqueuelen 1000 (Local Loopback)
            RX packets 1352589 bytes 102392483 (97.6 MiB)
            RX errors 0 dropped 0 overruns 0 frame 0
            TX packets 1352589 bytes 102392483 (97.6 MiB)
            TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
    veth13ea56c: flags=4163 mtu 1500
            ether 7a:fc:db:8f:3c:0f txqueuelen 0 (Ethernet)
            RX packets 59 bytes 3636 (3.5 KiB)
            RX errors 0 dropped 0 overruns 0 frame 0
            TX packets 73 bytes 4338 (4.2 KiB)
            TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
    veth6b767de: flags=4163 mtu 1500
            ether 7e:17:74:fd:a7:27 txqueuelen 0 (Ethernet)
            RX packets 3 bytes 126 (126.0 B)
            RX errors 0 dropped 0 overruns 0 frame 0
            TX packets 6 bytes 412 (412.0 B)
            TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
    vethf9165ed: flags=4163 mtu 1500
            ether f6:46:67:c2:93:2e txqueuelen 0 (Ethernet)
            RX packets 3 bytes 126 (126.0 B)
            RX errors 0 dropped 0 overruns 0 frame 0
            TX packets 9 bytes 538 (538.0 B)
            TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
    root@hwy-isms-210-66:~# cat /etc/rc.local
    #!/bin/sh -e
    # rc.local
    # Routing configuration commands run at boot
    ip route add default via 172.17.210.1 dev eth0 table 10
    ip route add 172.17.210.0/24 dev eth0 table 10
    ip rule add from 172.17.210.66 table 10
    ip route add default via 172.17.210.1 dev eth1 table 20
    ip route add 172.17.210.0/24 dev eth1 table 20
    ip rule add from 172.17.210.67 table 20
    ip route add default via 172.17.210.1 dev eth2 table 30
    ip route add 172.17.210.0/24 dev eth2 table 30
    ip rule add from 172.17.210.68 table 30
    exit 0
    root@hwy-isms-210-66:~#

Why does 127.0.0.1:2379 connect while 172.17.210.66:2379 does not, even though other servers on the same subnet can telnet to port 2379 on 172.17.210.66?

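The root cause is the policy-routing configuration. In detail:

- `/etc/rc.local` sets up multi-NIC policy routing that forces each source IP to use its own routing table.
- Traffic sent from 172.17.210.66 is matched by `ip rule add from 172.17.210.66 table 10` and looked up in table 10.
- Table 10 probably lacks a route to the docker0 bridge network (172.18.0.0/16). Port 2379 is published by a container behind docker0, so a local connection to 172.17.210.66:2379 would be forwarded to that container and then follow table 10's default route out of eth0 instead of reaching docker0.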

    root@hwy-isms-210-66:~# ip route list
    default via 172.17.210.1 dev eth0 proto dhcp metric 100
    169.254.169.254 via 172.17.210.1 dev eth0 proto dhcp metric 100
    172.17.210.0/24 dev eth0 proto kernel scope link src 172.17.210.66 metric 100
    172.17.210.0/24 dev eth1 proto kernel scope link src 172.17.210.67 metric 101
    172.17.210.0/24 dev eth2 proto kernel scope link src 172.17.210.68 metric 102
    172.18.0.0/16 dev docker0 proto kernel scope link src 172.18.0.1
    root@hwy-isms-210-66:~# ip rule list
    0:      from all lookup local
    32763:  from 172.17.210.68 lookup 30
    32764:  from 172.17.210.67 lookup 20
    32765:  from 172.17.210.66 lookup 10
    32766:  from all lookup main
    32767:  from all lookup default
    root@hwy-isms-210-66:~# ip route show table 10
    default via 172.17.210.1 dev eth0
    172.17.210.0/24 dev eth0 scope link
    root@hwy-isms-210-66:~#
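To confirm which way the kernel would send such a packet, `ip route get` can be queried with an explicit source address. The helper below is a minimal sketch: `egress_dev` is a hypothetical name, and 172.18.0.2 is an assumed container address on docker0 (the real one can be read from `docker inspect`).

```bash
# egress_dev DST SRC
# Print the outgoing interface the kernel would pick for a packet
# from SRC to DST, taking `ip rule` policy routing into account.
egress_dev() {
  ip route get "$1" from "$2" |
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Usage (172.18.0.2 is an assumed container IP, not taken from the session):
#   egress_dev 172.18.0.2 172.17.210.66
```

If this reports eth0 rather than docker0, the lookup is being answered by table 10's default route, which matches the analysis above.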

Solution:

Option 1: add a route for the docker0 subnet to routing table 10

    ip route add 172.18.0.0/16 dev docker0 table 10
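Since `/etc/rc.local` runs under `sh -e`, an `ip route add` that fails because the route already exists would abort the rest of the script. A boot-time variant might therefore use `ip route replace`, which creates or overwrites the entry. The sketch below is an assumption-laden illustration: `ensure_docker_route` and `ensure_docker_routes` are hypothetical helper names, and extending the fix to tables 20 and 30 goes beyond what was actually done here (only table 10 was changed).

```bash
# ensure_docker_route TABLE
# Idempotently add the docker0 subnet to one policy-routing table.
ensure_docker_route() {
  ip route replace 172.18.0.0/16 dev docker0 table "$1"
}

# ensure_docker_routes TABLE...
# Apply the route to every table given; stop at the first failure.
ensure_docker_routes() {
  for table in "$@"; do
    ensure_docker_route "$table" || return 1
  done
}

# Usage (run as root, after the Docker service has created docker0):
#   ensure_docker_routes 10 20 30
```

The docker0 interface has to exist before the route can be installed, so this would need to run after Docker has started rather than early in boot.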

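With the problem above fixed, the k8s cluster itself came up normally, but the Calico network plugin still could not be installed properly, which left networking broken across the whole cluster.

The root cause is again the policy-routing rules above. Calico writes the routes it maintains into the `main` routing table, but packets whose source address is 172.17.210.66 are looked up in table 10, which contains none of those routes, so Calico networking fails.

Deleting the rule created by `ip rule add from 172.17.210.66 table 10` resolves the problem. Alternatively, Calico can be configured to write the routes it maintains into table 10.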
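Removing the rule at runtime could look like the sketch below. `drop_source_rule` is a hypothetical helper; it only changes the running kernel state, so the matching `ip rule add` line in `/etc/rc.local` would also have to be removed for the change to survive a reboot.

```bash
# drop_source_rule SRC TABLE
# Delete the policy rule "from SRC lookup TABLE" if it is present.
drop_source_rule() {
  if ip rule list |
    awk -v s="$1" -v t="$2" \
      '$2 == "from" && $3 == s && $4 == "lookup" && $5 == t { found = 1 }
       END { exit !found }'
  then
    ip rule del from "$1" table "$2"
  else
    echo "no rule from $1 to table $2" >&2
  fi
}

# Usage (run as root):
#   drop_source_rule 172.17.210.66 10
```

Checking for the rule first keeps the helper safe to re-run, because `ip rule del` returns an error when no matching rule exists.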

To make Calico write the routes it maintains **only** to `table 10` instead of the `main` table, Calico's configuration has to be adjusted. The steps are as follows:

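### **1. Modify Calico's `FelixConfiguration`**
Calico writes some routes to the `main` table by default, but the `RouteTableRange` setting can be used to restrict the range of routing tables.

1. Check the current FelixConfiguration:

       kubectl get felixconfiguration default -o yaml

2. Modify the `RouteTableRange` setting:

       kubectl patch felixconfiguration default \
         --type='merge' \
         -p '{"spec":{"routeTableRange":{"min":10,"max":10}}}'

   - With this setting, Calico is expected to use only `table 10` and not write to the `main` table. Calico's documentation describes this range as the set of additional routing tables Calico may program, so confirm on a test node that ordinary workload routes actually move.

3. Verify that the change took effect:
   - Wait for Calico to re-apply the configuration (restarting the `calico-node` Pods may be necessary).
   - Check the routing tables:

         ip route show table 10   # should contain all Calico routes
         ip route show table main # should no longer contain routes maintained by Calico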

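### **2. Make sure `ip rule` points to `table 10`**
Calico should add `ip rule` entries so that Pod traffic is looked up in `table 10`. Check:

    ip rule list

Expected output is similar to:

    0:      from all lookup local
    32765:  from 10.42.213.64/26 lookup 10  # Calico's rule
    32766:  from all lookup main

- If the rule for `table 10` is missing, Calico's configuration may need adjusting, or the rule may have to be added manually.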

---

### **3. Adjust the BIRD configuration (if BGP is used)**
If Calico uses BGP to distribute routes, the BIRD configuration may also need adjusting so that it operates only on `table 10`.

#### **Edit the `calico-node` `configmap`:**

    kubectl -n kube-system edit configmap calico-config

In the `bird_template` section, make sure the configuration resembles the following:

    protocol kernel {
      learn;
      scan time 10;
      import all;
      export all;
      kernel table 10;  # explicitly use table 10
    }

### **4. Restart the `calico-node` Pods to apply the configuration**

    kubectl -n kube-system rollout restart daemonset calico-node

### **5. Clean up leftover Calico routes in the `main` table**
If Calico routes remain in the `main` table, they can be deleted manually (proceed with caution):

```bash
ip route del blackhole 10.42.213.64/26 proto bird
ip route del 10.42.213.65 dev calid6b141b5a7c scope link

# Delete any other Calico routes that are no longer needed...
```

---

### **Verification**
- Check that `table 10` contains all Calico routes:
```bash
ip route show table 10
```
- Check that the `main` table no longer contains Calico routes:
```bash
ip route show table main
```
- Make sure Pod networking still works.
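
The two table checks can be reduced to a count, which is easier to compare before and after the change. This is a rough sketch: `calico_route_count` is a hypothetical helper, and it assumes Calico-managed routes can be recognized by `proto bird` or a `cali*` interface name, as in the leftover routes shown earlier.

```bash
# calico_route_count TABLE
# Count routes in TABLE that look Calico-managed: installed by BIRD
# ("proto bird") or pointing at a cali* workload interface.
calico_route_count() {
  # grep -c exits non-zero on a count of 0, so mask that exit status.
  ip route show table "$1" | grep -cE 'proto bird|dev cali' || true
}

# Usage:
#   calico_route_count 10
#   calico_route_count main
```

Routes installed by other means (for example a different `proto` value) would not be counted, so a zero here is a hint rather than proof.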

---

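### **Notes**
- **Risk of network interruption**: changing routing tables can briefly interrupt networking, so perform the change in a maintenance window.
- **CNI compatibility**: some CNI plugins or network policies may depend on the `main` table, so test for compatibility.
- **Back up the routing tables**: back up the current routing tables before making changes: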
```bash
ip route save > ip_route_backup.txt
ip rule save > ip_rule_backup.txt
```
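
`ip route save` without a selector covers the `main` table, while the changes in this procedure also touch table 10. A per-table backup might look like the sketch below; `backup_tables`, the file names, and the backup directory are all hypothetical choices. The dumps are raw binary data intended for `ip route restore` and `ip rule restore`, not readable text.

```bash
# backup_tables DIR TABLE...
# Save each listed routing table, plus the policy rules, into DIR.
backup_tables() {
  dir=$1
  shift
  mkdir -p "$dir" || return 1
  for table in "$@"; do
    ip route save table "$table" > "$dir/route_table_$table.bin" || return 1
  done
  ip rule save > "$dir/rules.bin"
}

# Usage (run as root; the directory is an arbitrary example):
#   backup_tables /root/route-backup main 10 20 30
# Restore a single table later with:
#   ip route restore < /root/route-backup/route_table_10.bin
```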

If problems remain, collect the output of `ip rule list` and the latest routing tables for further troubleshooting.
