
Incident Write-Up: K8s Cluster etcd Pod Down, Snapshot Lost (No Backup) After a Forced VM Power-Off


Foreword

  • I accidentally pulled the wrong power cable; the VMs were forcibly shut down, and the cluster was dead after they came back up
  • This post records the workaround
  • The power loss destroyed the etcd snapshot data, and there was no backup, so there is essentially no clean fix
  • You could bring in a professional DBA to see whether the data can be recovered
  • The workaround in this post deletes some of the files in the etcd data directory
  • The cluster does start afterwards, but all deployed workload data is lost, including the CNI and the cluster's built-in DNS component
  • If my understanding falls short anywhere, corrections are welcome
  • Whether in production or in testing, back up your K8s cluster's etcd. Back up etcd. Back up etcd. It matters enough to say three times.
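To make the backup advice concrete, here is a minimal sketch of taking an etcd snapshot with etcdctl, assuming the default kubeadm certificate paths and the local client endpoint (adjust both for your cluster; the guard around etcdctl is only there so the sketch can be dry-run safely on a machine without it):

```shell
# Minimal etcd backup sketch. Assumptions: default kubeadm cert paths
# (/etc/kubernetes/pki/etcd/...) and the local client endpoint on :2379.
SNAP="/tmp/etcd-snap-$(date +%Y%m%d%H%M%S).db"

if command -v etcdctl >/dev/null 2>&1; then
    # etcdctl (API v3) writes a point-in-time copy of the backend db to $SNAP.
    ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
        --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        || echo "snapshot save failed (is etcd running?)"
else
    # No etcdctl on this machine: report and do nothing.
    echo "etcdctl not found; skipping snapshot"
fi
```

Run something like this from a cron job or systemd timer, and copy the snapshot off the node afterwards; a backup that lives on the same disk dies with it.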

I wanted only to try to live in accord with the promptings which came from my true self. Why was that so very difficult? ------ Hermann Hesse, Demian


Current cluster state

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
The connection to the server 192.168.26.81:6443 was refused - did you specify the right host or port?

Restart docker and kubelet to try to bring things up

┌──[root@vms81.liruilongs.github.io]-[~]
└─$systemctl restart docker
┌──[root@vms81.liruilongs.github.io]-[~]
└─$systemctl restart kubelet.service

Still no luck. Check the kubelet logs on the master node.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$journalctl -u kubelet.service -f
119 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.703418   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
119 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.804201   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
119 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.905156   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
119 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.005487   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
119 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.105648   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
119 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.186066   11344 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://192.168.26.81:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vms81.liruilongs.github.io?timeout=10s": dial tcp 192.168.26.81:6443: connect: connection refused
119 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.205785   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"

Use docker to inspect the pods that currently exist

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps
CONTAINER ID   IMAGE                                               COMMAND                  CREATED          STATUS              PORTS   NAMES
d9d6471ce936   b51ddc1014b0                                        "kube-scheduler --au…"   17 minutes ago   Up 17 minutes               k8s_kube-scheduler_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_14
010c1b8c30c6   5425bcbd23c5                                        "kube-controller-man…"   17 minutes ago   Up 17 minutes               k8s_kube-controller-manager_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_15
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up About a minute           k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
f557435d150e   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up 18 minutes               k8s_POD_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_7
5deaffbc555a   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up 18 minutes               k8s_POD_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_7
a418c2ce33f2   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up 18 minutes               k8s_POD_kube-apiserver-vms81.liruilongs.github.io_kube-system_a35cb37b6c90c72f607936b33161eefe_6

etcd has not started, and neither has the kube-apiserver.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
b5e18722315b   004811815584                                        "etcd --advertise-cl…"   5 minutes ago    Exited (2) About a minute ago   k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 21 minutes ago   Up 4 minutes                    k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7

Try restarting etcd

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker restart b5e18722315b
b5e18722315b

Check whether it stayed up

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
b5e18722315b   004811815584                                        "etcd --advertise-cl…"   5 minutes ago    Exited (2) About a minute ago   k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 21 minutes ago   Up 4 minutes                    k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker logs b5e18722315b

Take a look at the etcd logs

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker logs 8a53cbc545e4
..................................................
{"level":"info","ts":"2023-01-19T01:34:24.332Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"5.557212ms"}
{"level":"warn","ts":"2023-01-19T01:34:24.332Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000014-0000000000185aba.wal.broken"}
{"level":"info","ts":"2023-01-19T01:34:24.770Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":26912747,"snapshot-size":"42 kB"}
{"level":"warn","ts":"2023-01-19T01:34:24.771Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":26912747,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000019aa7eb.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2023-01-19T01:43:31.738Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/home/remote/sbatsche/.gvm/gos/go1.16.3/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000114600, 0xc000588240, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000080960, 0x122e2fc, 0x2a, 0xc000588240, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7ffe54af1e25, 0x1a, 0x0, 0x0, 0x0, 0x0, 0xc0004cf830, 0x1, 0x1, 0xc0004cfa70, ...)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc0000ee000, 0xc0000ee600, 0x0, 0x0)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc0000ee000, 0x1202a6f, 0x6, 0xc000428401, 0x2)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a120, 0x12, 0x12)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a120, 0x12, 0x12)
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40 +0x11f
main.main()
	/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32 +0x45

"msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","

“msg”: “从快照恢复v3后台失败”, “error”: “未能找到数据库快照文件(snap: 快照文件不存在)”,"

断电照成数据文件损坏了,它希望从快照中恢复,但是没有快照。

With no backup here, there is basically no way to repair this properly. The only real option left is to reset the cluster with kubeadm.
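For contrast, this is roughly what recovery looks like when a snapshot does exist. A sketch only: the snapshot path is hypothetical, and the --name/--initial-* values are taken from the etcd.yaml flags shown later in this post.

```shell
# Sketch: restoring etcd FROM A SNAPSHOT, which is exactly what was impossible
# here because no backup existed. SNAPSHOT is a hypothetical file path.
SNAPSHOT=/tmp/etcd-backup.db           # hypothetical backup taken earlier
RESTORE_DIR=/var/lib/etcd-from-backup  # fresh data dir to restore into

if [ -f "$SNAPSHOT" ] && command -v etcdctl >/dev/null 2>&1; then
    # The member flags must match the flags in the etcd.yaml static pod manifest.
    ETCDCTL_API=3 etcdctl snapshot restore "$SNAPSHOT" \
        --data-dir="$RESTORE_DIR" \
        --name=vms81.liruilongs.github.io \
        --initial-advertise-peer-urls=https://192.168.26.81:2380 \
        --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380 \
        || echo "restore failed"
    # Afterwards, point --data-dir in etcd.yaml at $RESTORE_DIR and let the
    # kubelet recreate the static pod.
else
    echo "no snapshot or no etcdctl available; nothing to restore"
fi
```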

Some remedial steps

If you want to bring the cluster up by some other means in order to salvage some of its configuration, you can try the approach below. Fair warning: when I used it on my cluster, all the pod data was lost and I still had to reset the cluster in the end.

If you do try the approach below, make sure you back up the etcd data files before deleting anything.

etcd on the master runs as a static pod, so let's look at its YAML manifest to find where the data directory is configured.
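A bit of background first: the kubelet launches anything placed in its static-pod manifest directory. On kubeadm clusters that path is recorded in the kubelet config file; this sketch assumes the default kubeadm file locations:

```shell
# Sketch: locate the kubelet's static pod directory (kubeadm default paths assumed).
CFG=/var/lib/kubelet/config.yaml

if [ -f "$CFG" ]; then
    # On kubeadm installs this typically prints: staticPodPath: /etc/kubernetes/manifests
    grep staticPodPath "$CFG" || echo "staticPodPath not set in $CFG"
else
    echo "kubelet config not found at $CFG (not a kubeadm node?)"
fi
```

This is why the etcd pod here gets (re)started by the kubelet alone, with no kubectl involvement.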

┌──[root@vms81.liruilongs.github.io]-[~]
└─$cd /etc/kubernetes/manifests/
┌──[root@vms81.liruilongs.github.io]-[/etc/kubernetes/manifests]
└─$ls
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

- --data-dir=/var/lib/etcd

┌──[root@vms81.liruilongs.github.io]-[/etc/kubernetes/manifests]
└─$cat etcd.yaml | grep -e "--"
    - --advertise-client-urls=https://192.168.26.81:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://192.168.26.81:2380
    - --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.81:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.26.81:2380
    - --name=vms81.liruilongs.github.io
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

These are the corresponding data files. You can try to repair them, or, if you just need the cluster to start quickly, you can back up and then remove the snapshot and WAL files as shown below.

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd/member]
└─$tree
.
├── snap
│   ├── 0000000000000058-00000000019a0ba7.snap
│   ├── 0000000000000058-00000000019a32b8.snap
│   ├── 0000000000000058-00000000019a59c9.snap
│   ├── 0000000000000058-00000000019a80da.snap
│   ├── 0000000000000058-00000000019aa7eb.snap
│   └── db
└── wal
    ├── 0000000000000014-0000000000185aba.wal.broken
    ├── 0000000000000142-0000000001963c0e.wal
    ├── 0000000000000143-0000000001977bbe.wal
    ├── 0000000000000144-0000000001986aa6.wal
    ├── 0000000000000145-0000000001995ef6.wal
    ├── 0000000000000146-00000000019a544d.wal
    └── 1.tmp

2 directories, 13 files

Back up the data files first

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$ls
member
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$tar -cvf member.tar member/
member/
member/snap/
member/snap/db
member/snap/0000000000000058-00000000019a0ba7.snap
member/snap/0000000000000058-00000000019a32b8.snap
member/snap/0000000000000058-00000000019a59c9.snap
member/snap/0000000000000058-00000000019a80da.snap
member/snap/0000000000000058-00000000019aa7eb.snap
member/wal/
member/wal/0000000000000142-0000000001963c0e.wal
member/wal/0000000000000144-0000000001986aa6.wal
member/wal/0000000000000014-0000000000185aba.wal.broken
member/wal/0000000000000145-0000000001995ef6.wal
member/wal/0000000000000146-00000000019a544d.wal
member/wal/1.tmp
member/wal/0000000000000143-0000000001977bbe.wal
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$ls
member  member.tar
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$mv member.tar /tmp/
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$rm -rf member/snap/*.snap
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$rm -rf member/wal/*.wal
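The backup-then-delete sequence above can be folded into one guarded script, so the .snap and .wal files are only removed if the tar archive was actually created. A sketch; ETCD_DIR is parameterized rather than hard-coded so nothing is touched unless the directory really exists:

```shell
# Sketch of the backup-then-prune steps above: archive member/ first, delete
# the snapshot index and WAL files only if archiving succeeded.
ETCD_DIR="${ETCD_DIR:-/var/lib/etcd}"

if [ -d "$ETCD_DIR/member" ]; then
    if tar -cf /tmp/member.tar -C "$ETCD_DIR" member; then
        rm -f "$ETCD_DIR"/member/snap/*.snap "$ETCD_DIR"/member/wal/*.wal
        echo "pruned; backup kept at /tmp/member.tar"
    else
        echo "tar failed; nothing was deleted"
    fi
else
    echo "no member directory under $ETCD_DIR; nothing to do"
fi
```

Note that member/snap/db is deliberately left in place; only the snapshot index files and WALs are removed, matching the commands above.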

Restart the corresponding docker container, or restart the kubelet.

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
a3b97cb34d9b   004811815584                                        "etcd --advertise-cl…"   2 minutes ago   Exited (2) 2 minutes ago   k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 3 hours ago     Up 2 hours                 k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker start a3b97cb34d9b
a3b97cb34d9b
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
e1fc068247af   004811815584                                        "etcd --advertise-cl…"   3 seconds ago   Up 2 seconds               k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_46
a3b97cb34d9b   004811815584                                        "etcd --advertise-cl…"   3 minutes ago   Exited (2) 3 seconds ago   k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 3 hours ago     Up 2 hours                 k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7

Check node status

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$kubectl get nodes
NAME                          STATUS   ROLES    AGE   VERSION
vms155.liruilongs.github.io   Ready    <none>   76s   v1.22.2
vms81.liruilongs.github.io    Ready    <none>   76s   v1.22.2
vms82.liruilongs.github.io    Ready    <none>   76s   v1.22.2
vms83.liruilongs.github.io    Ready    <none>   76s   v1.22.2

List all pods currently in the cluster.

┌──[root@vms81.liruilongs.github.io]-[~/ansible/kubevirt]
└─$kubectl get pods -A
NAME                                                 READY   STATUS    RESTARTS         AGE
etcd-vms81.liruilongs.github.io                      1/1     Running   48 (3h35m ago)   3h53m
kube-apiserver-vms81.liruilongs.github.io            1/1     Running   48 (3h35m ago)   3h51m
kube-controller-manager-vms81.liruilongs.github.io   1/1     Running   17 (3h35m ago)   3h51m
kube-scheduler-vms81.liruilongs.github.io            1/1     Running   16 (3h35m ago)   3h52m

The network-related pods are all gone, and the cluster's DNS component never came up either, so the network has to be reconfigured, which is a pain. Strangely, when the network components are down, all nodes should normally show NotReady, yet here they report Ready, which feels off. I needed the cluster for experiments, so in the interest of time I simply reset it with kubeadm.

┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$kubectl apply -f calico.yaml

References


https://github.com/etcd-io/etcd/issues/11949