docker+prometheus+grafana+alertmanager: alert notifications via a DingTalk robot


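Did you know? In today's digital world, a server is like a hard-working employee that serves you around the clock, and the monitoring system is that employee's "health steward", keeping a constant eye on its condition. Today we'll look at how to use Prometheus to build an efficient, smart monitoring system that makes your servers' health visible at a glance!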

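[Related question 1]: Why do companies need a monitoring system?
In today's fast-paced IT environments, a server failure can interrupt the business in an instant. A reliable monitoring system catches anomalies early and helps avoid a potential disaster. For example, when a server's CPU usage climbs above 90%, the monitoring system raises a warning right away, like an emergency siren telling you to act!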

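[Related question 2]: How do you choose a monitoring solution that fits your company?
Prometheus, a "star" of open-source monitoring, is a popular first choice thanks to its feature set and flexibility. Combined with Grafana for data visualization and DingTalk for instant notifications, it gives you a complete, efficient monitoring stack. For example, when a server's disk usage exceeds 80%, Prometheus records it, Grafana shows it in an easy-to-read chart, and the DingTalk robot pushes a notification right away.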

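[Related question 3]: Is building a system like this really worth it?
The answer is yes! Automated monitoring and alerting can significantly reduce downtime and make operations more efficient. Traditional manual checks may require operations staff to be on duty 24 hours a day, whereas the Prometheus+Grafana+DingTalk combination monitors around the clock, 7×24. When the system detects that a node is down, it not only triggers an alert automatically but also provides detailed fault information so that operations staff can locate the problem quickly. This saves manpower and, more importantly, raises the company's overall level of operations.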

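If you want more detail or have questions of your own, feel free to leave a comment! Let's use technology to build smarter, more efficient operations together!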

Directory structure

[root@node1 ~]# tree prom
prom
├── docker-compose.yml   # docker-compose file
├── grafana              # Grafana data mount
├── prometheus_data      # Prometheus data mount
├── rules                # alert rule files
│   ├── cpu_over.yml
│   ├── disk_over.yml
│   ├── memory_over.yml
│   └── node_alived.yml
└── yml
    ├── alertmanager.yml # Alertmanager configuration
    ├── config.yml       # DingTalk robot configuration
    └── prometheus.yml   # Prometheus configuration

[root@node1 prom]# cat docker-compose.yml
version: "3.7"
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: "node-exporter"
    ports:
      - "9100:9100"
    restart: always
  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    restart: always
    ports:
      - '8080:8080'
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    restart: always
    volumes:
      - "./yml/prometheus.yml:/etc/prometheus/prometheus.yml"
      - "./prometheus_data:/prometheus"
      - "./rules:/etc/prometheus/rules"
  grafana:
    image: grafana/grafana
    container_name: "grafana"
    ports:
      - "3000:3000"
    restart: always
    volumes:
      - "./grafana:/var/lib/grafana"
  alertmanager:
    image: prom/alertmanager:latest
    restart: "always"
    ports:
      - 9093:9093
    container_name: "alertmanager"
    volumes:
      - "./yml/alertmanager.yml:/etc/alertmanager/alertmanager.yml"
  webhook:
    image: timonwong/prometheus-webhook-dingtalk
    restart: "always"
    ports:
      - 8060:8060
    container_name: "webhook"
    volumes:
      - "./yml/config.yml:/etc/prometheus-webhook-dingtalk/config.yml"

[root@node1 prom]# cat yml/prometheus.yml
# my global config
global:
  # Global Prometheus settings, such as the scrape interval and scrape timeout.
  scrape_interval: 1m      # scrape interval, default 1m
  evaluation_interval: 1m  # rule evaluation interval, default 1m
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
# Tells Prometheus which Alertmanager instance to send firing alerts to.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.10.10:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/rules/*.yml"   # alert rule files
#  - "cpu_over.yml"
#  - "disk_over.yml"
#  - "memory_over.yml"
#  - "node_alived.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
# List of scrape configurations
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "linux"
    static_configs:
      - targets: ["192.168.10.10:9100","192.168.10.10:8080","192.168.10.20:9100","192.168.10.20:8080"]

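[root@node1 prom]# cat yml/alertmanager.yml
global:
  resolve_timeout: 5m    # an alert that receives no update within this time is treated as resolved
route:
  receiver: webhook      # receiver to notify
  group_wait: 1m         # how long to wait before the first notification for a new group, so alerts in the same group go out together
  group_interval: 1m     # how long to wait before notifying about new alerts added to a group that was already notified
  repeat_interval: 1m    # interval between repeated notifications for the same alert; larger values cut down on duplicate notifications
  group_by: [alertname]  # label used to group alerts
receivers:               # list of notification receivers
- name: webhook
  webhook_configs:
  - url: http://192.168.10.10:8060/dingtalk/webhook1/send
    send_resolved: true

[root@node1 prom]# cat yml/config.yml
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=XXXXXX    # DingTalk robot webhook
    secret: SEC000000    # signing secret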
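To show what the secret in config.yml is for, here is a minimal sketch of the signing scheme that DingTalk documents for custom robots with the "sign" security setting: HMAC-SHA256 over "timestamp\nsecret", keyed with the secret, then Base64 and URL-encoded. The webhook container is expected to do this for you; the function name dingtalk_signed_url is hypothetical, and the token and secret are the placeholders from the config above.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def dingtalk_signed_url(webhook_url, secret, timestamp_ms=None):
    """Return webhook_url with DingTalk's timestamp and sign parameters appended."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # DingTalk expects milliseconds
    # The string to sign is "<timestamp>\n<secret>", keyed with the secret itself.
    string_to_sign = f"{timestamp_ms}\n{secret}".encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), string_to_sign, hashlib.sha256).digest()
    # Base64-encode the raw digest, then URL-encode it for use in the query string.
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"


if __name__ == "__main__":
    # Placeholder token and secret copied from config.yml; not real credentials.
    url = "https://oapi.dingtalk.com/robot/send?access_token=XXXXXX"
    print(dingtalk_signed_url(url, "SEC000000", timestamp_ms=1700000000000))
```

If notifications never arrive, comparing a URL built this way against the robot's security settings in DingTalk is one way to rule out a wrong secret.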

[root@node1 prom]# cat rules/rules.yml
groups:
- name: CPU alert rules
  rules:
  - alert: CPU usage alert
    # Range widened from 1m to 5m: with scrape_interval set to 1m, a 1m window
    # usually holds a single sample, and irate() needs at least two.
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 50
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage is spiking."
      description: "CPU usage is above 50% (current value: {{ $value }}%)"
- name: Disk usage alert rules
  rules:
  - alert: Disk usage alert
    expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 80
    for: 20m
    labels:
      severity: warning
    annotations:
      summary: "Disk partition usage is too high"
      description: "Partition usage is above 80% (current value: {{ $value }}%)"
- name: Memory alert rules
  rules:
  - alert: Memory usage alert
    expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 50
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "The server is running low on available memory."
      description: "Memory usage is above 50% (current value: {{ $value }}%)"
- name: Instance liveness alert rules
  rules:
  - alert: Instance liveness alert
    expr: up == 0
    for: 10s
    labels:
      user: prometheus
      severity: warning
    annotations:
      summary: "Host is down !!!"
      description: "This instance has been down for more than one minute."
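The arithmetic behind the three usage expressions can be sketched in plain Python to check a threshold by hand. This only mirrors the formulas in the rules; the function names and sample numbers are hypothetical, and the real evaluation also requires the condition to hold for the rule's `for` duration before the alert fires.

```python
GIB = 1024 ** 3


def cpu_usage_percent(idle_rate):
    """idle_rate: fraction of CPU time spent idle (0.0 to 1.0), averaged per instance."""
    return 100 - idle_rate * 100


def disk_usage_percent(free_bytes, size_bytes):
    """Mirrors: 100 - free / size * 100."""
    return 100 - free_bytes / size_bytes * 100


def memory_usage_percent(available_bytes, total_bytes):
    """Mirrors: (1 - MemAvailable / MemTotal) * 100."""
    return (1 - available_bytes / total_bytes) * 100


def exceeds(value, threshold):
    """The rules use a strict '>', so a value equal to the threshold does not match."""
    return value > threshold


if __name__ == "__main__":
    # Made-up sample readings, not measurements from a real host.
    print(cpu_usage_percent(0.25), exceeds(cpu_usage_percent(0.25), 50))
    print(disk_usage_percent(25 * GIB, 100 * GIB), exceeds(disk_usage_percent(25 * GIB, 100 * GIB), 80))
    print(memory_usage_percent(2 * GIB, 8 * GIB), exceeds(memory_usage_percent(2 * GIB, 8 * GIB), 50))
```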

Once the configuration is in place, run docker-compose up -d to start the containers.

http://localhost:8080 #cadvisor
http://localhost:8080/metrics #cadvisor metrics
http://localhost:9100/metrics #node-exporter metrics
http://localhost:9090 #prometheus
http://localhost:3000 #grafana
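Before opening these pages in a browser, a quick TCP probe can confirm that each published port is accepting connections. This is a minimal sketch using only the standard library; the port numbers come from the docker-compose.yml in this article, while the names SERVICES, port_open and check_all are hypothetical.

```python
import socket

# Host ports published in docker-compose.yml.
SERVICES = {
    "node-exporter": 9100,
    "cadvisor": 8080,
    "prometheus": 9090,
    "grafana": 3000,
    "alertmanager": 9093,
    "webhook": 8060,
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(host="localhost"):
    """Map each service name to True/False depending on whether its port answers."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}
```

Calling check_all() on the Docker host returns a dictionary such as service name to True or False. An open port only shows that something is listening, not that the service behind it is healthy.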

http://localhost:9090/alerts
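The same information that drives the instance liveness rule can be read from Prometheus's HTTP API with an instant query for `up`. The sketch below assumes the standard /api/v1/query endpoint and its JSON response shape; the helper names are hypothetical, and fetch_down_instances needs a running Prometheus to return anything.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def down_instances(response):
    """Return the sorted instance labels whose sample value is 0."""
    if response.get("status") != "success":
        raise ValueError("query failed")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the sample value arrives as a string
        if float(value) == 0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return sorted(down)


def fetch_down_instances(base_url="http://localhost:9090"):
    """Query `up` on a live Prometheus and list the targets that failed their scrape."""
    with urllib.request.urlopen(query_url(base_url, "up"), timeout=5) as resp:
        return down_instances(json.load(resp))
```

An empty list from fetch_down_instances() means every configured target was scraped successfully at query time.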

Result
