AI 加持:Amazon VPC Direct Connect 路由监控系统构建实践
AI 加持:Amazon VPC Direct Connect 路由监控系统构建实践
新用户可获得高达 200 美元的服务抵扣金
亚马逊云科技新用户可以免费使用亚马逊云科技免费套餐(Amazon Free Tier)。注册即可获得 100 美元的服务抵扣金,在探索关键亚马逊云科技服务时可以再额外获得最多 100 美元的服务抵扣金。使用免费计划试用亚马逊云科技服务,最长可达 6 个月,无需支付任何费用,除非您选择付费计划。付费计划允许您扩展运营并获得超过 150 项亚马逊云科技服务的访问权限。
前言
企业云架构中,VPC 与本地数据中心通过 Direct Connect 建立的网络连接稳定性至关重要,传统路由监控常面临异常难察觉、人工优化效率低等问题,本文聚焦如何结合 AI 技术与亚马逊云科技原生服务,构建一套能实时追踪路由状态、智能分析路由聚合并自动预警的监控系统,助力企业降低运维成本,保障网络通信稳定
技术架构
该架构以 Amazon Cloud 为基础,本地数据中心经 Direct Connect Gateway 与 VPC 借 BGP 传播路由,EventBridge 定时触发 Lambda Function,调用 EC2 API 查询 VPC 路由表,超阈值时经 SNS Topic、邮件订阅发预警,CloudWatch Logs 记录日志,构建了 “本地 - 云端路由交互 + 定时监控 + 智能预警 + 日志留存” 的 VPC Direct Connect 路由监控体系,保障网络路由稳定与异常及时响应
- Amazon EventBridge:按预设间隔触发监控任务,支持灵活配置频率,企业可选每半小时 / 1 小时监控,平衡问题响应效率与成本
- Amazon Lambda:作为核心监控逻辑载体,查询 VPC 路由表、识别 Direct Connect 传播路由并完成统计分析,实现路由状态的核心检查
- Amazon SNS:接收监控异常信息并分发预警,通过邮件推送含摘要、路由详情及优化建议的通知,助力运维快速响应
- Amazon IAM:遵循最小权限原则,仅授予 Lambda 查询路由表和发送 SNS 通知的必要权限,最大化降低监控过程的安全风险
前提准备:亚马逊云科技注册流程
Step.1 登录官网
登录亚马逊云科技官网,填写邮箱和账户名称完成验证(注册亚马逊云科技填写 root 邮箱、账户名,验证邮件地址,查收邮件填验证码验证,验证通过后设 root 密码并确认)
Step.2 选择账户计划
选择账户计划,两种计划,按需选\"选择免费计划 / 选择付费计划\"继续流程
- 免费(6 个月,适合学习实验,含$200抵扣金、限精选服务,超限额或到期可升级付费,否则关停)
- 付费(适配生产,同享$200 抵扣金,可体验全部服务,抵扣金覆盖广,用完按即用即付计费)
Step.3 填写联系人信息
填写联系人信息(选择使用场景,填联系人全名、电话,选择所在国家地区,完善地址、邮政编码,勾选同意客户协议,点击继续 进入下一步)
Step.4 绑定信息
绑定相关信息,选择国家地区,点击\"Send code\"收验证码填写,勾选同意协议后,点击\"验证并继续\"进入下一步
Step.5 电话验证
电话验证填写真实手机号,选择验证方式,完成安全检查,若选语音,网页同步显 4 位数字码,接来电后输入信息,再填收到的验证信息,遇问题超 10 分钟收不到可返回重试。
Step.6 售后支持
售后支持:免费计划自动获基本支持,付费计划需选支持计划(各计划都含客户服务,可访问文档白皮书,按需选后点 “完成注册”,若需企业级支持可了解付费升级选项,确认选好即可完成整个注册流程 )
Amazon VPC Direct Connect 路由监控系统
1、下载 CloudFormation 内容到本地,并保存为 yaml 格式
AWSTemplateFormatVersion: \'2010-09-09\'Description: \'VPC Direct Connect Route Monitor with AI-powered route aggregation analysis\'Parameters: VpcId: Type: AWS::EC2::VPC::Id Description: Select the VPC to monitor for DX propagated routes MaxRoutes: Type: Number Default: 100 MinValue: 1 MaxValue: 1000 Description: Maximum number of routes limit (default 100) WarningThreshold1: Type: Number Default: 60 MinValue: 1 MaxValue: 100 Description: First warning threshold percentage (default 60%) WarningThreshold2: Type: Number Default: 80 MinValue: 1 MaxValue: 100 Description: Second warning threshold percentage (default 80%) MonitoringFrequency: Type: String Default: \'1 hour\' AllowedValues: - \'5 minutes\' - \'10 minutes\' - \'30 minutes\' - \'1 hour\' - \'1 day\' Description: How often to check the route count (default 1 hour) NotificationEmail: Type: String Description: Email address to receive alert notifications AllowedPattern: \'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$\' ConstraintDescription: Please enter a valid email address EnableAIAnalysis: Type: String Default: \'false\' AllowedValues: - \'true\' - \'false\' Description: Enable AI-powered route aggregation analysis using Amazon Bedrock Nova Lite BedrockRegion: Type: String Default: \'us-east-1\' AllowedValues: - \'us-east-1\' - \'us-west-2\' - \'eu-west-1\' - \'ap-southeast-1\' - \'ap-northeast-1\' - \'eu-north-1\' Description: AWS region where Bedrock is available (default us-east-1)Conditions: Is5Minutes: !Equals [!Ref MonitoringFrequency, \'5 minutes\'] Is10Minutes: !Equals [!Ref MonitoringFrequency, \'10 minutes\'] Is30Minutes: !Equals [!Ref MonitoringFrequency, \'30 minutes\'] Is1Day: !Equals [!Ref MonitoringFrequency, \'1 day\'] AIAnalysisEnabled: !Equals [!Ref EnableAIAnalysis, \'true\']Resources: AlertTopic: Type: AWS::SNS::Topic Properties: TopicName: !Sub \'${AWS::StackName}-vpc-dx-route-alerts\' DisplayName: \'VPC DX Route Monitor Alerts\' EmailSubscription: Type: AWS::SNS::Subscription Properties: TopicArn: !Ref AlertTopic Protocol: email Endpoint: !Ref NotificationEmail LambdaExecutionRole: Type: AWS::IAM::Role Properties: AssumeRolePolicyDocument: Version: \'2012-10-17\' Statement: - Effect: Allow Principal: Service: lambda.amazonaws.com Action: sts:AssumeRole ManagedPolicyArns: - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole Policies: - PolicyName: VPCRouteMonitoringPolicy PolicyDocument: Version: \'2012-10-17\' Statement: - Effect: Allow Action: - ec2:DescribeRouteTables - ec2:DescribeVpcs Resource: \'*\' - Effect: Allow Action: - sns:Publish Resource: !Ref AlertTopic - !If - AIAnalysisEnabled - PolicyName: BedrockAccessPolicy PolicyDocument: Version: \'2012-10-17\' Statement: - Effect: Allow Action: - bedrock:InvokeModel Resource: !Sub \'arn:aws:bedrock:${BedrockRegion}::foundation-model/amazon.nova-lite-v1:0\' - !Ref \'AWS::NoValue\' RouteMonitorFunction: Type: AWS::Lambda::Function Properties: FunctionName: !Sub \'${AWS::StackName}-vpc-dx-route-monitor\' Runtime: python3.9 Handler: index.lambda_handler Role: !GetAtt LambdaExecutionRole.Arn Timeout: 600 MemorySize: 512 Description: \'Monitor VPC DX propagated routes with optional AI analysis\' Environment: Variables: VPC_ID: !Ref VpcId MAX_ROUTES: !Ref MaxRoutes WARNING_THRESHOLD_1: !Ref WarningThreshold1 WARNING_THRESHOLD_2: !Ref WarningThreshold2 SNS_TOPIC_ARN: !Ref AlertTopic ENABLE_AI_ANALYSIS: !Ref EnableAIAnalysis BEDROCK_REGION: !Ref BedrockRegion Code: ZipFile: | #!/usr/bin/env python3 \"\"\" AWS VPC DX路由监控Lambda函数 - AI增强版本 支持通过Amazon Bedrock Nova Lite进行路由聚合分析 \"\"\" import boto3 import json import os from datetime import datetime from typing import Dict, List, Any, Set, Tuple, Optional def lambda_handler(event, context): \"\"\"Lambda主函数\"\"\" try: vpc_id = os.environ.get(\'VPC_ID\') max_routes = int(os.environ.get(\'MAX_ROUTES\', \'100\')) warning_threshold_1 = int(os.environ.get(\'WARNING_THRESHOLD_1\', \'60\')) warning_threshold_2 = int(os.environ.get(\'WARNING_THRESHOLD_2\', \'80\')) sns_topic_arn = os.environ.get(\'SNS_TOPIC_ARN\') enable_ai_analysis = os.environ.get(\'ENABLE_AI_ANALYSIS\', \'false\').lower() == \'true\' bedrock_region = os.environ.get(\'BEDROCK_REGION\', \'us-east-1\') if not vpc_id or not sns_topic_arn:raise ValueError(\"缺少必要的环境变量\") # 查询DX传播的路由 route_info = query_dx_propagated_routes(vpc_id) if route_info is None:send_error_notification(sns_topic_arn, vpc_id, \"查询路由失败\")return {\'statusCode\': 500, \'body\': json.dumps({\'error\': \'查询路由失败\'})} # 计算使用百分比 current_routes = route_info[\'unique_dx_routes\'] usage_percentage = (current_routes / max_routes) * 100 # AI分析(如果启用) ai_analysis = None if enable_ai_analysis and current_routes > 0:try: ai_analysis = analyze_routes_with_ai(route_info[\'unique_routes\'], bedrock_region) print(\"AI分析完成\")except Exception as e: print(f\"AI分析失败: {e}\") ai_analysis = {\"error\": f\"AI分析失败: {str(e)}\"} # 检查预警 should_alert = False alert_level = None if usage_percentage >= warning_threshold_2:should_alert = Truealert_level = \'HIGH\' elif usage_percentage >= warning_threshold_1:should_alert = Truealert_level = \'MEDIUM\' # 发送预警 if should_alert:send_alert_notification( sns_topic_arn, vpc_id, current_routes, max_routes, usage_percentage, alert_level, route_info[\'unique_routes\'], ai_analysis) result = {\'vpc_id\': vpc_id,\'unique_dx_routes\': current_routes,\'max_routes\': max_routes,\'usage_percentage\': round(usage_percentage, 2),\'alert_sent\': should_alert,\'alert_level\': alert_level,\'ai_analysis_enabled\': enable_ai_analysis,\'ai_analysis_status\': \'completed\' if ai_analysis and \'error\' not in ai_analysis else \'failed\' if ai_analysis else \'disabled\',\'timestamp\': datetime.now().isoformat() } print(f\"监控结果: {json.dumps(result, ensure_ascii=False)}\") return {\'statusCode\': 200, \'body\': json.dumps(result, ensure_ascii=False)} except Exception as e: error_msg = f\"Lambda执行失败: {str(e)}\" print(error_msg) try:if \'sns_topic_arn\' in locals(): send_error_notification(sns_topic_arn, vpc_id if \'vpc_id\' in locals() else \'Unknown\', str(e)) except:pass return {\'statusCode\': 500, \'body\': json.dumps({\'error\': error_msg})} def query_dx_propagated_routes(vpc_id: str) -> Dict[str, Any]: \"\"\"查询VPC中通过DX传播的路由条目\"\"\" try: ec2_client = boto3.client(\'ec2\') response = ec2_client.describe_route_tables(Filters=[{\'Name\': \'vpc-id\', \'Values\': [vpc_id]}] ) route_tables = response.get(\'RouteTables\', []) if not route_tables:print(f\"未找到VPC {vpc_id} 的路由表\")return None if response.get(\'NextToken\'):print(\"警告: 检测到分页,可能需要升级到完整版本\") # 使用Set去重 unique_dx_routes: Set[Tuple[str, str, str]] = set() unique_routes_list = [] for rt in route_tables:rt_id = rt[\'RouteTableId\']routes = rt.get(\'Routes\', [])for route in routes: if route.get(\'Origin\') == \'EnableVgwRoutePropagation\': destination = route.get(\'DestinationCidrBlock\') or route.get(\'DestinationIpv6CidrBlock\', \'未知\') target_type, target_value = get_route_target(route) route_key = (destination, target_type, target_value) if route_key not in unique_dx_routes: unique_dx_routes.add(route_key) unique_routes_list.append({ \'route_table_id\': rt_id, \'destination\': destination, \'target_type\': target_type, \'target_value\': target_value, \'state\': route.get(\'State\', \'未知\') }) return {\'unique_dx_routes\': len(unique_dx_routes),\'unique_routes\': unique_routes_list } except Exception as e: print(f\"查询DX路由失败: {e}\") return None def analyze_routes_with_ai(routes: List[Dict], bedrock_region: str) -> Optional[Dict]: \"\"\"使用Amazon Bedrock Nova Lite分析路由聚合\"\"\" try: bedrock_client = boto3.client(\'bedrock-runtime\', region_name=bedrock_region) # 准备路由数据 route_data = [] for route in routes:route_data.append({ \'destination\': route[\'destination\'], \'target_type\': route[\'target_type\'], \'target_value\': route[\'target_value\'], \'state\': route[\'state\']}) # 构建AI提示 prompt = f\"\"\"你是一个AWS网络专家,请分析以下Direct Connect传播的路由,并提供路由聚合建议。 当前路由列表(共{len(routes)}条): {json.dumps(route_data, indent=2, ensure_ascii=False)} 请分析并提供以下内容: 1. 路由聚合机会分析 2. 具体的CIDR聚合建议 3. 预期的路由数量减少 4. 实施建议和注意事项 5. 风险评估 请用中文回答,格式要清晰易读。\"\"\" # 调用Nova Lite request_body = {\"messages\": [ { \"role\": \"user\", \"content\": [ { \"text\": prompt } ] }],\"inferenceConfig\": { \"max_new_tokens\": 4000, \"temperature\": 0.1} } response = bedrock_client.invoke_model(modelId=\'amazon.nova-lite-v1:0\',body=json.dumps(request_body) ) response_body = json.loads(response[\'body\'].read()) ai_analysis = response_body[\'output\'][\'message\'][\'content\'][0][\'text\'] return {\'analysis\': ai_analysis,\'route_count\': len(routes),\'analysis_timestamp\': datetime.now().isoformat(),\'model_used\': \'Amazon Nova Lite\' } except Exception as e: print(f\"AI分析失败: {e}\") return {\"error\": str(e)} def get_route_target(route: Dict) -> tuple: \"\"\"获取路由目标类型和值\"\"\" if \'VirtualPrivateGatewayId\' in route: return \'vpn-gateway\', route[\'VirtualPrivateGatewayId\'] elif \'TransitGatewayId\' in route: return \'transit-gateway\', route[\'TransitGatewayId\'] elif \'DirectConnectGatewayId\' in route: return \'dx-gateway\', route[\'DirectConnectGatewayId\'] elif \'GatewayId\' in route: return \'gateway\', route[\'GatewayId\'] elif \'NatGatewayId\' in route: return \'nat-gateway\', route[\'NatGatewayId\'] elif \'NetworkInterfaceId\' in route: return \'network-interface\', route[\'NetworkInterfaceId\'] elif \'InstanceId\' in route: return \'instance\', route[\'InstanceId\'] else: return \'unknown\', \'unknown\' def send_alert_notification(sns_topic_arn: str, vpc_id: str, current_routes: int, max_routes: int, usage_percentage: float, alert_level: str, routes: List[Dict], ai_analysis: Optional[Dict] = None): \"\"\"发送预警通知(包含AI分析)\"\"\" try: sns_client = boto3.client(\'sns\') subject = f\"🚨 VPC DX路由预警 - {alert_level} 级别 ({usage_percentage:.1f}%)\" if ai_analysis and \'error\' not in ai_analysis:subject += \" [含AI分析]\" message_lines = [f\"VPC Direct Connect 路由监控预警\",f\"\",f\"📊 监控摘要:\",f\" VPC ID: {vpc_id}\",f\" 当前DX传播路由数: {current_routes}\",f\" 最大路由限制: {max_routes}\",f\" 使用百分比: {usage_percentage:.2f}%\",f\" 预警级别: {alert_level}\",f\" 检查时间: {datetime.now().strftime(\'%Y-%m-%d %H:%M:%S UTC\')}\",f\"\",f\"🔍 DX传播路由详情:\" ] if routes:message_lines.append(f\" {\'目标网段\':<18} {\'目标类型\':<15} {\'目标值\':<25}\")message_lines.append(f\" {\'-\'*18} {\'-\'*15} {\'-\'*25}\")for route in routes[:15]: # 限制显示数量,为AI分析留空间 message_lines.append( f\" {route[\'destination\']:<18} {route[\'target_type\']:<15} {route[\'target_value\']:<25}\" )if len(routes) > 15: message_lines.append(f\" ... 还有 {len(routes) - 15} 条路由未显示\") # 添加AI分析结果 if ai_analysis:message_lines.extend([ f\"\", f\"🤖 AI路由聚合分析 (Amazon Nova Lite):\", f\"{\'=\'*60}\"])if \'error\' in ai_analysis: message_lines.append(f\"AI分析失败: {ai_analysis[\'error\']}\")else: # 将AI分析结果按行分割并添加适当的缩进 analysis_lines = ai_analysis[\'analysis\'].split(\'\\n\') for line in analysis_lines: if line.strip(): message_lines.append(f\"{line}\") else: message_lines.append(\"\") message_lines.extend([ f\"\", f\"分析时间: {ai_analysis.get(\'analysis_timestamp\', \'Unknown\')}\", f\"使用模型: {ai_analysis.get(\'model_used\', \'Unknown\')}\" ]) message_lines.extend([f\"\",f\"⚠️ 建议操作:\",f\" - 检查是否有不必要的路由传播\",f\" - 考虑优化路由聚合\",f\" - 如需增加路由限制,请联系AWS支持\" ]) if ai_analysis and \'error\' not in ai_analysis:message_lines.append(f\" - 参考上述AI分析建议进行路由优化\") message_lines.append(f\"\\n此消息由AWS Lambda自动生成\") sns_client.publish(TopicArn=sns_topic_arn,Subject=subject,Message=\"\\n\".join(message_lines) ) print(\"预警通知已发送\") except Exception as e: print(f\"发送预警通知失败: {e}\") def send_error_notification(sns_topic_arn: str, vpc_id: str, error_message: str): \"\"\"发送错误通知\"\"\" try: sns_client = boto3.client(\'sns\') subject = f\"❌ VPC DX路由监控错误 - {vpc_id}\" message = f\"\"\"VPC Direct Connect 路由监控执行错误 错误信息: VPC ID: {vpc_id} 错误消息: {error_message} 发生时间: {datetime.now().strftime(\'%Y-%m-%d %H:%M:%S UTC\')} 请检查Lambda函数配置和权限设置。\"\"\" sns_client.publish(TopicArn=sns_topic_arn,Subject=subject,Message=message ) except Exception as e: print(f\"发送错误通知失败: {e}\") ScheduleRule: Type: AWS::Events::Rule Properties: Name: !Sub \'${AWS::StackName}-vpc-dx-route-monitor-schedule\' Description: \'Schedule trigger for VPC DX route monitoring\' ScheduleExpression: !If - Is5Minutes - \'rate(5 minutes)\' - !If - Is10Minutes - \'rate(10 minutes)\' - !If - Is30Minutes - \'rate(30 minutes)\' - !If - Is1Day - \'rate(1 day)\' - \'rate(1 hour)\' State: ENABLED Targets: - Arn: !GetAtt RouteMonitorFunction.Arn Id: \'RouteMonitorTarget\' LambdaInvokePermission: Type: AWS::Lambda::Permission Properties: FunctionName: !Ref RouteMonitorFunction Action: lambda:InvokeFunction Principal: events.amazonaws.com SourceArn: !GetAtt ScheduleRule.Arn LambdaLogGroup: Type: AWS::Logs::LogGroup Properties: LogGroupName: !Sub \'/aws/lambda/${RouteMonitorFunction}\' RetentionInDays: 14Outputs: LambdaFunctionName: Description: \'Lambda function name for VPC DX route monitoring\' Value: !Ref RouteMonitorFunction SNSTopicArn: Description: \'SNS topic ARN for alert notifications\' Value: !Ref AlertTopic MonitoredVPC: Description: \'VPC ID being monitored\' Value: !Ref VpcId AIAnalysisEnabled: Description: \'Whether AI analysis is enabled\' Value: !Ref EnableAIAnalysis BedrockRegion: Description: \'Bedrock region for AI analysis\' Value: !Ref BedrockRegion Condition: AIAnalysisEnabled
2、网站控制台 CloudFormation 功能,上传刚才创建的 CloudFormation.yaml 文件
3、输入监控必须项目,填写预警所需邮箱,选择已经学习 DX 路由的 VPC,并点击下一步
4、勾选我确认,同意系统所需最小授权,并点击下一步完成堆栈部署
5、部署完成后,系统会自动开始工作,通过大模型优化
Amazon VPC Direct Connect 介绍
Amazon VPC Direct Connect 是亚马逊云科技提供的专用网络服务,通过物理专线或合作伙伴网络建立本地数据中心与 Amazon VPC 之间的私有连接,替代公共互联网,为混合云架构提供高速、稳定、安全的网络通道
- 低延迟高稳定:绕过公共互联网,减少网络抖动和延迟波动,保障实时业务(如金融交易、数据同步)的连续性
- 安全与成本优化:私有链路降低数据传输风险,固定带宽计费模式相比公网高频传输更节省长期成本
- 灵活扩展与集成:支持 1-100 Gbps 带宽按需扩展,可连接多 VPC、跨区域资源及亚马逊云科技服务,适配复杂混合云架构
总结
本文介绍的 AI 加持型 Amazon VPC Direct Connect 路由监控系统,通过 EventBridge 定时触发、Lambda 核心分析、SNS 预警通知的无服务器架构,实现路由状态实时监控与异常预警,并集成 Amazon Bedrock 大模型提供智能路由优化建议。借助 CloudFormation 自动化部署,兼顾安全与灵活性,有效降低运维成本,保障混合云网络连接的稳定可靠。