> 技术文档 > FATAL: terminating connection due to session timeout [1022502] (ar_odbc_stmt.c:4828)_fatal terminating connection

FATAL: terminating connection due to session timeout [1022502] (ar_odbc_stmt.c:4828)_fatal terminating connection


问题描述:

AWS DMS将Redshift作为目标端,出现了报错,日志里是如下报错显示:

00885557: 2023-08-21T10:32:15 [TARGET_APPLY ]I: Truncate table statement: TRUNCATE TABLE \"test\".\"test\" (cloud_imp.c:3088)00885557: 2023-08-21T10:32:15 [DMS_INTERNAL ]I: Failed (retcode -1) to execute statement: \'TRUNCATE TABLE \"test\".\"test\"\' (ar_odbc_stmt.c:138).....1.69261E+12,2023-08-21T10:32:15 [TARGET_APPLY ]E: Failed (retcode -1) to execute statement [1022502] (ar_odbc_stmt.c:4820)1.69261E+12,2023-08-21T10:32:15 [TARGET_APPLY ]E: RetCode: SQL_ERROR SqlState: 57P01 NativeError: 30 Message: [Amazon][Amazon Redshift] (30) Error occurred while trying to execute a query: [SQLState 57P01] FATAL: terminating connection due to session timeout [1022502] (ar_odbc_stmt.c:4828)1.69261E+12,2023-08-21T10:32:15 [TASK_MANAGER ]W: Table \'test\'.\'test\' (subtask 0 thread 1) is suspended (replicationtask.c:2550)

分析过程 及 解决方案:

1. 首先,报错中是执行了一个truncate table语句然后报错的。table error了,但是其实这个事件本身和truncate table语句没什么关系。就算执行一个别的DML也会发生这个报错。

2. 根据DMS文档[1], Amazon Redshift 的默认空闲会话超时时间为 4 小时。如果 DMS 复制任务中没有任何活动,Redshift 会在 4 小时后断开会话连接。 Errors can result from DMS being unable to connect and potentially needing to restart. As a workaround, set a SESSION TIMEOUT limit greater than 4 hours for the DMS replication user.

3. 报错日志中的这句话,就指向了超时相关:

1.69261E+12,2023-08-21T10:32:15 [TARGET_APPLY ]E: RetCode: SQL_ERROR SqlState: 57P01 NativeError: 30 Message: [Amazon][Amazon Redshift] (30) Error occurred while trying to execute a query: [SQLState 57P01] FATAL: terminating connection due to session timeout [1022502] (ar_odbc_stmt.c:4828) 

“FATAL: terminating connection due to session timeout” 的报错与文档中的4小时空闲会话会导致Redshift断开连接的限制相符。

4. 如果再深入查看,那么就查看日志中,Target_Apply或者Target_Load部分,上一条信息。看看上一次Redshfit目标端有DMS写入的时间,是不是恰好timestamp是4小时前(甚至更早)。

我们的示例:

00885623: 2023-08-21T01:27:44 [TARGET_LOAD ]I: handling COPY from S3 via IAM Role (cloud_imp.c:2271)00885625: 2023-08-21T01:27:45 [TARGET_LOAD ]I: Load finished for table \'metadata\'.\'sys_param\' (Id = 8). 20 rows received. 0 rows skipped. Volume transferred 15728. (streamcomponent.c:3838)00885625: 2023-08-21T01:27:45 [TARGET_LOAD ]I: handling COPY from S3 via IAM Role (cloud_imp.c:2271)

这两条Target log信息之间仅有源端信息。也就是说,DMS任务在启动以后,在UTC时间01:27结束了fullload,并完成了在Target Redshift的写入。而后在10:32开始执行truncate。1:27 - 10:32中间的相隔时间远超过4小时。这个时间差也验证了我们对于报错的分析。

5. 在AWS Redshift 中测试,一个空闲连结,在超过了 session timeout的设置的时长以后,如果再次运行指令,会出现如下报错:

dev=# select getdate();FATAL: terminating connection due to session timeout //----> 与我们的DMS报错相符SSL connection has been closed unexpectedly The connection to the server was lost. Attempting reset: Succeeded.

6. 根据文档[1], 我们建议将DMS user 在Redshift端的timeout时间延长。这样可以避免log中出现的session timeout的报错[2]:
ALTER USER SESSION TIMEOUT 86400;

---> 将dms使用的user的session timeoout设置为86400秒,也就是24小时。

需要注意的是,在运行了类似 ALTER USER  SESSION TIMEOUT 86400; 的指令以后,用DMS User 当前连结的用户session并不会生效,只有在运行了alter的时间之后新建立的session,才会采用新的session timeout设置。

参考文档:
[1] Redshift作为DMS目标:
https://docs.amazonaws.cn/dms/latest/userguide/CHAP_Target.Redshift.html 
[2] Redshift alter user 指令:
https://docs.amazonaws.cn/redshift/latest/dg/r_ALTER_USER.html