oracle to openGauss: 迁移后中间件socket closed，这锅DB不背

有一个项目从oracle迁移到了opengauss（MogDB发行版）后，有部分应用在运行一段时间后会超时，日志中一些Socket closed错误，执行的是从数据库中unload一些查询数据离线存储，常见的问题有网络防火墙，或有一些timeout配置，或网络闪断等，逐一排除，当然在出问题时，应用厂家可能出于责任原因并不会坦诚，变更的是DB，会把怀疑方向指向DB, 但最终确认是中间件配置问题，这里简单记录一下.

错误日志

xxx socket is close; send urgent packt failed, detail: socket closed. An I/O error occured while sending to the backend.detail:socket closed;

因为应用是个tomcat可以独立运行，我们把该应用迁移到了db server上，之前也确认没有过firewall , 是时应用端和DB端的OS 层有配置keepalive参数增加网络包探测，但问题依旧，放到一台服务器排除网络问题，后问题依旧。经过反复测试发现是只要超过10分钟就会出错。

查询DB 参数

openGauss=# \! sh show timeout
 archive_timeout                        | 0                  | Forces a switch to the next xlog file if a new file has not been started within N seconds.
 authentication_timeout                 | 1min               | Sets the maximum allowed time to complete client authentication.
 autoanalyze_timeout                    | 5min               | Sets the timeout for auto-analyze action.
 basebackup_timeout                     | 10min              | Sets the timeout in seconds for a reponse from gs_basebackup.
 checkpoint_timeout                     | 15min              | Sets the maximum time between automatic WAL checkpoints.
 checkpoint_wait_timeout                | 1min               | Sets the maximum wait timeout for checkpointer to start.
 dcf_connect_timeout                    | 60000              | Sets the connect timeout of local DCF node.
 dcf_election_timeout                   | 3                  | Sets the election timeout of local DCF node.
 dcf_socket_timeout                     | 5000               | Sets the socket timeout of local DCF node.
 deadlock_timeout                       | 1s                 | Sets the time to wait on a lock before checking for deadlock.
 fault_mon_timeout                      | 5min               | how many miniutes to monitor lwlock. 0 will disable that
 gpc_clean_timeout                      | 30min              | Set the maximum allowed duration of any unused global plancache.
 gs_clean_timeout                       | 1min               | Sets the timeout to call gs_clean.
 idle_in_transaction_session_timeout    | 0                  | Sets the maximum allowed idle time between queries, when in a transaction.
 incremental_checkpoint_timeout         | 1min               | Sets the maximum time between automatic WAL checkpoints.
 lockwait_timeout                       | 20min              | Sets the max time to wait on a lock acquire.
 logical_sender_timeout                 | 30s                | Sets the maximum time to wait for logical replication.
 partition_lock_upgrade_timeout         | 1800               | Sets the timeout for partition lock upgrade, in seconds
 pldebugger_timeout                     | 15min              | Sets the receive timeout (s) of pldebugger.
 session_timeout                        | 10min              | Set the maximum allowed duration of any unused session.
 statement_timeout                      | 0                  | Sets the maximum allowed duration of any statement.
 tcp_user_timeout                       | 0                  | Maximum timeout of TCP retransmits.
 update_lockwait_timeout                | 2min               | Sets the max time to wait on a lock acquire when concurrently update same tuple.
 wal_flush_timeout                      | 2                  | set timeout when iterator table entry.
 wal_receiver_connect_timeout           | 2s                 | Sets the maximum wait time to connect master.
 wal_receiver_timeout                   | 6s                 | Sets the maximum wait time to receive data from master.
 wal_sender_timeout                     | 6s                 | Sets the maximum time to wait for WAL replication.
 wdr_snapshot_query_timeout             | 100s               | Sets the timeout for wdr snapshot query, in seconds

note:
这里只时演示方法，并非生产

检查OS 参数

sysctl -a|egrep -i 'tcp|timeout'|egrep -w '600|10|600000'

没有相关参数，我甚至还想写段java 代码测试是否有自动中断现象。

-- jdbc_conn_og.java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class jdbc_conn_og {

public static Connection getConnect(String username, String passwd)
    {
        //驱动类。
        String driver = "org.opengauss.Driver";
        //数据库连接描述符。
        String sourceURL = "jdbc:opengauss://192.168.56.**:6432/anbob";
        Connection conn = null;

        try
        {
            //加载驱动。
            Class.forName(driver);
        }
        catch( Exception e )
        {
            e.printStackTrace();
            return null;
        }

        try
        {
             //创建连接。
            conn = DriverManager.getConnection(sourceURL, username, passwd);
            System.out.println("Connection succeed!");
        }
        catch(Exception e)
        {
            e.printStackTrace();
            return null;
        }

        return conn;
    };

  public static void main(String[] args) {

         Connection conn = getConnect("anbob","anbob_****");

         long start = System.currentTimeMillis();
     try {
         Thread.sleep(700000);
     } catch (InterruptedException e) {
         e.printStackTrace();
     }
     long end = System.currentTimeMillis();
     System.out.println(end-start + "毫秒过去了");

         try{
        if (conn.isClosed()) {
                System.out.println("Connection is closed.");
            } else {
                System.out.println("Connection is still open.");
            }
         } catch (SQLException e) {
         System.out.println(e.getMessage());
     }

  }
}

并不存在数据库SESSION 中断操作，后来确认socket closed还并不是session closed, 只能要到应用的所有日志，从其中找到了突破口。

 WARNING  #0001 was active for 603925 milliseconds and has been removed automaticaly. The Thread responsible was named ‘zk_scheduler_worker-1′, 
but the last SQL it performed is unknown because the trace property is .....

从日志中发现了这类WARNING时间确实都在600s被自动removed, 根据这错误找到了是proxool tomcat中的连接池配置，之前应用一直说使用的jdbc直连并未使用connect pool， proxool有一个参数加maximum-active-time，我们看一下官方网站解释

maximum-active-time:
If the housekeeper comes across a thread that has been active for longer than this then it will kill it. So make sure you set this to a number bigger than your slowest expected response! Default is 5 minutes.

一个线程持有一个连接的最长时间，而不管这个连接是否处于 active 状态, 并且如果线程的持有时间超过这个时间的之后会自动清除掉这个连接. 默认为5分钟，单位是ms，把参数提供给应用，确认当前参数在中间件配置的是600s，调整后不再提示该类错误。

显示这个问题与数据库无关，还是那句在Troubleshooting时“Do not believe anyone unless you see.” 但为什么之前一直没有出问题呢？后来怀疑1是刚长线测试的数据尽可能多，2是当前这套库的本地盘远低于过去oracle的共享存储吞吐量。