STNS is not responding

STNS stopped working, so users could not log in.

Timeline

2020/11/22

15:47 (JST) Incident confirmed

Screenshots and logs

Checking the connections on port 1104 with lsof showed more than 1,000 of them.

sudo lsof -i:1104
(omitted)
stns    11405 root  989u  IPv6 8231237      0t0  TCP base-j:1104->192.168.100.136:37008 (ESTABLISHED)
stns    11405 root  990u  IPv6 8231239      0t0  TCP base-j:1104->192.168.100.138:33948 (ESTABLISHED)
stns    11405 root  991u  IPv6 8231241      0t0  TCP base-j:1104->192.168.100.97:47994 (ESTABLISHED)
stns    11405 root  992u  IPv6 8231243      0t0  TCP base-j:1104->192.168.100.166:40108 (ESTABLISHED)
stns    11405 root  993u  IPv6 8231245      0t0  TCP base-j:1104->192.168.100.211:56914 (ESTABLISHED)
stns    11405 root  994u  IPv6 8231247      0t0  TCP base-j:1104->192.168.100.137:42756 (ESTABLISHED)
stns    11405 root  995u  IPv6 8231249      0t0  TCP base-j:1104->192.168.100.135:34094 (ESTABLISHED)
stns    11405 root  996u  IPv6 8231251      0t0  TCP base-j:1104->192.168.100.90:37546 (ESTABLISHED)
stns    11405 root  997u  IPv6 8232107      0t0  TCP base-j:1104->192.168.100.3:48922 (ESTABLISHED)
stns    11405 root  998u  IPv6 8232295      0t0  TCP base-j:1104->192.168.100.211:59072 (ESTABLISHED)
stns    11405 root  999u  IPv6 8232538      0t0  TCP base-j:1104->192.168.100.211:59756 (ESTABLISHED)
stns    11405 root 1000u  IPv6 8232648      0t0  TCP base-j:1104->192.168.100.3:48944 (ESTABLISHED)
stns    11405 root 1001u  IPv6 8233718      0t0  TCP base-j:1104->192.168.100.3:48990 (ESTABLISHED)
stns    11405 root 1002u  IPv6 8233720      0t0  TCP base-j:1104->192.168.100.3:48992 (ESTABLISHED)
stns    11405 root 1003u  IPv6 8235320      0t0  TCP base-j:1104->192.168.100.3:49064 (ESTABLISHED)
stns    11405 root 1004u  IPv6 8235322      0t0  TCP base-j:1104->192.168.100.3:49066 (ESTABLISHED)
stns    11405 root 1005u  IPv6 8235403      0t0  TCP base-j:1104->192.168.100.211:37756 (ESTABLISHED)
stns    11405 root 1006u  IPv6 8236077      0t0  TCP base-j:1104->192.168.100.211:39226 (ESTABLISHED)
stns    11405 root 1007u  IPv6 8236079      0t0  TCP base-j:1104->192.168.100.211:39228 (ESTABLISHED)
stns    11405 root 1008u  IPv6 8236081      0t0  TCP base-j:1104->192.168.100.211:39230 (ESTABLISHED)
stns    11405 root 1009u  IPv6 8237371      0t0  TCP base-j:1104->192.168.100.3:49142 (ESTABLISHED)
stns    11405 root 1010u  IPv6 8237373      0t0  TCP base-j:1104->192.168.100.3:49144 (ESTABLISHED)
stns    11405 root 1011u  IPv6 8238121      0t0  TCP base-j:1104->192.168.100.135:34768 (ESTABLISHED)
stns    11405 root 1012u  IPv6 8238123      0t0  TCP base-j:1104->192.168.100.160:39686 (ESTABLISHED)
stns    11405 root 1013u  IPv6 8238125      0t0  TCP base-j:1104->192.168.100.204:59974 (ESTABLISHED)
stns    11405 root 1014u  IPv6 8238127      0t0  TCP base-j:1104->192.168.100.214:50068 (ESTABLISHED)
stns    11405 root 1015u  IPv6 8238129      0t0  TCP base-j:1104->192.168.100.137:43492 (ESTABLISHED)
stns    11405 root 1016u  IPv6 8238131      0t0  TCP base-j:1104->192.168.100.161:39802 (ESTABLISHED)
stns    11405 root 1017u  IPv6 8238133      0t0  TCP base-j:1104->192.168.100.71:33770 (ESTABLISHED)
stns    11405 root 1018u  IPv6 8238164      0t0  TCP base-j:1104->192.168.100.6:37758 (ESTABLISHED)
stns    11405 root 1019u  IPv6 8238166      0t0  TCP base-j:1104->192.168.100.152:46948 (ESTABLISHED)
stns    11405 root 1020u  IPv6 8238168      0t0  TCP base-j:1104->192.168.100.136:37724 (ESTABLISHED)
stns    11405 root 1021u  IPv6 8238170      0t0  TCP base-j:1104->192.168.100.138:34678 (ESTABLISHED)
stns    11405 root 1022u  IPv6 8238172      0t0  TCP base-j:1104->192.168.100.97:57412 (ESTABLISHED)
stns    11405 root 1023u  IPv6 8238174      0t0  TCP base-j:1104->192.168.100.166:54974 (ESTABLISHED)
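To see whether a few clients hold most of these connections, lsof output like the listing above can be grouped by remote address. This is a minimal sketch: count_by_client is a hypothetical helper name introduced here for illustration, and it assumes the output format shown above, with the connection pair in the second-to-last column.

```shell
# Count ESTABLISHED connections per remote host from lsof-style output on stdin.
# Prints "<count> <remote host>" lines, busiest client first.
count_by_client() {
  awk '/ESTABLISHED/ {
         split($(NF-1), pair, "->")      # pair[2] is "remote:port"
         sub(/:[0-9]+$/, "", pair[2])    # drop the trailing port
         n[pair[2]]++
       }
       END { for (h in n) print n[h], h }' | sort -rn
}

# Example on the affected host (-n and -P keep addresses and ports numeric):
#   sudo lsof -nP -i:1104 | count_by_client | head
```

A client that appears far more often than the others is a reasonable first place to look for connections that are never closed.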

Next, check the file descriptor limit of the process itself. The Soft Limit for Max open files is 1024.

sudo cat /proc/11405/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             3766                 3766                 processes
Max open files            1024                 4096                 files
Max locked memory         16777216             16777216             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       3766                 3766                 signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

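Counting the file descriptors held by the process showed that they were exhausted: the count equals the soft limit. The open descriptors can be listed under /proc/$PID/fd.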

sudo ls /proc/11405/fd | wc -l
1024
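The same two checks can be combined to warn before the limit is actually reached. The following is a rough sketch, not something that was in place during the incident: soft_nofile_limit and fd_usage_ok are hypothetical helper names, and the 80% default threshold is an arbitrary assumption.

```shell
# Print the soft "Max open files" limit from /proc/<pid>/limits text on stdin.
soft_nofile_limit() {
  awk '/^Max open files/ { print $4 }'
}

# Exit 0 if used FDs are below threshold_pct percent of the limit, 1 otherwise.
# The default threshold of 80 is an assumption, not a recommended value.
fd_usage_ok() {
  used=$1 limit=$2 threshold_pct=${3:-80}
  [ $((used * 100)) -lt $((limit * threshold_pct)) ]
}

# Example on the affected host (11405 is the PID from this report):
#   pid=11405
#   used=$(sudo ls /proc/$pid/fd | wc -l)
#   limit=$(sudo cat /proc/$pid/limits | soft_nofile_limit)
#   fd_usage_ok "$used" "$limit" || echo "stns FD usage is high: $used/$limit"
```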

Cause

Established TCP connections accumulated until the process reached its soft limit of 1024 open file descriptors, leaving none available.

As a result, STNS could not accept new connections, which caused the outage.

Response

(1) Raise the file descriptor limit in the systemd service file.

Reference: ファイルディスクリプタ数を拡張する方法 - Qiita ("How to raise the file descriptor limit")

$ sudo vi /etc/systemd/system/stns.service

[Service]
Type=simple
PIDFile=/var/run/stns.pid
ExecStartPre=/usr/sbin/stns --pidfile /var/run/stns.pid --logfile /var/log/stns.log checkconf
ExecStart=/usr/sbin/stns --pidfile /var/run/stns.pid --logfile /var/log/stns.log server
KillMode=process
Restart=always
User=root
Group=root
## Appended at the end of the [Service] section
LimitNOFILE=65535

After making the change, run sudo systemctl daemon-reload && sudo systemctl restart stns.
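After the restart, it is worth confirming that the new limit actually applies to the running process. A small sketch under the assumption that the unit is named stns as above; limit_matches is a hypothetical helper that checks the LimitNOFILE= line printed by systemctl show.

```shell
# Compare a "LimitNOFILE=<n>" line (as printed by `systemctl show`) on stdin
# against the expected value; exit 0 only when they match.
limit_matches() {
  expected=$1
  actual=$(sed -n 's/^LimitNOFILE=//p')
  [ "$actual" = "$expected" ]
}

# Example on the affected host:
#   systemctl show stns -p LimitNOFILE | limit_matches 65535 && echo "limit applied"
#   sudo grep 'Max open files' "/proc/$(systemctl show stns -p MainPID --value)/limits"
```

The second command reads the limit from the running process, so it also catches the case where the unit file was changed but the service was not restarted.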

(2) Tune how long STNS waits before timing out TCP connections.

https://stns.jp/en/configuration

TODO: Consider tuning request_timeout, request_locktime, or cache.

(3) Restart STNS automatically with monit.