STNS is not responding
STNS stopped working, and users could not log in.
Timeline
2020/11/22
15:47 (JST) Incident confirmed
Screenshots and logs
Checking the connections with lsof showed more than 1,000 open connections.
sudo lsof -i:1104
(omitted)
stns 11405 root 989u IPv6 8231237 0t0 TCP base-j:1104->192.168.100.136:37008 (ESTABLISHED)
stns 11405 root 990u IPv6 8231239 0t0 TCP base-j:1104->192.168.100.138:33948 (ESTABLISHED)
stns 11405 root 991u IPv6 8231241 0t0 TCP base-j:1104->192.168.100.97:47994 (ESTABLISHED)
stns 11405 root 992u IPv6 8231243 0t0 TCP base-j:1104->192.168.100.166:40108 (ESTABLISHED)
stns 11405 root 993u IPv6 8231245 0t0 TCP base-j:1104->192.168.100.211:56914 (ESTABLISHED)
stns 11405 root 994u IPv6 8231247 0t0 TCP base-j:1104->192.168.100.137:42756 (ESTABLISHED)
stns 11405 root 995u IPv6 8231249 0t0 TCP base-j:1104->192.168.100.135:34094 (ESTABLISHED)
stns 11405 root 996u IPv6 8231251 0t0 TCP base-j:1104->192.168.100.90:37546 (ESTABLISHED)
stns 11405 root 997u IPv6 8232107 0t0 TCP base-j:1104->192.168.100.3:48922 (ESTABLISHED)
stns 11405 root 998u IPv6 8232295 0t0 TCP base-j:1104->192.168.100.211:59072 (ESTABLISHED)
stns 11405 root 999u IPv6 8232538 0t0 TCP base-j:1104->192.168.100.211:59756 (ESTABLISHED)
stns 11405 root 1000u IPv6 8232648 0t0 TCP base-j:1104->192.168.100.3:48944 (ESTABLISHED)
stns 11405 root 1001u IPv6 8233718 0t0 TCP base-j:1104->192.168.100.3:48990 (ESTABLISHED)
stns 11405 root 1002u IPv6 8233720 0t0 TCP base-j:1104->192.168.100.3:48992 (ESTABLISHED)
stns 11405 root 1003u IPv6 8235320 0t0 TCP base-j:1104->192.168.100.3:49064 (ESTABLISHED)
stns 11405 root 1004u IPv6 8235322 0t0 TCP base-j:1104->192.168.100.3:49066 (ESTABLISHED)
stns 11405 root 1005u IPv6 8235403 0t0 TCP base-j:1104->192.168.100.211:37756 (ESTABLISHED)
stns 11405 root 1006u IPv6 8236077 0t0 TCP base-j:1104->192.168.100.211:39226 (ESTABLISHED)
stns 11405 root 1007u IPv6 8236079 0t0 TCP base-j:1104->192.168.100.211:39228 (ESTABLISHED)
stns 11405 root 1008u IPv6 8236081 0t0 TCP base-j:1104->192.168.100.211:39230 (ESTABLISHED)
stns 11405 root 1009u IPv6 8237371 0t0 TCP base-j:1104->192.168.100.3:49142 (ESTABLISHED)
stns 11405 root 1010u IPv6 8237373 0t0 TCP base-j:1104->192.168.100.3:49144 (ESTABLISHED)
stns 11405 root 1011u IPv6 8238121 0t0 TCP base-j:1104->192.168.100.135:34768 (ESTABLISHED)
stns 11405 root 1012u IPv6 8238123 0t0 TCP base-j:1104->192.168.100.160:39686 (ESTABLISHED)
stns 11405 root 1013u IPv6 8238125 0t0 TCP base-j:1104->192.168.100.204:59974 (ESTABLISHED)
stns 11405 root 1014u IPv6 8238127 0t0 TCP base-j:1104->192.168.100.214:50068 (ESTABLISHED)
stns 11405 root 1015u IPv6 8238129 0t0 TCP base-j:1104->192.168.100.137:43492 (ESTABLISHED)
stns 11405 root 1016u IPv6 8238131 0t0 TCP base-j:1104->192.168.100.161:39802 (ESTABLISHED)
stns 11405 root 1017u IPv6 8238133 0t0 TCP base-j:1104->192.168.100.71:33770 (ESTABLISHED)
stns 11405 root 1018u IPv6 8238164 0t0 TCP base-j:1104->192.168.100.6:37758 (ESTABLISHED)
stns 11405 root 1019u IPv6 8238166 0t0 TCP base-j:1104->192.168.100.152:46948 (ESTABLISHED)
stns 11405 root 1020u IPv6 8238168 0t0 TCP base-j:1104->192.168.100.136:37724 (ESTABLISHED)
stns 11405 root 1021u IPv6 8238170 0t0 TCP base-j:1104->192.168.100.138:34678 (ESTABLISHED)
stns 11405 root 1022u IPv6 8238172 0t0 TCP base-j:1104->192.168.100.97:57412 (ESTABLISHED)
stns 11405 root 1023u IPv6 8238174 0t0 TCP base-j:1104->192.168.100.166:54974 (ESTABLISHED)
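To see which clients hold the connections, the lsof output can be grouped by remote address. The following is a minimal sketch and was not part of the original investigation. conns_per_client is a hypothetical helper name, and the parsing assumes the line layout shown above, where the second-to-last field has the form local:port->remote:port and the last field is the connection state.

```shell
# conns_per_client: read `lsof -i:PORT` output on stdin and print
# "<count> <remote address>" per client, busiest client first.
# (Hypothetical helper; assumes the lsof layout shown in this report.)
conns_per_client() {
    awk '/ESTABLISHED/ {
        # Second-to-last field looks like local:port->remote:port.
        split($(NF-1), ends, "->")
        # Drop the trailing :port to keep only the remote address.
        sub(/:[0-9]+$/, "", ends[2])
        count[ends[2]]++
    }
    END { for (ip in count) print count[ip], ip }' | sort -k1,1nr -k2,2
}
```

A possible invocation is sudo lsof -i:1104 | conns_per_client | head, which lists the clients holding the most connections to port 1104.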
Check the file descriptor limit of the process itself. The Soft Limit for Max open files is 1024.
sudo cat /proc/11405/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             3766                 3766                 processes
Max open files            1024                 4096                 files
Max locked memory         16777216             16777216             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       3766                 3766                 signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Counting the open file descriptors showed that they were exhausted. The FD count can be checked under /proc/$PID/fd.
sudo ls /proc/11405/fd | wc -l
1024
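The two checks above (the soft limit from /proc/$PID/limits and the count of entries in /proc/$PID/fd) can be combined into one helper that reports how close a process is to its limit. This is a minimal sketch: fd_headroom is a hypothetical name, and the optional second argument exists only so the function can be pointed at a fixture directory instead of /proc. Reading the fd directory of a root-owned process requires a root shell.

```shell
# fd_headroom PID [PROC_ROOT]
# Print "open=<n> soft=<limit> pct=<percent used>" for the given process.
# (Hypothetical helper; PROC_ROOT defaults to /proc.)
fd_headroom() {
    pid="$1"
    root="${2:-/proc}"
    # Each entry under fd/ is one open descriptor.
    open=$(ls "$root/$pid/fd" | wc -l | tr -d ' ')
    # Row layout: Max open files <soft> <hard> files, so field 4 is the soft limit.
    soft=$(awk '/^Max open files/ {print $4}' "$root/$pid/limits")
    echo "open=$open soft=$soft pct=$((open * 100 / soft))"
}
```

With the values recorded in this incident (1024 open descriptors against a soft limit of 1024), the reported usage would be 100 percent.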
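Cause
The number of established TCP connections grew until the process reached its soft limit of 1024 open file descriptors, exhausting them.
As a result, STNS could not accept new connections, which caused the outage.
Response
(1) Raise the file descriptor limit in the systemd service file.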
$ sudo vi /etc/systemd/system/stns.service
[Service]
Type=simple
PIDFile=/var/run/stns.pid
ExecStartPre=/usr/sbin/stns --pidfile /var/run/stns.pid --logfile /var/log/stns.log checkconf
ExecStart=/usr/sbin/stns --pidfile /var/run/stns.pid --logfile /var/log/stns.log server
KillMode=process
Restart=always
User=root
Group=root
## Appended at the end of the [Service] section
LimitNOFILE=65535
After the change, run sudo systemctl daemon-reload && sudo systemctl restart stns.
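After the restart it is worth confirming that the running process actually picked up the new limit. The sketch below re-reads /proc/$PID/limits and compares the soft limit against an expected minimum. nofile_at_least is a hypothetical helper name, and the optional third argument exists only so the function can be run against a fixture directory instead of /proc.

```shell
# nofile_at_least PID WANT [PROC_ROOT]
# Print the soft "Max open files" limit of PID and succeed (exit 0)
# only if it is at least WANT. (Hypothetical helper.)
nofile_at_least() {
    pid="$1"
    want="$2"
    root="${3:-/proc}"
    # Row layout: Max open files <soft> <hard> files, so field 4 is the soft limit.
    soft=$(awk '/^Max open files/ {print $4}' "$root/$pid/limits")
    echo "soft limit: $soft"
    [ "$soft" -ge "$want" ]
}
```

Assuming the unit is named stns as in the service file above and a systemd version that supports the --value option, one way to call it is nofile_at_least "$(systemctl show -p MainPID --value stns)" 65535.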
(2) Adjust how long STNS waits before timing out TCP connections.
https://stns.jp/en/configuration
TODO: consider tuning request_timeout, request_locktime, or cache
(3) Automatic restart with monit