When TCP client process is killed with SIGKILL, TCP connection port remains after `port_close/1` returns `ok`. This results in ports that are connected to processes which do not exist. When the beam process is asked to `init:stop/0`, it hangs and never stops. A full CPU core is maxed out in what seems to be `epoll_wait`
`gdb bt` output of a beam process that is failing to stop:
gdb -p 12232 (gdb) bt #0 0x00007f39f71cf283 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00000000005b1009 in check_fd_events (max_res=512, timeout_time=-9223372036854775808, ps=0x7f39b6ec0060) at sys/common/erl_poll.c:2273 #2 erts_poll_wait_kp (ps=0x7f39b6ec0060, pr=pr@entry=0x7f39f83c0028, len=len@entry=0x7ffc292e8790, timeout_time=timeout_time@entry=-9223372036854775808) at sys/common/erl_poll.c:2460 #3 0x00000000005b3662 in erts_check_io_kp (do_wait=0) at sys/common/erl_check_io.c:1646 #4 0x00000000004d80d9 in scheduler_wait (rq=0x7f39b6ec6300, esdp=0x7f39b6ec6540, fcalls=<synthetic pointer>) at beam/erl_process.c:3263 #5 erts_schedule (esdp=0x7f39b6ec6540, p=<optimized out>, calls=<optimized out>) at beam/erl_process.c:9813 #6 0x000000000043dc2a in process_main () at beam/beam_emu.c:1338 #7 0x0000000000477750 in erl_start (argc=79, argv=<optimized out>) at beam/erl_init.c:2306 #8 0x000000000043a4d9 in main (argc=<optimized out>, argv=<optimized out>) at sys/unix/erl_main.c:30
This is the line where port is closed: https://github.com/rabbitmq/rabbitmq-common/blob/44eba79a3ba7ee5a6a20d6f2e0b3479f30bdb865/src/rabbit_net.erl#L199-L200
This is called from https://github.com/rabbitmq/rabbitmq-common/blob/44eba79a3ba7ee5a6a20d6f2e0b3479f30bdb865/src/rabbit_reader.erl#L415 . We have applied some additional logging:
diff --git a/src/rabbit_reader.erl b/src/rabbit_reader.erl index b10e8b1..bc7a35a 100644 --- a/src/rabbit_reader.erl +++ b/src/rabbit_reader.erl @@ -412,11 +412,14 @@ start_connection(Parent, HelperSup, Deb, Sock) -> %% the socket. However, to keep the file_handle_cache %% accounting as accurate as possible we ought to close the %% socket w/o delay before termination. - rabbit_net:fast_close(Sock), + R = rabbit_net:fast_close(Sock), + rabbit_log:error("CLOSING PORT WITH SOCK ~p", [Sock]), + rabbit_log:error("RESULT FROM FAST_CLOSE ~p IN PROCESS PID ~p", [R, self()]), rabbit_networking:unregister_connection(self()), rabbit_core_metrics:connection_closed(self()), rabbit_event:notify(connection_closed, [{pid, self()}]) end, + rabbit_log:error("CLOSING RABBIT READER WITH PID ~p", [self()]), done. log_connection_exception(Name, Ex) ->
This is the output that we see in the logs:
=ERROR REPORT==== 25-May-2017::15:21:27 === CLOSING PORT WITH SOCK #Port<0.35375> RESULT FROM FAST_CLOSE ok IN PROCESS PID <0.1523.0> CLOSING RABBIT READER WITH PID <0.1523.0> closing AMQP connection <0.1602.0> (10.0.1.11:44950 -> 10.0.16.34:5672, vhost: '/', user: 'redeliver'): client unexpectedly closed TCP connection
We are using the following code to find ports with connected processes that do not exist on a running Erlang VM:
Ports = erlang:ports(), TcpPortInfos = [erlang:port_info(Port) || Port <- Ports, erlang:port_info(Port, name) =:= {name, "tcp_inet"}], lists:map( fun (PortInfo) -> Pid = proplists:get_value(connected, PortInfo), ProcessInfo = erlang:process_info(Pid, [initial_call, current_function, dictionary]), [{process_info, ProcessInfo} | PortInfo] end, TcpPortInfos ).
Find the output attached in find_ports.log
Attaching erl_crash_running.dump & erl_crash_fails_to_stop.dump
Running this experiment on AWS where the network is throttled makes it easy to reproduce. Happy to help with reproducing this in a live environment.