Uploaded image for project: 'Erlang/OTP'
  1. Erlang/OTP
  2. ERL-90

Race condition in inet:tcp_controlling_process



    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 18.2.1
    • Fix Version/s: 19.3
    • Component/s: kernel
    • Labels:



      I have found a race condition in how gen_tcp:controlling_process (which calls inet:tcp_controlling_process) does handoff in a situation where the child process is not barrier synchronized with the parent.

      This might not be a bug, per se, but if not, then unexpected or undocumented behaviour at the least. If so, perhaps a warning to the documentation should be added.

      Whether this is a bug or not hinges on the desired behavior of gen_tcp:controlling_process with regards to changes in the port’s active state. Unfortunately documentation is of no help here as it is very terse (see http://erlang.org/doc/man/gen_tcp.html#controlling_process-2): "Assigns a new controlling process Pid to Socket. The controlling process is the process which receives messages from the socket."

      First I’ll give you a bit more discussion on controlling_process behavior followed by a short summary of the (potential) race condition, then some discussion followed by more detailed description and a test case.

      Behavior of controlling_process

      First while gen_tcp:controlling_process does not make any mention of what happens to messages from a socket that might be in the queue of the parent process. It is possible that an accept’d socket receives data before control of the socket is handed over. Consider the following server loop (CODE SAMPLE 1):

      server_loop(S) ->
           {ok,C} = gen_tcp:accept(S),
           gen_tcp:setopts(C, [{active, true}]),
           Pid = spawn(fun() -> handle_client(C) end),
           gen_tcp:controlling_process(C, Pid),
      handle_client(C) ->
              … socket C input ...

      While documentation makes no reference here, it seems that the general expectation is that gen_tcp:controlling_process (actually inet:tcp_controlling_process) does “something magic” and moves packets from the old controlling process's queue to the new controlling process.

      See http://stackoverflow.com/questions/11409656/erlang-avoiding-race-condition-with-gen-tcpcontrolling-process which asserts that the case with messages already in parent’s queue are automatically moved to the child’s queue. The problem of reliably changing control of a socket is already discussed in http://fullof.bs/a-better-erlang-tcp-listening-pattern-addressingthe-fast-packet-loss-problem/ (albeit the proposed solution does not work). Also http://erlangcentral.org/wiki/index.php/Building_a_Non-blocking_TCP_server_using_OTP_principles has a solution that (purposefully or accidentally) handles passing of socket control in a way that has no race condition (already in the first version from 2007).

      The actual implementation of inet:tcp_controlling_process (https://github.com/erlang/otp/blob/master/lib/kernel/src/inet.erl#L1531) makes an attempt to move packets to the new controlling process:

      case {tcp_sync_input(S, NewOwner, false), SetOptRes} of
                  {true, _} ->  %% socket already closed
                  {false, ok} ->
                      try erlang:port_connect(S, NewOwner) of
                      true ->
                          unlink(S), %% unlink from port

      where tcp_sync_input will match messages for socket S and re-send them to NewOwner.

      This synchronizing behavior was present already in R13B03 release (https://github.com/erlang/otp/blob/84adefa331c4159d432d22840663c38f155cd4c1/lib/kernel/src/inet.erl#L1238) and has been modified only slightly after (better error handling, mostly). While the behavior is not documented it has clearly been written purposefully with the goal of ensuring that no packets are left in parent’s queue during handoff of the socket to another process. Please note that the code handles one race condition successfully (by setting socket to

      {active, false} state) — this ensures that the CODE SAMPLE 1 above will never lose any packets. However it does not handle another possible race condition (which is impossible actually to handle, which I’ll later show).

      h2. Race condition

      When SMP is enabled (-smp enable) with more than one scheduler (either the number of cores >1, or increased via +S) it is possible that in tcp_controlling_process between calls to tcp_sync_input and erlang:port_connect one of these can happen:

      # The socket is set to {active, true} state and one or more packets are received and put into the parent process queue (it is the controlling process until port_connect is complete).
      # The socket is set to {active, once} state and a single packet is received and put to parent’s queue.

      The only requirements to this to happen is that 1) SMP is enabled and 2) child process starts working with the socket before its ownership is transferred. A small change to the first code shows this (CODE SAMPLE 2):


      server_loop(S) ->
      {ok,C} = gen_tcp:accept(S),
      Pid = spawn(fun() -> handle_client(C) end),
      gen_tcp:controlling_process(C, Pid),

      handle_client(C) ->
      gen_tcp:setopts(C, [{active, once}]),
      … socket C input ...
      It is possible that 1) child process will set the socket to {active, once} state after tcp_sync_input has finished but before controlling process has changed, and 2) data arrives on the TCP socket, and 3) input on TCP socket is noticed and 4) input is handled and it, as a message, is put to current controlling process’s queue (which at this point is still the parent process, not child).
      See the attached inet_race_test escript program. If no messages are left on the parent process’s queue then it should not output anything. Yet when I run it locally (MBP OSX) I get the following output:
      $ ./inet_race_test
      Starting server loop with race condition
      #175 PARENT <0.2.0> Received: ping.
      #681 PARENT <0.2.0> Received: ping.
      #797 PARENT <0.2.0> Received: ping.
      #827 PARENT <0.2.0> Received: ping.
      #950 PARENT <0.2.0> Received: ping.
      #1490 PARENT <0.2.0> Received: ping.

      h2. Details, details ...

      The inet:tcp_controlling_process is old. I don’t know whether it pre-dates introduction of SMP (in R11B). The code works perfectly in non-concurrent (non-SMP) environments. It will only break if the I/O handler is run truly in parallel.

      The problem here is that the code makes the (pretty natural for Erlang programs) assumption that there is no shared state. Unfortunately the underlying socket resource has state that can be simultanously accessed and modified by multiple processes — it is a truly concurrently shared resource. Thus the usual problems with true concurrency come to light — unexpected interleaving of conflicting execution sequences in this case.

      Now for some BEAM digging ...

      When SMP is enabled, BEAM will schedule IO poll on one of the schedulers (either with a timeout until next known wakeup if there are fewer running tasks than schedulers, or just poll once for any pending IO if runnable processes exist) from scheduler_wait. This goes to erl_sys_schedule, which in turn calls erts_check_io (or erts_check_io_nkp or the win32/ose version depending on environment). This IO poll will run concurrently to other processes (no global locks held!). This goes on to erts_poll_wait, then check_fd_events at which epoll/kpoll/select is used to check which fds are ready for IO. The information on ready fds then propagates back to erts_check_io which then schedules new IO tasks on the ports (iready/oready/eready in erts/emulator/sys/common/erl_check_io.c).

      This task then will eventually get scheduled on the correct driver (tcp_inet_driver_entry in common/inet_drv.c since we’re discussing TCP sockets, although note that similar problems may occur with other file descriptor based resources, I haven’t checked) which in turn will then finally read the socket and put the message to the current controlling process's message queue: tcp_inet_input -> tcp_recv -> tcp_deliver -> tcp_reply_binary_data -> multiple paths but eventually erl_drv_output_term which after some additional steps finally puts the TCP input as a message to the queue of the controlling process of the socket.

      And there is absolutely only local locking in all of this. No long-term locks. It is entirely possible to schedule another process that will eventually call erl_drv_output_term on the not-yet-disconnected parent process's input port when after tcp_sync_check (if someone, in parallel, changes the socket state to active/once).

      (I actually ran into this race condition in 2014 and wrote about it in http://santtu.iki.fi/2014/03/06/network-and-parallelism-in-erlang/. A while ago I got an email from Robert Gionea who pointed out an error in my logic, which got me digging more into what actually happens. It turns out that the explanation I gave for the race condition was incorrect, but the race condition itself was real, and caused by a slightly different conditions as explained here. Also at that point I thought that the behavior was how it was supposed to work, but since then I came to realize it is not clear what is actually the desired behavior — and there are clues that it is not working as intended.)

      h2. Fixing it

      The fix is really simple in code using gen_tcp. The problem is that the child process makes an assumption that it is the controlling process. The code just has to make sure that assumption holds by adding barrier synchronization (CODE SAMPLE 3):


      %% BTW this assumes that S is in {active, false}

      state so that C starts in that state too
      server_loop(S) ->


      = gen_tcp:accept(S),
      Pid = spawn(fun() -> receive Socket -> handle_client(Socket) end end),
      gen_tcp:controlling_process(C, Pid),
      Pid ! C,

      In this scenario handle_client/1 is called only when control of the socket is guaranteed to be child’s.
      The attached test program also contains this safe server loop variant and can be used via -safe option:
      $ ./inet_race_test -safe
      Starting safe server loop

      At least on my machine it will not print out anything. I actually forgot it running for about 8 hours and got nothing. (Running without -safe flag usually produces output within a few seconds.)

      Fixing it in Erlang kernel

      Nope. Can’t be done.

      At least not without adding a mutual exclusion lock to ensure that there are no changes to socket state between tcp_sync_input and port_connect. The core core core problem here is that the program assumes tcp_controlling_process is the sole process operating on the socket. It assumes atomicity where there is none.

      (Mutexes in Erlang kernel. Eeeech!!)

      So I see three potential solutions:

      • Not a bug. Add notice to the documentation that you must not change socket state while controlling_process is being called (e.g. “add barrier synchronization”).
      • Not a bug, and current behavior is undefined behavior that should not be relied on. E.g. just scrap the whole synchronization mechanism. As far as I can see this would not change the documented behavior one bit. (But it probably would break tons of programs, though.)
      • Bug. I don’t know how really. Maybe modify calls to gen_tcp that change socket state to make sure they do not run concurrently to tcp_controlling_process (e.g. add a big hunking mutex).




            lukas Lukas Larsson
            santtu santtu
            OTP team:
            0 Vote for this issue
            4 Start watching this issue