Uploaded image for project: 'Erlang/OTP'
  1. Erlang/OTP
  2. ERL-1278

erl_child_setup might sleep in select infinitely

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Help Wanted
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 22.2.7
    • Fix Version/s: None
    • Component/s: compiler, erts, kernel
    • Labels:

      Description

      Hi,

      I observed that a compilation process might hang from time to time on Solaris SPARC. It happens more often when the system is under heavy load (multiple build processes) but not reproducible on Solaris X86. Note that the hang is not project related, I observed a hang during the Elixir build as well as during RabbitMQ build...

      I found a bug on RedHat bugzilla that seems to be this issue:
      https://bugzilla.redhat.com/show_bug.cgi?id=1770256

      The proposed workaround to use ‘--disable-kernel-poll’ worked as well, I was not able to reproduce build hangs in many attempts. However, the bug is not in kernel (Solaris uses different kernel than Linux) nor specific to s390x architecture. Maybe, there is a bug in erts related to endianness of sparc and s390x architectures (Big Endian).

      I used gdb to attach to hung processes.

      erl_child_setup process:

      (gdb) backtrace
      #0  0x00007fffff0e1280 in __pollsys () from /lib/64/libc.so.1
      #1  0x00007fffff0cd5cc in _pollsys () from /lib/64/libc.so.1
      #2  0x00007fffff02c9c4 in pselect () from /lib/64/libc.so.1
      #3  0x00007fffff02cd68 in select () from /lib/64/libc.so.1
      #4  0x0000000100002848 in main (argc=<optimized out>, argv=<optimized out>) at sys/unix/erl_child_setup.c:483
      
      (gdb) frame 4
      #4  0x0000000100002848 in main (argc=<optimized out>, argv=<optimized out>) at sys/unix/erl_child_setup.c:483
      483             res = select(max_fd+1, &read_fds, NULL, NULL, NULL);
      
      (gdb) list
      478             FD_ZERO(&read_fds);
      479             FD_SET(uds_fd, &read_fds);
      480             FD_SET(sigchld_pipe[0], &read_fds);
      481             DEBUG_PRINT("child_setup selecting on %d, %d (%d)",
      482                     uds_fd, sigchld_pipe[0], max_fd);
      483             res = select(max_fd+1, &read_fds, NULL, NULL, NULL);
      484
      485             if (res < 0) {
      486                 if (errno == EINTR) continue;
      487                 ABORT("Select failed: %d (%d)",res, errno);
      
      (gdb) print sigchld_pipe
      $1 = {4, 5}
      
      (gdb) print uds_fd
      $2 = 3
      

      It is obvious, that erl_child_setup waits for an input event on uds_fd and sigchild_pipe[0] descriptors but the event never arrives.

      Parent process of erl_child_setup (showed only interesting backtraces):

      (gdb) thread 2
      [Switching to thread 2 (LWP    272        )]
      #0  0x00007fffff0e1070 in ioctl () from /lib/64/libc.so.1
      
      (gdb) backtrace 5
      #0  0x00007fffff0e1070 in ioctl () from /lib/64/libc.so.1
      #1  0x00000001002b3500 in check_fd_events (timeout_time=9223372036854775807, max_res=512, pr=0x1067d1568, ps=0x1067ce290) at sys/common/erl_poll.c:1841
      #2  erts_poll_wait (ps=0x1067ce290, pr=0x1067d1568, len=0xffffffff2e883cc8, tpd=0xffffffff73b0ccb0, timeout_time=9223372036854775807) at sys/common/erl_poll.c:1913
      #3  0x00000001002ba5fc in erts_check_io (psi=0x1067d1528, timeout_time=9223372036854775807) at sys/common/erl_check_io.c:1680
      #4  0x00000001000f7780 in poll_thread (arg=<optimized out>) at beam/erl_process.c:3256
      (More stack frames follow...)
      
      (gdb) frame 1
      #1  0x00000001002b3500 in check_fd_events (timeout_time=9223372036854775807, max_res=512, pr=0x1067d1568, ps=0x1067ce290) at sys/common/erl_poll.c:1841
      1841            res = ioctl(ps->kp_fd, DP_POLL, &poll_res);
      
      (gdb) list
      1836            struct dvpoll poll_res;
      1837            int nfds = (int) erts_atomic_read_nob(&ps->no_of_user_fds) + 1 /* wakeup pipe */;
      1838            poll_res.dp_nfds = nfds < max_res ? nfds : max_res;
      1839            poll_res.dp_fds = pr;
      1840            poll_res.dp_timeout = (int) get_timeout(ps, 1000, timeout_time);
      1841            res = ioctl(ps->kp_fd, DP_POLL, &poll_res);
      1842    #elif ERTS_POLL_USE_POLL && defined(HAVE_PPOLL) /* --- ppoll ---------------- */
      1843            struct timespec ts;
      1844            struct timespec *tsp = &ts;
      1845            timeout = get_timeout_timespec(ps, &ts, timeout_time);
      
      (gdb) print *ps
      $4 = {id = 0, internal_fd_limit = 7, no_of_user_fds = {counter = 2}, kp_fd = 4, oneshot = 1, poll_fds = 0x0, poll_fds_ix = 2, fds_status = 0x1067ce520, fds_status_len = 1024, update_requests = {next = 0x0, len = 0, fds = {12, 15,
            0 <repeats 126 times>}}, curr_upd_req_block = 0x1067ce2c8, have_update_requests = {counter = 0}, mtx = {mtx = {pt_mtx = {__pthread_mutex_flags = {__pthread_mutex_flag1 = 4, __pthread_mutex_flag2 = 0 '\000',
                __pthread_mutex_ceiling = 0 '\000', __pthread_mutex_type = 0, __pthread_mutex_magic = 19800}, __pthread_mutex_lock = {__pthread_mutex_lock64 = {__pthread_mutex_pad = "\000\000\000\000\000\000\000"}, __pthread_mutex_lock32 = {
                  __pthread_ownerpid = 0, __pthread_lockword = 0}, __pthread_mutex_owner64 = 0}, __pthread_mutex_data = 0}}}, timeout_time = 0, wakeup_state = {counter = 0}, wake_fds = {5, 6}}
      
      (gdb) print poll_res
      $5 = {dp_fds = 0x1067d1568, dp_nfds = 3, dp_timeout = -1}
      
      (gdb) print *(poll_res.dp_fds)
      $9 = {fd = 12, events = 1, revents = -62}
      
      (gdb) thread 274
      [Switching to thread 274 (Thread 2 (LWP 2))]
      #0  0x00007fffff0e1374 in __read () from /lib/64/libc.so.1
      
      (gdb) backtrace
      #0  0x00007fffff0e1374 in __read () from /lib/64/libc.so.1
      #1  0x00007fffff0cb47c in read () from /lib/64/libc.so.1
      #2  0x00000001002ac14c in signal_dispatcher_thread_func (unused=<optimized out>) at sys/unix/sys.c:993
      #3  0x000000010033c6f0 in thr_wrapper (vtwd=0xffffffff7fffd888) at pthread/ethread.c:118
      #4  0x00007fffff0dc678 in _lwp_start () from /lib/64/libc.so.1
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
      
      (gdb) frame 2
      #2  0x00000001002ac14c in signal_dispatcher_thread_func (unused=<optimized out>) at sys/unix/sys.c:993
      993                 res = read(sig_notify_fds[0], (void *) &sb.buf[i], sizeof(int) - i);
      
      (gdb) list
      988             Eterm signal;
      989             int res, i = 0;
      990             /* Block on read() waiting for a signal notification to arrive... */
      991
      992             do {
      993                 res = read(sig_notify_fds[0], (void *) &sb.buf[i], sizeof(int) - i);
      994                 i += res > 0 ? res : 0;
      995             } while ((i < sizeof(int) && res >= 0) || (res < 0 && errno == EINTR));
      996
      997             if (res < 0) {
      
      (gdb) print sig_notify_fds
      $11 = {7, 8}
      

      Please let me know if you need more information or debugging.

      Thank you for the help,
      Libor

        Attachments

          Activity

            People

            Assignee:
            lukas Lukas Larsson
            Reporter:
            lbukata Libor Bukata
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Dates

              Created:
              Updated: