Erlang/OTP / ERL-1112

DTLS socket unable to receive on Kubernetes node scale-up


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: 22.1
    • Fix Version/s: None
    • Component/s: ssl
    • Labels: None

      Description

      This issue has been observed on OTP 22 (erts-10.5.6). The underlying platform is Azure Kubernetes Service (AKS) version 1.13.11 with VM nodes running Ubuntu 16.04.6 LTS.

      A Docker container is started from the image elixir:1.9.4-alpine, and a simple Erlang DTLS server is run inside it:

      :ssl.start()
      opts = [
        protocol: :dtls, active: true, mode: :binary,
        versions: [:'dtlsv1.2'], verify: :verify_none, fail_if_no_peer_cert: false,
        cacertfile: "/cert/ssl.crt", keyfile: "/cert/ssl.key", certfile: "/cert/ssl.crt"
      ]
      {:ok, listen_socket} = :ssl.listen(49002, opts)
      {:ok, hsocket} = :ssl.transport_accept(listen_socket, 10_000)
      {:ok, socket} = :ssl.handshake(hsocket, 10_000)
      

      The server is deployed to Kubernetes and exposed with a Kubernetes service as follows:

      apiVersion: extensions/v1beta1
      kind: Deployment
      metadata:
        labels:
          app: dtls-server
        name: dtls-server
      spec:
        selector:
          matchLabels:
            app: dtls-server
        template:
          metadata:
            labels:
              app: dtls-server
          spec:
            containers:
            - image: elixir:1.9.4-alpine
              name: dtls-server
              command: ["top"]
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: dtls-server
      spec:
        type: LoadBalancer
        ports:
        - name: s-server-dtls
          port: 49001
          protocol: UDP
        - name: erlang-dtls
          port: 49002
          protocol: UDP
        selector:
          app: dtls-server
      

      From my local machine (running Arch Linux with OpenSSL version 1.1.1d 10 Sep 2019), I can connect to the server using openssl s_client:

      openssl s_client -dtls1_2 ip:49002
      

      Using s_client, I am able to send and receive data.

      The AKS cluster is then scaled up, that is, an extra node is added to the cluster:

      az aks nodepool scale -g rg --cluster-name cluster --name agentpool --node-count 2
      

      After some time, kubectl get events shows the node being added:

      15m         Normal    Starting                  Node         Starting kubelet.
      15m         Normal    NodeHasSufficientPID      Node         Node aks-agentpool-90776725-vmss000001 status is now: NodeHasSufficientPID
      15m         Normal    NodeAllocatableEnforced   Node         Updated Node Allocatable limit across pods
      15m         Normal    NodeHasNoDiskPressure     Node         Node aks-agentpool-90776725-vmss000001 status is now: NodeHasNoDiskPressure
      15m         Normal    NodeHasSufficientMemory   Node         Node aks-agentpool-90776725-vmss000001 status is now: NodeHasSufficientMemory
      15m         Normal    NodeReady                 Node         Node aks-agentpool-90776725-vmss000001 status is now: NodeReady
      15m         Normal    RegisteredNode            Node         Node aks-agentpool-90776725-vmss000001 event: Registered Node aks-agentpool-90776725-vmss000001 in Controller
      15m         Normal    Starting                  Node         Starting kube-proxy.
      14m         Normal    UpdatedLoadBalancer       Service      Updated load balancer with new hosts
      

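      At some point during this scale operation, the Erlang DTLS server can no longer receive any packets sent from the client. In the iex session below, the first flush prints a message received from the client, while the second prints nothing: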

      iex(7)> {:ok, socket} = :ssl.handshake(hsocket, 20_000)
      {:ok,
       {:sslsocket,
        {:gen_udp, {#PID<0.138.0>, {{{10, 240, 0, 5}, 41519}, #Port<0.6>}},
         :dtls_connection}, [#PID<0.140.0>]}}
      iex(8)> flush
      {:ssl,
       {:sslsocket,
        {:gen_udp, {#PID<0.138.0>, {{{10, 240, 0, 5}, 41519}, #Port<0.6>}},
         :dtls_connection}, [#PID<0.140.0>]}, "abc\n"}
      :ok
      iex(9)> flush
      :ok
      

      However, the server is still able to send messages to the client.
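
      The socket term above records the peer as the private address 10.240.0.5, port 41519. One way to check whether datagrams still reach the pod from that same address and port after the scale-up is to capture traffic inside the container. This is a sketch only: the function name and the pod-name argument are hypothetical, and it assumes tcpdump has been installed in the container.

```shell
# Hypothetical helper: watch UDP datagrams arriving for the DTLS port inside the pod.
# $1 = pod name (placeholder), $2 = UDP port (defaults to 49002, the port used above).
# Assumes tcpdump is installed in the container.
capture_dtls() {
  kubectl exec "$1" -- tcpdump -n -i any udp port "${2:-49002}"
}
```

      Running capture_dtls with the pod name before and after the scale operation would allow the source address and port of the incoming datagrams to be compared with the peer recorded in the socket.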

      At the same time, an openssl s_server instance was running in the same container, and it remained able to send and receive throughout the scaling operation.
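
      The exact command used for the comparison server is not given. A minimal sketch of how such a server could be started is shown here, assuming it listens on UDP port 49001 (inferred from the s-server-dtls port in the Service definition) and reuses the certificate and key of the Erlang server. The function name is hypothetical.

```shell
# Hypothetical helper: start an OpenSSL DTLS 1.2 server for comparison.
# $1 = UDP port (defaults to 49001, assumed from the "s-server-dtls" Service port).
# Certificate and key paths are the ones passed to the Erlang server above.
start_s_server() {
  openssl s_server -dtls1_2 -accept "${1:-49001}" -cert /cert/ssl.crt -key /cert/ssl.key
}
```

      A client would then connect in the same way as for the Erlang server, using port 49001 instead of 49002.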

            People

            Assignee:
            otp_team_ps Team PS
            Reporter:
            ahovgaard Anders Kiel Hovgaard