Erlang/OTP / ERL-1444

uri_string:normalize incorrectly handles percent-encoding



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 23.2
    • Component/s: stdlib


      RFC 3986 has an unfortunate oversight: it does not include the percent sign in the set of reserved characters. However, it does state (Section 2.4):

      Because the percent ("%") character serves as the indicator for
      percent-encoded octets, it must be percent-encoded as "%25" for that
      octet to be used as data within a URI.

      Unfortunately, the current implementation of uri_string:normalize/1,2 does not take this into account when decoding percent-encoded characters for normalization. This leaves one with a mess of partly encoded, partly decoded characters and, what's worse, the result is ambiguous:

      3> uri_string:normalize("/oh%252Fmy").
      4> uri_string:normalize("/oh%2Fmy").
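
      To make the ambiguity concrete, here is a minimal sketch of a decoder that treats "%25" like any other decodable triplet. It is a hypothetical model written for illustration (the module name pct_model and the function naive_normalize/1 are assumptions), not the stdlib implementation:

```erlang
-module(pct_model).
-export([naive_normalize/1]).

%% Hypothetical model, NOT the uri_string source: decode a %XX triplet
%% when the octet is an unreserved character or the percent sign itself.
naive_normalize([$%, H1, H2 | Rest]) ->
    case {hex(H1), hex(H2)} of
        {A, B} when is_integer(A), is_integer(B) ->
            C = A * 16 + B,
            case C =:= $% orelse is_unreserved(C) of
                true  -> [C | naive_normalize(Rest)];
                %% Reserved octets such as "/" stay percent-encoded.
                false -> [$%, H1, H2 | naive_normalize(Rest)]
            end;
        _ ->
            %% Not a valid triplet: keep the "%" and continue.
            [$% | naive_normalize([H1, H2 | Rest])]
    end;
naive_normalize([C | Rest]) -> [C | naive_normalize(Rest)];
naive_normalize([]) -> [].

%% Value of one hexadecimal digit, or the atom error.
hex(C) when C >= $0, C =< $9 -> C - $0;
hex(C) when C >= $A, C =< $F -> C - $A + 10;
hex(C) when C >= $a, C =< $f -> C - $a + 10;
hex(_) -> error.

%% RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~".
is_unreserved(C) when C >= $a, C =< $z -> true;
is_unreserved(C) when C >= $A, C =< $Z -> true;
is_unreserved(C) when C >= $0, C =< $9 -> true;
is_unreserved(C) -> lists:member(C, "-._~").
```

      Under this model, "/oh%252Fmy" (a path whose data contains the literal text "%2F") and "/oh%2Fmy" (a path whose data contains a slash) both come out as "/oh%2Fmy", so the two can no longer be told apart.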

      In the linked issue, it is argued that the current behaviour may be intentional: you must be vigilant and percent-encode your percent signs. However, the current implementation of uri_string is wholly ambiguous as to when data is encoded and when it is decoded. Indeed, as we see, normalization decodes it but never encodes it. To quote the RFC again:

      Under normal circumstances, the only time when octets within a URI
      are percent-encoded is during the process of producing the URI from
      its component parts. This is when an implementation determines which
      of the reserved characters are to be used as subcomponent delimiters
      and which can be safely used as data. Once produced, a URI is always
      in its percent-encoded form.


      When a URI is dereferenced, the components and subcomponents
      significant to the scheme-specific dereferencing process (if any)
      must be parsed and separated before the percent-encoded octets within
      those components can be safely decoded, as otherwise the data may be
      mistaken for component delimiters.
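
      The order the RFC describes (split into components first, decode afterwards) could look roughly like the sketch below. The module and function names are hypothetical; it assumes uri_string:parse/1 succeeds and that uri_string:percent_decode/1 is available (OTP 23.2 or later):

```erlang
-module(path_segments).
-export([decoded_segments/1]).

%% Parse first, split the path on its "/" delimiters, and only then
%% decode each segment. Empty segments are dropped for simplicity.
%% Crashes with badmatch if uri_string:parse/1 returns an error tuple.
decoded_segments(URI) ->
    #{path := Path} = uri_string:parse(URI),
    [uri_string:percent_decode(Segment)
     || Segment <- string:lexemes(Path, "/")].
```

      Because the split happens while the path is still encoded, an encoded slash ("%2F") ends up as data inside a single segment instead of being mistaken for a delimiter.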

      (emphasis mine)

      I think an argument could be made that reconstituting a URL from its parts implies encoding, and thus taking it apart must imply decoding, but this is speculation. What I think is extremely important, however, is that the parse -> recompose cycle is at least idempotent, and that normalize/1,2 is too. Currently, given a string "%25252525...", each call to normalize/1 eats away one "25".
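
      The idempotence requirement is easy to state as a check. The helper below is a hypothetical sketch (module idem and both function names are assumptions) that works for any one-argument function:

```erlang
-module(idem).
-export([is_idempotent/2, fixpoint_steps/2]).

%% True if applying F a second time changes nothing for input X.
is_idempotent(F, X) ->
    Once = F(X),
    F(Once) =:= Once.

%% Number of applications of F needed before the value stops changing.
%% Does not terminate if F never reaches a fixed point for X.
fixpoint_steps(F, X) -> fixpoint_steps(F, X, 0).

fixpoint_steps(F, X, N) ->
    case F(X) of
        X -> N;                           % X is bound: matches only if unchanged
        Y -> fixpoint_steps(F, Y, N + 1)
    end.
```

      For example, idem:is_idempotent(fun uri_string:normalize/1, "/%25252525") tests the property described above; the result depends on the OTP version in use.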





              peterdmv Péter Dimitrov
              alex0player Alex S