RFC 3986 contains an unfortunate oversight in not including the percent sign in the set of reserved characters. However, it does state (Section 2.4):
Because the percent ("%") character serves as the indicator for
percent-encoded octets, it must be percent-encoded as "%25" for that
octet to be used as data within a URI.
Unfortunately, the current implementation of uri_string:normalize/1,2 does not take this into account when decoding percent-encoded characters for normalization. This leaves one with a mess of inconsistently encoded characters, and, worse, the result is ambiguous.
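For illustration, here is a minimal Python sketch (using urllib.parse.unquote as a stand-in for the decoding step; this is not the uri_string code itself) of why unconditional percent-decoding is not idempotent:

```python
from urllib.parse import unquote

# "%2525" is the percent-encoding of the literal string "%25",
# which is itself the encoding of a single "%".
s = "%2525"
once = unquote(s)      # one decoding pass yields "%25"
twice = unquote(once)  # a second pass yields "%"
print(once, twice)
```

After one pass the data is "%25"; after two it is "%". Without knowing how many times the string has already been decoded, the receiver cannot tell which form is the intended data.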
In the linked issue, it is discussed that the current behaviour may be intentional: you must be vigilant and percent-encode your percent signs. However, the current implementation in uri_string is wholly ambiguous as to when data is encoded or decoded; indeed, as we see, normalization proceeds to decode it, but never to encode. To quote the RFC again:
Under normal circumstances, the only time when octets within a URI
are percent-encoded is during the process of producing the URI from
its component parts. This is when an implementation determines which
of the reserved characters are to be used as subcomponent delimiters
and which can be safely used as data. Once produced, a URI is always
in its percent-encoded form.
When a URI is dereferenced, the components and subcomponents
significant to the scheme-specific dereferencing process (if any)
must be parsed and separated before the percent-encoded octets within
those components can be safely decoded, as otherwise the data may be
mistaken for component delimiters.
I think an argument could be made that reconstituting a URL from its parts implies an encoding, and thus that taking it apart must imply a decoding, but this is speculation. What I think is extremely important, however, is that the parse -> recompose cycle is at least idempotent, and likewise normalize/1,2. Currently, given a string "%25252525...", each call to normalize/1 eats away one "25".
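One way to make the percent-decoding step of normalization idempotent, following RFC 3986 Section 6.2.2.2, is to decode only those escapes that encode unreserved characters and leave everything else, including "%25", encoded (merely uppercasing the hex digits). A Python sketch of that rule; normalize_pct is a hypothetical name, not part of the uri_string API:

```python
import re

# Unreserved characters per RFC 3986, Section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def normalize_pct(s: str) -> str:
    """Decode a percent-escape only when it encodes an unreserved
    character; otherwise keep it encoded with uppercase hex digits."""
    def repl(m):
        ch = chr(int(m.group(1), 16))
        return ch if ch in UNRESERVED else "%" + m.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, s)

# "%7E" encodes "~" (unreserved), so it is decoded;
# "%2525" encodes "%25" (reserved percent), so it stays as-is.
print(normalize_pct("%7Ea"))   # "~a"
print(normalize_pct("%2525"))  # "%2525"
```

Because every escape is either fully decoded (unreserved) or left untouched (everything else), applying normalize_pct a second time changes nothing, which is exactly the idempotence property argued for above.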