crypto:hmac/3 in OTP-22, and crypto:mac(hmac, ...) in OTP-23, are several orders of magnitude slower than OTP-21's crypto:hmac/3 in situations with moderate concurrency.
The attached test case simulates performing SASL authentication while starting a number of workers to consume from Kafka, which boils down to performing a large number of crypto hmac operations. The test case fires off a group of concurrent workers and yields the longest time any one worker needed to perform the initial authentication. This is repeated with increasing group sizes: 10, 20, 40, 80, and 160.
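The shape of the test case can be sketched as follows. This is a minimal illustration, not the attached code: the module name, key/message sizes, and iteration count are made up, and it uses the OTP-23 crypto:mac/4 API (substitute crypto:hmac/3 on older releases).

```erlang
%% Sketch of the benchmark shape: spawn NWorkers processes that each
%% perform many HMAC operations, and report the longest time (in
%% microseconds) any single worker needed. All parameters below are
%% illustrative, not the actual test case's values.
-module(hmac_bench).
-export([run/1]).

run(NWorkers) ->
    Parent = self(),
    Key = crypto:strong_rand_bytes(32),
    Msg = crypto:strong_rand_bytes(64),
    Pids = [spawn_link(fun() -> Parent ! {self(), worker(Key, Msg)} end)
            || _ <- lists:seq(1, NWorkers)],
    %% Worst case over the whole group, as in the reported measurements.
    lists:max([receive {Pid, T} -> T end || Pid <- Pids]).

worker(Key, Msg) ->
    {T, _} = timer:tc(fun() ->
        [crypto:mac(hmac, sha256, Key, Msg) || _ <- lists:seq(1, 1000)]
    end),
    T.
```

Running `hmac_bench:run(N)` for N = 10, 20, 40, 80, 160 reproduces the increasing group sizes described above.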
The following measurements are from an AWS r5.16xlarge node with 64 vCPUs, running CentOS 7.
With OTP-21, the curve is flat until the worker count exceeds the number of vCPUs, at which point it grows linearly in the concurrency divided by the number of vCPUs. With 160 workers the worst case took 58 ms.
With OTP-22 and OTP-23, the baseline (10 workers) is an order of magnitude higher than with OTP-21, and the timings grow exponentially with the group size. With 160 workers the worst case is 28 seconds, which is 484 times higher than with OTP-21.
In our production code this causes the application to fail completely with OTP-22: all connection attempts time out and restart without ever making progress. (We may be able to work around the issue, but that is beside the point.)
The issue is entirely reproducible on nodes with 48 vCPUs (r5n.24xlarge, 96 vCPUs with HT disabled) and 64 vCPUs (r5.16xlarge, HT not disabled), all running CentOS 7. It is less noticeable on smaller systems with 2-16 vCPUs.
A git bisect identified:
I've tried profiling with perf and gprof. Both indicate that OTP-22 and OTP-23 spend inordinate amounts of time in rwmutex operations and ethr_event_swait, and in yield and futex system calls. gprof also points to erts_atom_put_index. Neither reports data from the crypto code itself, presumably because it is a NIF loaded from a shared object (.so).
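For reference, the perf profiling was done roughly along these lines (a sketch only; the exact invocation, sampling duration, and BEAM process name may differ from what I actually ran):

```shell
# Attach perf to the running BEAM while the test case executes, then
# inspect which symbols dominate (rwmutex ops, ethr_event_swait, futex).
perf record -g -p "$(pgrep -o beam.smp)" -- sleep 30
perf report --sort symbol
```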