I have been considering the distributed application mechanism for a project, but it now seems very unlikely that we will use it. I bet we are not alone, since it has a major flaw that prevents adoption in cloud-based setups: it does not recover at all from network splits.
This is, in fact, well-known behavior. See:
"If you deem netsplits more likely than hardware failures, then you have to be aware of the possibility that the application is running both as a backup and main one, and that funny things could happen when the network issue is resolved. Maybe distributed OTP applications aren't the right mechanism for you in these cases."
I have run some tests, and that is precisely what happens. Once the network is restored, the takeover does not take place - unlike the case where a node goes down and is then started again.
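For readers unfamiliar with the mechanism, a setup of this kind is typically configured through the kernel application's parameters. The fragment below is only an illustrative sketch, not the configuration used in my tests: the application name myapp, the node names and the timeouts are all made-up placeholders.

```erlang
%% sys.config for the preferred node (all names and values are hypothetical).
[{kernel,
  [%% myapp runs on a@host1 when it is up; otherwise it fails over to
   %% b@host2 or c@host3 (equal priority). 5000 ms is how long the other
   %% nodes wait for a@host1 to come back before restarting the app.
   {distributed, [{myapp, 5000, ['a@host1', {'b@host2', 'c@host3'}]}]},
   %% Nodes that must be reachable before the distributed applications
   %% are started on this node.
   {sync_nodes_mandatory, ['b@host2', 'c@host3']},
   %% How long (ms) to wait for those nodes at boot.
   {sync_nodes_timeout, 30000}]}].
```

With a configuration like this, stopping and restarting a@host1 leads to a failover and then a takeover. A netsplit between the same nodes is the case that does not recover.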
While netsplits may be rare when you run a cluster on co-located servers in the same data center, they are more likely in a cloud deployment. Considering how popular cloud deployments are these days, this makes Erlang's built-in mechanism for distributed applications unsuitable for such scenarios.
I propose that either the existing dist_ac module be updated, or a second version of such a controller be written, so that people can set up fairly simple clusters of Erlang nodes with failover and takeover that also recover from network disconnections.
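To make the missing piece a bit more concrete, here is a rough sketch of the detection half only. It is not how dist_ac works and not a proposed patch. It assumes the application's top supervisor is registered locally under a known name (myapp_sup below is a hypothetical placeholder), and it only logs the double-running situation after a reconnect. Deciding which node should give way is deliberately left out.

```erlang
-module(split_watch).
-behaviour(gen_server).

-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% SupName is the locally registered name of the application's top
%% supervisor, e.g. myapp_sup (hypothetical; adjust to your application).
start_link(SupName) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, SupName, []).

init(SupName) ->
    %% Subscribe to {nodeup, Node} / {nodedown, Node} messages.
    ok = net_kernel:monitor_nodes(true),
    {ok, SupName}.

handle_call(_Request, _From, SupName) ->
    {reply, ok, SupName}.

handle_cast(_Msg, SupName) ->
    {noreply, SupName}.

%% A node has (re)connected: check whether the supervisor is alive on
%% both sides, which is the situation left behind by a netsplit.
handle_info({nodeup, Node}, SupName) ->
    case {alive_on(node(), SupName), alive_on(Node, SupName)} of
        {true, true} ->
            logger:warning("~p is running on both ~p and ~p",
                           [SupName, node(), Node]);
        _ ->
            ok
    end,
    {noreply, SupName};
handle_info(_Other, SupName) ->
    {noreply, SupName}.

%% True if a process is registered as SupName on Node.
alive_on(Node, SupName) ->
    case rpc:call(Node, erlang, whereis, [SupName], 5000) of
        Pid when is_pid(Pid) -> true;
        _ -> false   % undefined, or {badrpc, Reason}
    end.
```

Started on every node, for example with split_watch:start_link(myapp_sup), this would at least make the condition visible. A real controller would also need a policy for which instance keeps running.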