In my previous post I said how wonderful FP37434 (the replication stabilisation FP) is. Unfortunately, it did not solve our problem, and we now have a large volume of content to reverse replicate (~50k nodes in /var/replication/outbox across all our publish servers).
We are currently facing two problems. First, when the reverse replication (RR) agent polls, the publish server with FP37434 exhibits a huge native memory leak (approx 8GB of native memory is claimed), which causes a great deal of paging on the system.
Second, when we batch this down to only 10 items in the outbox, the author takes 30 minutes to process those 10 nodes.
Adding extra logging (com.day.cq.replication.content.durbo) at DEBUG level shows that the author is doing valid work for the whole 30 minutes while processing just 10 nodes from the outbox.
It turns out that when a node is added to /content/usergenerated/path/to/something, CQ appears to pack all of the pre-existing sibling nodes into the newly created node under /var/replication/outbox. You can see this by inspecting the nodes inside the outbox. This is why 10 nodes take 30 minutes for the author to process: it's actually unpacking 10000 nodes.
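To put rough numbers on that fan-out, here is a minimal Python sketch. It does not talk to CQ at all: the nested dicts just stand in for an outbox tree, the function names are made up for illustration, and the figure of 1000 siblings per item is only what the 10 items / 10000 nodes observation implies on average.

```python
from typing import Any, Dict

# Assumption: a node is modelled as a dict mapping child name -> child node.
# This is a stand-in for the real JCR tree, not CQ's actual data model.
Node = Dict[str, Any]


def count_descendants(node: Node) -> int:
    """Count every node below `node`, not counting `node` itself."""
    return sum(1 + count_descendants(child) for child in node.values())


def outbox_fanout(outbox: Node) -> Dict[str, int]:
    """Map each top-level outbox item to the number of nodes packed under it."""
    return {name: count_descendants(item) for name, item in outbox.items()}


def build_example_outbox(items: int, siblings_per_item: int) -> Node:
    """Build a fake outbox where every item drags along its siblings."""
    return {
        f"item{i}": {f"sibling{j}": {} for j in range(siblings_per_item)}
        for i in range(items)
    }


# Hypothetical figures: 10 outbox items, roughly 1000 bundled siblings each.
outbox = build_example_outbox(items=10, siblings_per_item=1000)
fanout = outbox_fanout(outbox)
total = sum(fanout.values())

# 30 minutes spread over everything that actually has to be unpacked.
seconds_per_node = (30 * 60) / total

print(f"{len(fanout)} outbox items carry {total} nodes in total")
print(f"that is about {seconds_per_node:.2f}s per unpacked node")
```

Seen this way, the author is not spending 3 minutes per node; it is spending a fraction of a second on each of the many nodes hidden inside every outbox item.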
This probably also explains why our CQ author is performing slowly.
Hopefully, I will remember to post the solution here when we get to it ... :-)