Optimize LatencyUtils: concurrent, autoboxing#2385
Conversation
|
@Rothes Ideally this should only be running on the netty thread. There are only a few cases where it isn’t, such as piston events. If we handled those cases properly and ran them on the netty thread, we wouldn’t need any concurrent lists or synchronized blocks. |
So I guess it's better to seperate another concurrent queue(still need it on folia, or checkTickThread with bukkit api) on async method, and do both queue on handleNettySyncTransaction() ? |
ConcurrentLinkedQueue is weakly consistent. It is theoretically possible (though incredibly rare in practice) for this to cause ordering issues. We ideally want a strongly consistent solution that fully respects hard execution order. This is something I want to avoid similar to our usage of a Long2ObjectHashSet we use for storing a player's loaded chunks, which we access on the main thread for piston events for reads which can cause issues like this very rarely and also be quietly incorrect. The autoboxing changes are a fine microoptization however and I'll merge them. If you want me to merge a revamp of LatencyUtils I would be open to though, if you can write a packet based Piston implementation (so it no longer needs to be thread safe) or with a faster strongly consistent solution. |
|
Fastutil does not support concurrent, we sill need to use synchronized for it. But it turns out that it's slow than just iterator all over the ConcurrentLinkedQueue. |
|
Based on my testing in the production environment, this new structure no longer produces additional CPU usage. It may be further optimized by packet based piston movements—but it's good enough already. |
|
The current state of this PR requires significant work to become mergeable.
I only briefly looked at the changes, and I suspect there are more issues in addition. If you're certain on continuing and finishing this I welcome you to keep trying but would recommend you take on a more low hanging fruit optimization issue first such as continuing my work in and finishing #2312 Again, I'll also merge the autoboxing changes if you want to make those a seperate PR. |
Removing any element in ArrayDeque moves all left elements to right, unless it's inserted in order. It can not be slower any more. I don't think it's necessary to keep the order of async and netty transactions, if so, would blocking netty thread for a while and server tick thread still running, breaks the plugin? The order of this two is never certain, and I'm really not a fan of making everything "right"; I want everything to be "fit" and efficient. I don't like to add any uncessary concurrent cost on netty thread. This patch works perfectly fine in my production environment. If you insist on getting the most perfect implementation directly, I would close this PR. |
|
Did you benchmark the custom queue to make sure it was really faster than ArrayDeque? In most cases we are removing from the head, not the middle. Since ArrayDeque is implemented as a ring buffer, removal from the head is O(1). When you factor in the additional GC pressure from the Node allocations in a Linked List, I'd be very surprised if it was actually faster. I did some experimenting a long time ago when I was optimizing block update runnables with changing the LatencyUtil impl as well. You can find my old experiments and JMH code here https://github.com/Axionize/LightningGrim/tree/optimize/block-updates-and-transaction-runnables if you would like to test.
It sounds ridiculous, right? I was just as astonished when I discovered this when I started working on Grim. But yes, in the current implementation, if you block the netty or main thread, updates will be applied out of order and can cause falses. Since we only use the async tasks for handling pistons its fine, this is fairly rare. For best chance of reproduction you gotta be colliding with pistons (especially with slime or honey blocks) and other blocks. While I've wanted to replace it for a long time, its one of those things that's really difficult to do and the current solution is "good enough" (people don't get stuck on slimeblock launchers, and it's not like PVP maps are made of moving slime block platforms... although now that I think about it, that sounds fun). The issue is that while the current code is technically incorrect (race condition), your proposed changes are "more incorrect" because they ensure updates will always be applied out of order in those scenarios.
I'm sorry if I've come across as combative. I don't need the "most perfect" implementation, but I do need to ensure there are no correctness regressions. I have plenty of times pushed out an update I've tested that I thought was falseless, only to have issues start trickling in as soon as it hit the tens of thousands of servers that run Grim. May I ask what kind of server you've run this on in prod? If you want, I can test out some variations of this in LightningGrim, similar to my block update runnable optimizations. If they prove stable, I can merge them upstream. I usually route all our experimental changes that way, as it has a fairly large user base for reporting issues who understand that it's experimental. I understand this can feel frustrating for what seems like such a simple PR, but when you touch something as core as LatencyUtils and change execution order we need to ensure its correct and that takes a lot of data from tons of servers in production and time. I started out in a very similar position with my first PR #1679 but with @ManInMyVan 's help who gave me what felt like combative but ultimately beneficial review help I spun it into smaller key PRs and improved it greatly. |
|
All my changes are based on performance analysis tools in my production environment. This is a Folia server with over 50 players and many farms/mob farms. I started the change when I occasionally noticed the synchronized block took ~4% of my CPU on netty thread. This race isn't even a persistence issue – I frequently do performance analysis to uncover potential problems on the server. I have reason to believe this is caused by a subset of players (most likely using pistons). Only a small number of players can cause so much overhead. What if more players were in this kind of extreme environment? Like my commit time, I used ArrayDeque and runs it on my production environment. This yielded a stable 3% occupancy, it was moving elements. The problem is that some transaction is added in player.lastTransactionSent, some are lastTransactionSent + 1. It's not all at head in order. So we have the cost. World changes on the tick thread are not updated just in time. They are deferred until a certain time before being sent to the player from the queue. This sync system is made in magic and lacks reliability already. Finally, this PR became what it is now. I've observed it for a long time and no longer see the queue processing causing a significant overhead. It's just 0.0%. I understand the reliability concerns, but I also have performance concerns. Grim is such a low-level plugin that core performance is extremely important. Perhaps in the first place, asynchronous transactions should not appear, but rather packets should have been used instead. |
|
I need to clarify – because world changes on the tick thread are not in time, in the current live build, piston changes are always either ahead of time or sync between all transactions. However, the modified queue impl may be ahead, or delayed, or sync. Regardless of the modification, this prediction is inaccurate. This PR is unlikely to change or amplify the problem, because inconsistency already exists... Unless we move to packet based pistion movements, but I have no interest in doing it. |
ConcurrentLinkedQueue is particularly suitable for this scenario. Changes to the queue are atomic, so it's thread-safe.
This change has not actually been tested, but should be much faster than the synchronized blocks, especially when add the tasks.
Additionally, Changed Pair to record class. Using Pair will autoboxing the ints, it's an unnecessary cost.