Skip to content

Optimize LatencyUtils: concurrent, autoboxing#2385

Open
Rothes wants to merge 6 commits into
GrimAnticheat:2.0from
Rothes:concurrent-opt
Open

Optimize LatencyUtils: concurrent, autoboxing#2385
Rothes wants to merge 6 commits into
GrimAnticheat:2.0from
Rothes:concurrent-opt

Conversation

@Rothes
Copy link
Copy Markdown
Contributor

@Rothes Rothes commented Dec 8, 2025

ConcurrentLinkedQueue is particularly suitable for this scenario. Changes to the queue are atomic, so it's thread-safe.

This change has not actually been tested, but should be much faster than the synchronized blocks, especially when add the tasks.

Additionally, Changed Pair to record class. Using Pair will autoboxing the ints, it's an unnecessary cost.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Dec 8, 2025

CLA assistant check
All committers have signed the CLA.

@AoElite
Copy link
Copy Markdown
Contributor

AoElite commented Dec 8, 2025

@Rothes Ideally this should only be running on the netty thread. There are only a few cases where it isn’t, such as piston events. If we handled those cases properly and ran them on the netty thread, we wouldn’t need any concurrent lists or synchronized blocks.

@Rothes
Copy link
Copy Markdown
Contributor Author

Rothes commented Dec 8, 2025

@Rothes Ideally this should only be running on the netty thread. There are only a few cases where it isn’t, such as piston events. If we handled those cases properly and ran them on the netty thread, we wouldn’t need any concurrent lists or synchronized blocks.

So I guess it's better to seperate another concurrent queue(still need it on folia, or checkTickThread with bukkit api) on async method, and do both queue on handleNettySyncTransaction() ?

@Axionize
Copy link
Copy Markdown
Contributor

Axionize commented Dec 9, 2025

ConcurrentLinkedQueue is particularly suitable for this scenario. Changes to the queue are atomic, so it's thread-safe.

This change has not actually been tested, but should be much faster than the synchronized blocks, especially when add the tasks.

Additionally, Changed Pair to record class. Using Pair will autoboxing the ints, it's an unnecessary cost.

ConcurrentLinkedQueue is weakly consistent. It is theoretically possible (though incredibly rare in practice) for this to cause ordering issues. We ideally want a strongly consistent solution that fully respects hard execution order. This is something I want to avoid similar to our usage of a Long2ObjectHashSet we use for storing a player's loaded chunks, which we access on the main thread for piston events for reads which can cause issues like this very rarely and also be quietly incorrect.

#2031
#2337
#2214

The autoboxing changes are a fine microoptization however and I'll merge them.

If you want me to merge a revamp of LatencyUtils I would be open to though, if you can write a packet based Piston implementation (so it no longer needs to be thread safe) or with a faster strongly consistent solution.

@Rothes
Copy link
Copy Markdown
Contributor Author

Rothes commented Dec 9, 2025

Fastutil does not support concurrent, we sill need to use synchronized for it.
I do tried to use a ConcurrentHashMap to avoid iterator over all the queue,

private final ConcurrentHashMap<Integer, ConcurrentLinkedQueue<QueuedTask>> transactionMap2 = new ConcurrentHashMap<>();

            var queue = transactionMap2.computeIfAbsent(transaction, __ -> new ConcurrentLinkedQueue<>());
            queue.add(new QueuedTask(transaction, runnable));

But it turns out that it's slow than just iterator all over the ConcurrentLinkedQueue.

@Rothes
Copy link
Copy Markdown
Contributor Author

Rothes commented Dec 14, 2025

Based on my testing in the production environment, this new structure no longer produces additional CPU usage. It may be further optimized by packet based piston movements—but it's good enough already.

@Axionize
Copy link
Copy Markdown
Contributor

Axionize commented Dec 23, 2025

The current state of this PR requires significant work to become mergeable.

  • Your queue implementation is slower than ArrayDeque and has bad cache locality. Its also incorrect.
    • The ideal data structure for this issue is a MPSC queue with a strongly consistent iterator.
  • Your are executing transactions and transactionsAsync one after another, meaning interleaving is impossible which breaks execution ordering of listeners for transactions.

I only briefly looked at the changes, and I suspect there are more issues in addition. If you're certain on continuing and finishing this I welcome you to keep trying but would recommend you take on a more low hanging fruit optimization issue first such as continuing my work in and finishing #2312

Again, I'll also merge the autoboxing changes if you want to make those a seperate PR.

@Rothes
Copy link
Copy Markdown
Contributor Author

Rothes commented Dec 23, 2025

The current state of this PR requires significant work to become mergeable.

  • Your queue implementation is slower than ArrayDeque and has bad cache locality. Its also incorrect.

    • The ideal data structure for this issue is a MPSC queue with a strongly consistent iterator.
  • Your are executing transactions and transactionsAsync one after another, meaning interleaving is impossible which breaks execution ordering of listeners for transactions.

I only briefly looked at the changes, and I suspect there are more issues in addition. If you're certain on continuing and finishing this I welcome you to keep trying but would recommend you take on a more low hanging fruit optimization issue first such as #2312

Again, I'll also merge the autoboxing changes if you want to make those a seperate PR.

Removing any element in ArrayDeque moves all left elements to right, unless it's inserted in order. It can not be slower any more.
This list is basically a normal java.util.LinkedList, same impl, just no Index checks in iterator, and limited loop size.

I don't think it's necessary to keep the order of async and netty transactions, if so, would blocking netty thread for a while and server tick thread still running, breaks the plugin? The order of this two is never certain, and I'm really not a fan of making everything "right"; I want everything to be "fit" and efficient. I don't like to add any uncessary concurrent cost on netty thread.

This patch works perfectly fine in my production environment. If you insist on getting the most perfect implementation directly, I would close this PR.

@Axionize
Copy link
Copy Markdown
Contributor

Did you benchmark the custom queue to make sure it was really faster than ArrayDeque? In most cases we are removing from the head, not the middle. Since ArrayDeque is implemented as a ring buffer, removal from the head is O(1). When you factor in the additional GC pressure from the Node allocations in a Linked List, I'd be very surprised if it was actually faster.

I did some experimenting a long time ago when I was optimizing block update runnables with changing the LatencyUtil impl as well. You can find my old experiments and JMH code here https://github.com/Axionize/LightningGrim/tree/optimize/block-updates-and-transaction-runnables if you would like to test.

I don't think it's necessary to keep the order of async and netty transactions, if so, would blocking netty thread for a while and server tick thread still running, breaks the plugin? The order of this two is never certain, and I'm really not a fan of making everything "right"; I want everything to be "fit" and efficient. I don't like to add any uncessary concurrent cost on netty thread.

It sounds ridiculous, right? I was just as astonished when I discovered this when I started working on Grim. But yes, in the current implementation, if you block the netty or main thread, updates will be applied out of order and can cause falses.

Since we only use the async tasks for handling pistons its fine, this is fairly rare. For best chance of reproduction you gotta be colliding with pistons (especially with slime or honey blocks) and other blocks. While I've wanted to replace it for a long time, its one of those things that's really difficult to do and the current solution is "good enough" (people don't get stuck on slimeblock launchers, and it's not like PVP maps are made of moving slime block platforms... although now that I think about it, that sounds fun).

The issue is that while the current code is technically incorrect (race condition), your proposed changes are "more incorrect" because they ensure updates will always be applied out of order in those scenarios.

This patch works perfectly fine in my production environment. If you insist on getting the most perfect implementation directly, I would close this PR.

I'm sorry if I've come across as combative. I don't need the "most perfect" implementation, but I do need to ensure there are no correctness regressions.

I have plenty of times pushed out an update I've tested that I thought was falseless, only to have issues start trickling in as soon as it hit the tens of thousands of servers that run Grim. May I ask what kind of server you've run this on in prod?

If you want, I can test out some variations of this in LightningGrim, similar to my block update runnable optimizations. If they prove stable, I can merge them upstream. I usually route all our experimental changes that way, as it has a fairly large user base for reporting issues who understand that it's experimental.

I understand this can feel frustrating for what seems like such a simple PR, but when you touch something as core as LatencyUtils and change execution order we need to ensure its correct and that takes a lot of data from tons of servers in production and time.

I started out in a very similar position with my first PR #1679 but with @ManInMyVan 's help who gave me what felt like combative but ultimately beneficial review help I spun it into smaller key PRs and improved it greatly.

#1725
#1957
#1732

@Rothes
Copy link
Copy Markdown
Contributor Author

Rothes commented Dec 24, 2025

All my changes are based on performance analysis tools in my production environment. This is a Folia server with over 50 players and many farms/mob farms.

I started the change when I occasionally noticed the synchronized block took ~4% of my CPU on netty thread. This race isn't even a persistence issue – I frequently do performance analysis to uncover potential problems on the server. I have reason to believe this is caused by a subset of players (most likely using pistons). Only a small number of players can cause so much overhead. What if more players were in this kind of extreme environment?

Like my commit time, I used ArrayDeque and runs it on my production environment. This yielded a stable 3% occupancy, it was moving elements. The problem is that some transaction is added in player.lastTransactionSent, some are lastTransactionSent + 1. It's not all at head in order. So we have the cost.

World changes on the tick thread are not updated just in time. They are deferred until a certain time before being sent to the player from the queue. This sync system is made in magic and lacks reliability already.

Finally, this PR became what it is now. I've observed it for a long time and no longer see the queue processing causing a significant overhead. It's just 0.0%.

I understand the reliability concerns, but I also have performance concerns. Grim is such a low-level plugin that core performance is extremely important. Perhaps in the first place, asynchronous transactions should not appear, but rather packets should have been used instead.

@Rothes
Copy link
Copy Markdown
Contributor Author

Rothes commented Dec 24, 2025

I need to clarify – because world changes on the tick thread are not in time, in the current live build, piston changes are always either ahead of time or sync between all transactions.

However, the modified queue impl may be ahead, or delayed, or sync.

Regardless of the modification, this prediction is inaccurate. This PR is unlikely to change or amplify the problem, because inconsistency already exists... Unless we move to packet based pistion movements, but I have no interest in doing it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants