Optimize LatencyUtils: concurrent, autoboxing by Rothes · Pull Request #2385 · GrimAnticheat/Grim

Rothes · 2025-12-08T09:38:43Z

ConcurrentLinkedQueue is particularly suitable for this scenario. Changes to the queue are atomic, so it's thread-safe.

This change has not actually been tested, but should be much faster than the synchronized blocks, especially when add the tasks.

Additionally, Changed Pair to record class. Using Pair will autoboxing the ints, it's an unnecessary cost.

CLAassistant · 2025-12-08T09:38:52Z

All committers have signed the CLA.

AoElite · 2025-12-08T12:12:36Z

@Rothes Ideally this should only be running on the netty thread. There are only a few cases where it isn’t, such as piston events. If we handled those cases properly and ran them on the netty thread, we wouldn’t need any concurrent lists or synchronized blocks.

Rothes · 2025-12-08T12:20:48Z

@Rothes Ideally this should only be running on the netty thread. There are only a few cases where it isn’t, such as piston events. If we handled those cases properly and ran them on the netty thread, we wouldn’t need any concurrent lists or synchronized blocks.

So I guess it's better to seperate another concurrent queue(still need it on folia, or checkTickThread with bukkit api) on async method, and do both queue on handleNettySyncTransaction() ?

Axionize · 2025-12-09T02:32:12Z

ConcurrentLinkedQueue is particularly suitable for this scenario. Changes to the queue are atomic, so it's thread-safe.

This change has not actually been tested, but should be much faster than the synchronized blocks, especially when add the tasks.

Additionally, Changed Pair to record class. Using Pair will autoboxing the ints, it's an unnecessary cost.

ConcurrentLinkedQueue is weakly consistent. It is theoretically possible (though incredibly rare in practice) for this to cause ordering issues. We ideally want a strongly consistent solution that fully respects hard execution order. This is something I want to avoid similar to our usage of a Long2ObjectHashSet we use for storing a player's loaded chunks, which we access on the main thread for piston events for reads which can cause issues like this very rarely and also be quietly incorrect.

#2031
#2337
#2214

The autoboxing changes are a fine microoptization however and I'll merge them.

If you want me to merge a revamp of LatencyUtils I would be open to though, if you can write a packet based Piston implementation (so it no longer needs to be thread safe) or with a faster strongly consistent solution.

Rothes · 2025-12-09T02:53:44Z

Fastutil does not support concurrent, we sill need to use synchronized for it.
I do tried to use a ConcurrentHashMap to avoid iterator over all the queue,

private final ConcurrentHashMap<Integer, ConcurrentLinkedQueue<QueuedTask>> transactionMap2 = new ConcurrentHashMap<>();

            var queue = transactionMap2.computeIfAbsent(transaction, __ -> new ConcurrentLinkedQueue<>());
            queue.add(new QueuedTask(transaction, runnable));

But it turns out that it's slow than just iterator all over the ConcurrentLinkedQueue.

Rothes · 2025-12-14T11:37:54Z

Based on my testing in the production environment, this new structure no longer produces additional CPU usage. It may be further optimized by packet based piston movements—but it's good enough already.

Axionize · 2025-12-23T14:21:28Z

The current state of this PR requires significant work to become mergeable.

Your queue implementation is slower than ArrayDeque and has bad cache locality. Its also incorrect.
- The ideal data structure for this issue is a MPSC queue with a strongly consistent iterator.
Your are executing transactions and transactionsAsync one after another, meaning interleaving is impossible which breaks execution ordering of listeners for transactions.

I only briefly looked at the changes, and I suspect there are more issues in addition. If you're certain on continuing and finishing this I welcome you to keep trying but would recommend you take on a more low hanging fruit optimization issue first such as continuing my work in and finishing #2312

Again, I'll also merge the autoboxing changes if you want to make those a seperate PR.

Rothes · 2025-12-23T14:34:10Z

The current state of this PR requires significant work to become mergeable.

Your queue implementation is slower than ArrayDeque and has bad cache locality. Its also incorrect.

The ideal data structure for this issue is a MPSC queue with a strongly consistent iterator.

Your are executing transactions and transactionsAsync one after another, meaning interleaving is impossible which breaks execution ordering of listeners for transactions.

I only briefly looked at the changes, and I suspect there are more issues in addition. If you're certain on continuing and finishing this I welcome you to keep trying but would recommend you take on a more low hanging fruit optimization issue first such as #2312

Again, I'll also merge the autoboxing changes if you want to make those a seperate PR.

Removing any element in ArrayDeque moves all left elements to right, unless it's inserted in order. It can not be slower any more.
This list is basically a normal java.util.LinkedList, same impl, just no Index checks in iterator, and limited loop size.

I don't think it's necessary to keep the order of async and netty transactions, if so, would blocking netty thread for a while and server tick thread still running, breaks the plugin? The order of this two is never certain, and I'm really not a fan of making everything "right"; I want everything to be "fit" and efficient. I don't like to add any uncessary concurrent cost on netty thread.

This patch works perfectly fine in my production environment. If you insist on getting the most perfect implementation directly, I would close this PR.

Axionize · 2025-12-24T06:49:32Z

Did you benchmark the custom queue to make sure it was really faster than ArrayDeque? In most cases we are removing from the head, not the middle. Since ArrayDeque is implemented as a ring buffer, removal from the head is O(1). When you factor in the additional GC pressure from the Node allocations in a Linked List, I'd be very surprised if it was actually faster.

I did some experimenting a long time ago when I was optimizing block update runnables with changing the LatencyUtil impl as well. You can find my old experiments and JMH code here https://github.com/Axionize/LightningGrim/tree/optimize/block-updates-and-transaction-runnables if you would like to test.

I don't think it's necessary to keep the order of async and netty transactions, if so, would blocking netty thread for a while and server tick thread still running, breaks the plugin? The order of this two is never certain, and I'm really not a fan of making everything "right"; I want everything to be "fit" and efficient. I don't like to add any uncessary concurrent cost on netty thread.

It sounds ridiculous, right? I was just as astonished when I discovered this when I started working on Grim. But yes, in the current implementation, if you block the netty or main thread, updates will be applied out of order and can cause falses.

Since we only use the async tasks for handling pistons its fine, this is fairly rare. For best chance of reproduction you gotta be colliding with pistons (especially with slime or honey blocks) and other blocks. While I've wanted to replace it for a long time, its one of those things that's really difficult to do and the current solution is "good enough" (people don't get stuck on slimeblock launchers, and it's not like PVP maps are made of moving slime block platforms... although now that I think about it, that sounds fun).

The issue is that while the current code is technically incorrect (race condition), your proposed changes are "more incorrect" because they ensure updates will always be applied out of order in those scenarios.

This patch works perfectly fine in my production environment. If you insist on getting the most perfect implementation directly, I would close this PR.

I'm sorry if I've come across as combative. I don't need the "most perfect" implementation, but I do need to ensure there are no correctness regressions.

I have plenty of times pushed out an update I've tested that I thought was falseless, only to have issues start trickling in as soon as it hit the tens of thousands of servers that run Grim. May I ask what kind of server you've run this on in prod?

If you want, I can test out some variations of this in LightningGrim, similar to my block update runnable optimizations. If they prove stable, I can merge them upstream. I usually route all our experimental changes that way, as it has a fairly large user base for reporting issues who understand that it's experimental.

I understand this can feel frustrating for what seems like such a simple PR, but when you touch something as core as LatencyUtils and change execution order we need to ensure its correct and that takes a lot of data from tons of servers in production and time.

I started out in a very similar position with my first PR #1679 but with @ManInMyVan 's help who gave me what felt like combative but ultimately beneficial review help I spun it into smaller key PRs and improved it greatly.

#1725
#1957
#1732

Rothes · 2025-12-24T07:43:01Z

All my changes are based on performance analysis tools in my production environment. This is a Folia server with over 50 players and many farms/mob farms.

I started the change when I occasionally noticed the synchronized block took ~4% of my CPU on netty thread. This race isn't even a persistence issue – I frequently do performance analysis to uncover potential problems on the server. I have reason to believe this is caused by a subset of players (most likely using pistons). Only a small number of players can cause so much overhead. What if more players were in this kind of extreme environment?

Like my commit time, I used ArrayDeque and runs it on my production environment. This yielded a stable 3% occupancy, it was moving elements. The problem is that some transaction is added in player.lastTransactionSent, some are lastTransactionSent + 1. It's not all at head in order. So we have the cost.

World changes on the tick thread are not updated just in time. They are deferred until a certain time before being sent to the player from the queue. This sync system is made in magic and lacks reliability already.

Finally, this PR became what it is now. I've observed it for a long time and no longer see the queue processing causing a significant overhead. It's just 0.0%.

I understand the reliability concerns, but I also have performance concerns. Grim is such a low-level plugin that core performance is extremely important. Perhaps in the first place, asynchronous transactions should not appear, but rather packets should have been used instead.

Rothes · 2025-12-24T09:59:09Z

I need to clarify – because world changes on the tick thread are not in time, in the current live build, piston changes are always either ahead of time or sync between all transactions.

However, the modified queue impl may be ahead, or delayed, or sync.

Regardless of the modification, this prediction is inaccurate. This PR is unlikely to change or amplify the problem, because inconsistency already exists... Unless we move to packet based pistion movements, but I have no interest in doing it.

Rothes added 2 commits December 8, 2025 17:25

Use ConcurrentLinkedQueue for concurrent

62cc39d

Avoid int boxing-unboxing

df498be

Rothes added 2 commits December 8, 2025 17:46

Remove unused comment

99852f0

Fix passing the task

f27f73b

Separate sync and async tasks queue

e3ab49c

Replace ArrayDeque with own queue impl

5a4c9fe

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize LatencyUtils: concurrent, autoboxing#2385

Optimize LatencyUtils: concurrent, autoboxing#2385
Rothes wants to merge 6 commits into
GrimAnticheat:2.0from
Rothes:concurrent-opt

Rothes commented Dec 8, 2025

Uh oh!

CLAassistant commented Dec 8, 2025 •

edited

Loading

Uh oh!

AoElite commented Dec 8, 2025

Uh oh!

Rothes commented Dec 8, 2025

Uh oh!

Axionize commented Dec 9, 2025

Uh oh!

Rothes commented Dec 9, 2025

Uh oh!

Rothes commented Dec 14, 2025

Uh oh!

Axionize commented Dec 23, 2025 •

edited

Loading

Uh oh!

Rothes commented Dec 23, 2025 •

edited

Loading

Uh oh!

Axionize commented Dec 24, 2025

Uh oh!

Rothes commented Dec 24, 2025

Uh oh!

Rothes commented Dec 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

Rothes commented Dec 8, 2025

Uh oh!

CLAassistant commented Dec 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AoElite commented Dec 8, 2025

Uh oh!

Rothes commented Dec 8, 2025

Uh oh!

Axionize commented Dec 9, 2025

Uh oh!

Rothes commented Dec 9, 2025

Uh oh!

Rothes commented Dec 14, 2025

Uh oh!

Axionize commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Rothes commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Axionize commented Dec 24, 2025

Uh oh!

Rothes commented Dec 24, 2025

Uh oh!

Rothes commented Dec 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

CLAassistant commented Dec 8, 2025 •

edited

Loading

Axionize commented Dec 23, 2025 •

edited

Loading

Rothes commented Dec 23, 2025 •

edited

Loading