Skip to content

Sketch: post-persistent MonitorEvents codebase#4640

Draft
valentinewallace wants to merge 47 commits into
lightningdevkit:mainfrom
valentinewallace:2026-05-persistent-mon-evs-cleanup
Draft

Sketch: post-persistent MonitorEvents codebase#4640
valentinewallace wants to merge 47 commits into
lightningdevkit:mainfrom
valentinewallace:2026-05-persistent-mon-evs-cleanup

Conversation

@valentinewallace
Copy link
Copy Markdown
Contributor

@valentinewallace valentinewallace commented May 27, 2026

In the last commit, which is based on #4491, #4545, #4524, #4604, and #4638, I sketched out deleting all the structs/codepaths we wouldn't need anymore if we eventually switched to the persistent monitor events codepaths that are added in the aforementioned PRs. It doesn't account for backwards compat or anything, but I thought it could be useful when evaluating the overall persistent monitor events project. It should compile/pass tests.

Helps when debugging to know which variants failed.
Currently, the resolution of HTLCs (and decisions on when HTLCs can be
forwarded) is the responsibility of Channel objects (a part of ChannelManager)
until the channel is closed, and then the ChannelMonitor thereafter. This leads
to some complexity around race conditions for HTLCs right around channel
closure. Additionally, there is lots of complexity reconstructing the state of
all HTLCs in the ChannelManager deserialization/loading logic.

Instead, we want to do all resolution in ChannelMonitors (in response to
ChannelMonitorUpdates) and pass them back to ChannelManager in the form of
MonitorEvents (similar to how HTLCs are resolved after channels are closed). In
order to have reliable resolution, we'll need to keep MonitorEvents around in
the ChannelMonitor until the ChannelManager has finished processing them.  This
simplify things - on restart instead of examining the set of HTLCs in monitors
we can simply replay all the pending MonitorEvents.

As a first step towards this, here we persist a flag in the ChannelManager and
ChannelMonitors indicating whether this new feature is enabled. It will be used
in upcoming commits to maintain compatibility and create an upgrade/downgrade
path between LDK versions.
Cleans up the next commit
This field will be deprecated in upcoming commits when we start persisting
MonitorEvent ids alongside the MonitorEvents.
Currently, the resolution of HTLCs (and decisions on when HTLCs can be
forwarded) is the responsibility of Channel objects (a part of ChannelManager)
until the channel is closed, and then the ChannelMonitor thereafter. This leads
to some complexity around race conditions for HTLCs right around channel
closure. Additionally, there is lots of complexity reconstructing the state of
all HTLCs in the ChannelManager deserialization/loading logic.

Instead, we want to do all resolution in ChannelMonitors (in response to
ChannelMonitorUpdates) and pass them back to ChannelManager in the form of
MonitorEvents (similar to how HTLCs are resolved after channels are closed). In
order to have reliable resolution, we'll need to keep MonitorEvents around in
the ChannelMonitor until the ChannelManager has finished processing them.  This
simplify things - on restart instead of examining the set of HTLCs in monitors
we can simply replay all the pending MonitorEvents.

Here we add an as-yet-unused API to chain::Watch to allow the ChannelManager to
tell the a ChannelMonitor that a MonitorEvent has been irrevocably processed
and can be deleted.
Currently, the resolution of HTLCs (and decisions on when HTLCs can be
forwarded) is the responsibility of Channel objects (a part of ChannelManager)
until the channel is closed, and then the ChannelMonitor thereafter. This leads
to some complexity around race conditions for HTLCs right around channel
closure. Additionally, there is lots of complexity reconstructing the state of
all HTLCs in the ChannelManager deserialization/loading logic.

Instead, we want to do all resolution in ChannelMonitors (in response to
ChannelMonitorUpdates) and pass them back to ChannelManager in the form of
MonitorEvents (similar to how HTLCs are resolved after channels are closed). In
order to have reliable resolution, we'll need to keep MonitorEvents around in
the ChannelMonitor until the ChannelManager has finished processing them.  This
simplify things - on restart instead of examining the set of HTLCs in monitors
we can simply replay all the pending MonitorEvents.

To allow the ChannelManager to ack specific monitor events once they are
resolved in upcoming commits, here we give each MonitorEvent a corresponding
unique id. It's implemented in such a way that we can delete legacy monitor
event code in the future when the new persistent monitor events flag is enabled
by default.
Currently, the resolution of HTLCs (and decisions on when HTLCs can be
forwarded) is the responsibility of Channel objects (a part of ChannelManager)
until the channel is closed, and then the ChannelMonitor thereafter. This leads
to some complexity around race conditions for HTLCs right around channel
closure. Additionally, there is lots of complexity reconstructing the state of
all HTLCs in the ChannelManager deserialization/loading logic.

Instead, we want to do all resolution in ChannelMonitors (in response to
ChannelMonitorUpdates) and pass them back to ChannelManager in the form of
MonitorEvents (similar to how HTLCs are resolved after channels are closed). In
order to have reliable resolution, we'll need to keep MonitorEvents around in
the ChannelMonitor until the ChannelManager has finished processing them. This
will simplify things - on restart instead of examining the set of HTLCs in
monitors we can simply replay all the pending MonitorEvents.

Here for the purposes of merging initial support for persistent monitor events,
we ack each immediately after it is received/handled by the ChannelManager,
which is equivalent to the behavior we had prior to monitor events becoming
persistent.

In upcoming work, we'll want to have much more special handling for HTLCUpdate
monitor events in particular -- e.g. for outbound payment claim events, we
should only ACK the monitor event when the PaymentSent event is processed,
until that point we want it to keep being provided back to us on startup.

All the other monitor events are trivial to ACK, since they don't need to be
re-processed on startup.
Currently, the resolution of HTLCs (and decisions on when HTLCs can be
forwarded) is the responsibility of Channel objects (a part of ChannelManager)
until the channel is closed, and then the ChannelMonitor thereafter. This leads
to some complexity around race conditions for HTLCs right around channel
closure. Additionally, there is lots of complexity reconstructing the state of
all HTLCs in the ChannelManager deserialization/loading logic.

Instead, we want to do all resolution in ChannelMonitors (in response to
ChannelMonitorUpdates) and pass them back to ChannelManager in the form of
MonitorEvents (similar to how HTLCs are resolved after channels are closed). In
order to have reliable resolution, we'll need to keep MonitorEvents around in
the ChannelMonitor until the ChannelManager has finished processing them. This
will simplify things - on restart instead of examining the set of HTLCs in
monitors we can simply replay all the pending MonitorEvents.

Here we complete work that was built on recent prior commits and actually start
re-providing monitor events on startup if they went un-acked during runtime.
This isn't actually supported in prod yet, so this new code will run randomly
in tests, to ensure we still support the old paths.
And log them in check_added_monitors if it fails.
Will be used in upcoming commits when generating MonitorEvents.
Processing MonitorEvent::HTLCEvent causes the ChannelManager to call
claim_funds_internal, but it will currently pass in None for the
user_channel_id parameter. In upcoming commits when we begin generating monitor
events for off-chain HTLC claims as well as onchain, we'll want to start using
an accurate value instead.
Used in an upcoming commit to insert a pending payment if it's missing on
startup and we need to re-claim.
Used in an upcoming commit to generate an
EventCompletionAction::AckMonitorEvent.
In upcoming commits, we'll be generating monitor events for off-chain claims as
well as on-chain. As a small prefactor, calculate the from_onchain value rather
than hardcoding it to true.
If ChannelManager::persistent_monitor_events is enabled, we may want to avoid
acking a monitor event until after an Event is processed by the user. In
upcoming commits, we'll use this to ensure a MonitorEvent::HTLCEvent will keep
being re-provided back to us until after an Event::PaymentSent is processed.
Currently, the resolution of HTLCs (and decisions on when HTLCs can be
forwarded) is the responsibility of Channel objects (a part of ChannelManager)
until the channel is closed, and then the ChannelMonitor thereafter. This leads
to some complexity around race conditions for HTLCs right around channel
closure. Additionally, there is lots of complexity reconstructing the state of
all HTLCs in the ChannelManager deserialization/loading logic.

Instead, we want to do all resolution in ChannelMonitors (in response to
ChannelMonitorUpdates) and pass them back to ChannelManager in the form of
MonitorEvents (similar to how HTLCs are resolved after channels are closed). In
order to have reliable resolution, we'll need to keep MonitorEvents around in
the ChannelMonitor until the ChannelManager has finished processing them. This
will simplify things - on restart instead of examining the set of HTLCs in
monitors we can simply replay all the pending MonitorEvents.

In recent work, we added support for keeping monitor events around until they
are explicitly acked by the ChannelManager, but would always ack monitor events
immediately, which preserved the previous behavior and didn't break any tests.

Up until this point, we only generated HTLC monitor events when a payment was
claimed/failed on-chain.

In this commit, we start generating persistent monitor events whenever a payment
is claimed *off*-chain, specifically when new latest holder commitment data is
provided to the monitor.

For the purpose of making incremental progress on this feature, these events
will be a no-op and/or continue to be acked immediately except in the narrow
case of an off-chain outbound payment claim. HTLC forward claim monitor events
will be a no-op, and on-chain outbound payment claim events continue to be
acked immediately.

Off-chain outbound payment claims, however, now have monitor events generated
for them that will not be acked by the ChannelManager until the PaymentSent
event is processed by the user. This also allows us to stop blocking the RAA
monitor update that removes the preimage, because the purpose of that behavior
was to ensure the user got a PaymentSent event and the monitor event now serves
that purpose instead.
This isn't a bug at the moment because a claim in this situation would already
be filtered out due to its inclusion in htlcs_resolved_to_user. However, when
we stop issuing ReleasePaymentComplete monitor updates for claims in upcoming
commits, HTLC claims will no longer be in htlcs_resolved_to_user when
get_onchain_failed_htlcs checks.
Currently, the resolution of HTLCs (and decisions on when HTLCs can be
forwarded) is the responsibility of Channel objects (a part of ChannelManager)
until the channel is closed, and then the ChannelMonitor thereafter. This leads
to some complexity around race conditions for HTLCs right around channel
closure. Additionally, there is lots of complexity reconstructing the state of
all HTLCs in the ChannelManager deserialization/loading logic.

Instead, we want to do all resolution in ChannelMonitors (in response to
ChannelMonitorUpdates) and pass them back to ChannelManager in the form of
MonitorEvents (similar to how HTLCs are resolved after channels are closed). In
order to have reliable resolution, we'll need to keep MonitorEvents around in
the ChannelMonitor until the ChannelManager has finished processing them. This
will simplify things - on restart instead of examining the set of HTLCs in
monitors we can simply replay all the pending MonitorEvents.

In recent work, we added support for keeping monitor events around until they
are explicitly acked by the ChannelManager, but would always ack monitor events
immediately, which preserved the previous behavior and didn't break any tests.

Here we start acking monitor events for on-chain HTLC claims when the user
processes the PaymentSent event, if the persistent_monitor_events feature is
enabled. This allows us to stop issuing ReleasePaymentComplete monitor updates
for onchain payment claims, because the purpose of that behavior is to ensure
we won't keep repeatedly issuing PaymentSent events, and the monitor event and
acking it on PaymentSent processing now serves that purpose instead.
We want to make sure we don't leak any monitor events by leaving them unacked
forever.
Use new ForwardEventContents struct as the return value of the closure passed
to claim_funds_from_htlc_forward_hop.

Useful as upcoming commits will add a struct that wants to contain a forward
event specifically, not just any Event.
We'll be adding another parameter to this closure soon. Also this makes it a
little clearer what's going on at the callsites.
Used in upcoming commits to attach a post-event action to the right forward
event. It's not as trivial to include this field for outbound HTLCs, so we just
include it for inbounds for now and keep the field private to the crate.
We need this level of precision when generating this event for off-chain
claims.

One test failed due to us previously overreporting a fee due to this lack of
precision, fixed now.
Introduce in-memory tracking and pruning for forwarded-payment monitor events
that should only be acked once the corresponding inbound HTLCs have all been
durably resolved and a corresponding PaymentForwarded event has been processed.
(Note that there will only ever be 1 inbound HTLC and 1 outbound-edge-generated
monitor event, unless the forward is trampoline + MPP.)

Will be used in upcoming commits once we start relying on persistent
MonitorEvents to resolve off-chain forwarded payment claims. We'll add to this
new map when the event comes in, and prune from it when the inbound HTLCs are
fully resolved and there's no need for the monitor event to replay anymore, at
which point the monitor event will be acked.
If an inbound HTLC is about to be fully removed from the channel via
revoke_and_ack, we should ack any monitor event corresponding to that HTLC when
the monitor update associated with the RAA is complete.

We do this so the monitor event will keep being re-provided to us on startup
until the HTLC is removed, to ensure the HTLC gets resolved even if we lose the
holding cell.

In future commits, we will also queue this action for closed channels to fire
when the preimage update completes.
If an inbound HTLC is about to be fully removed from the channel via
revoke_and_ack, we should ack any monitor event corresponding to that HTLC when
the monitor update associated with the RAA is complete. The RAA update may be
temporarily blocked and then unblocked later, which is the case we handle here.

We do this so the monitor event will keep being re-provided to us on startup
until the HTLC is removed, to ensure the HTLC gets resolved even if we lose the
holding cell.
If we go to process a monitor event from an outbound edge HTLC claim and
discover that the corresponding inbound edge HTLC was already fully removed
from the channel, ack the monitor event immediately so it doesn't dangle
forever.

This can happen if, for example, the claim came from the outbound edge
counterparty claiming the HTLC *onchain* after it was already fully removed
from the inbound edge *off-chain*.
Indicates we should emit an Event::PaymentForwarded and ack a MonitorEvent that
the outbound edge monitor is waiting to remove.

This is generated when we've completed an inbound edge preimage update for an
HTLC forward, at which point it's safe to generate the forward event. It
replaces the previous variant EmitEventOptionAndFreeOtherChannel because when
persistent monitor events are enabled, there is no channel to unblock.
When a channel shuts down, there may be HTLCs that were mid-fulfill at the
time, i.e. we had sent the fulfill upstream but not yet durably removed the
inbound HTLC via RAA monitor update.

If persistent monitor events is enabled, then the outbound edge monitor events
for these HTLCs need to be explicitly acked on channel close to prevent a
duplicate PaymentForward + preimage update the next time we restart.

The reason for this is that in the normal case when we go to claim upstream and
the inbound edge is open, the expected behavior is that the monitor event will
be acked when the HTLC is removed via RAA. If the channel then shuts down, the
RAA hook never fires and the monitor event will be replayed on startup, leading
to the duplicate event/monitor update.

It wouldn't be so bad to just accept this one-time duplicate event and update,
but may as well handle it.
Currently, the resolution of HTLCs (and decisions on when HTLCs can be
forwarded) is the responsibility of Channel objects (a part of ChannelManager)
until the channel is closed, and then the ChannelMonitor thereafter. This leads
to some complexity around race conditions for HTLCs right around channel
closure. Additionally, there is lots of complexity reconstructing the state of
all HTLCs in the ChannelManager deserialization/loading logic.

Instead, we want to do all resolution in ChannelMonitors (in response to
ChannelMonitorUpdates) and pass them back to ChannelManager in the form of
MonitorEvents (similar to how HTLCs are resolved after channels are closed). In
order to have reliable resolution, we'll need to keep MonitorEvents around in
the ChannelMonitor until the ChannelManager has finished processing them. This
will simplify things - on restart instead of examining the set of HTLCs in
monitors we can simply replay all the pending MonitorEvents.

Building on the work of prior commits, here we stop immediately acking monitor
events for HTLC claims for forwarded HTLCs, and instead ack the monitor event
when the forward is fully resolved and we don't want the monitor event to be
re-provided back to us on startup.

If the inbound edge channel is on-chain, we'll ack the event when the inbound
edge preimage monitor update completes and the user processes a
PaymentForwarded event. If the inbound edge is open, we'll ack the event when
the inbound HTLC is fully removed via revoke_and_ack (this ensures we'll
remember to resolve the htlc off-chain even if we lose the holding cell).
We're moving towards having the ChannelMonitor be responsible for resolving
HTLCs, via MonitorEvents, so we need to start tracking more payment information
in monitor updates so the monitor later has access to it. Here we add tracking
for the skimmed_fee field for claims.
We're moving towards having the ChannelMonitor be responsible for resolving
HTLCs, via MonitorEvents, so we need to start surfacing more information in
monitor events.

Here we start persisting/surfacing a specific HTLC failure reason and the
skimmed fee in MonitorEvent::HTLCEvents, which is useful when generating
and handling monitor events for off-chain claims/fails.

The skimmed fee for forwards is already put to use in this commit for
generating correct PaymentForwarded events. The failure reason will be used in
upcoming commits. We add the failure reason now to not have to change what
we're serializing to disk later.
Useful to create accurate monitor events for HTLC failures in upcoming commits,
even when the channel is still open.
Useful to create accurate monitor events for HTLC failures, even when the
channel is still open.
Useful for the monitor to surface these failure reasons in
MonitorEvent::HTLCEvents, which the manager will use to resolve these HTLCs.
These events are a no-op right now, which preserves the previous behavior.  In
upcoming commits, they will enable us to move onchain htlc failure resolution
from Channels to ChannelMonitors as part of a larger refactor where monitors
resolve all HTLCs.
Generalize fail_htlc's completion-action parameter from PaymentCompleteUpdate
to EventCompletionAction so the upcoming persistent-monitor-events path can
pass AckMonitorEvent.
We are in the process of moving the resolution of off-chain HTLCs from the
Channel to the ChannelMonitor, via MonitorEvents. This will simplify how we
reconstruct/reconcile the ChannelManager's pending HTLC set on startup and
avoid needing to persist pending HTLCs explicitly in the manager, as we can
just replay pending monitor events to reconstruct the set instead.

In a recent commit, we started generating persistent monitor events whenever an
htlc is failed off-chain, but those events were no-ops at the time.

In this commit, we make those monitor events the sole driver of off-chain
outbound payment failure. When `persistent_monitor_events` is enabled, we will
skip the existing failure path for these HTLCs and instead fail them when the
monitor event is processed. Similar to what we do for claims, the monitor event
will not be acked and will continue to be re-provided on startup until the
resulting PaymentFailed event is processed by the user.

Forward and on-chain failure paths are unchanged and will be migrated
separately as part of the broader refactor.
DRYs code in an upcoming commit.
We are in the process of moving the resolution of all HTLCs from the Channel to
the ChannelMonitor, via persistent MonitorEvents. This will simplify how we
reconstruct/reconcile the ChannelManager's pending HTLC set on startup and
avoid needing to persist pending HTLCs explicitly in the manager, as we can
just replay pending monitor events to reconstruct the set instead.

In recent commits, we started generating persistent monitor events whenever an
HTLC is failed off-chain. We also made those monitor events the sole driver of
off-chain outbound payment failures, failing those HTLCs when the monitor event
is processed. This replaced the previous behavior of failing when we initially
receive update_fail_htlc from our peer.

In this commit, we continue that work by making persistent monitor events the
sole drivers of *on*chain outbound payment failures. When
persistent_monitor_events is enabled, we will skip existing failure paths for
on-chain outbound HTLCs, and instead (a) fail them when the monitor event is
processed, and (b) similar to what we do for claims, the monitor event will not
be acked and will continue to be re-provided on startup until the resulting
PaymentFailed event is processed by the user.

Forward HTLC failure paths are unchanged and will be migrated separately in
upcoming work.
Will be used in an upcoming commit so we can ack a monitor event resolving an
HTLC forward in the case that the inbound edge is closed at fail-back time.
Without this piece of info, we can't locate the relevant event.
Recently, we began generating monitor events when HTLCs fail
off-chain. In upcoming commits, we'll begin relying on those events
to resolve forwarded HTLC fails on the inbound edge.

In some situations, and in part because monitor events are re-provided until
they are explicitly acked, we may end up attempting to fail an HTLC upstream
when it has already been fully removed from the inbound edge channel. If we
hit this case, we know that we don't need the outbound edge monitor event
resolution anymore, and can ack it so it doesn't get re-provided again.
Here we add a return value to the relevant Channel methods so the
ChannelManager knows when we've hit this case. It'll be used in upcoming
commits when we start relying on monitor events for off-chain forward fails.
Recently, we began generating monitor events when HTLCs fail off-chain. In
upcoming commits, we'll begin relying on those events to resolve forwarded HTLC
fails on the inbound edge. As such, we need to thread the hold time to the
monitor so the generated event is correct.
Recently, we began generating monitor events when HTLCs fail
off-chain. In upcoming commits, we'll begin relying on those events
to resolve forwarded HTLC fails on the inbound edge.

One requirement for that is that we can't fail an HTLC backwards from a monitor
event until the monitor update removing that HTLC on the outbound edge is
durably persisted. Otherwise, if we crash and lose that monitor update at the
wrong time, the outbound edge counterparty could theoretically change their
fail resolution to a claim and we'd have no recourse because we already failed
the HTLC back on the inbound edge.

As such, here we begin not surfacing HTLC-related monitor events to the
ChannelManager if there are in-flight monitor updates present, to prevent this
scenario.
We are in the process of moving the resolution of all HTLCs from the Channel to
the ChannelMonitor, via persistent MonitorEvents. This will simplify how we
reconstruct/reconcile the ChannelManager's pending HTLC set on startup and
avoid needing to persist pending HTLCs explicitly in the manager, as we can
just replay pending monitor events to reconstruct the set instead.

In recent commits, we started generating persistent monitor events whenever an
HTLC is failed off-chain. We also made those monitor events the sole driver of
outbound payment failures, failing those HTLCs when the monitor event is
processed. This replaced the previous behavior of failing when we initially
receive update_fail_htlc from our peer.

In this commit, we continue that work by making persistent monitor events the
sole drivers of *forwarded* payment failures. When persistent_monitor_events is
enabled, we will skip existing failure paths for forwarded HTLCs, and instead
(a) fail them when the monitor event is processed, and (b) similar to what we
do for claims, the monitor event will not be acked and will continue to be
re-provided on startup until the HTLC is durably removed on the inbound edge.
@ldk-reviews-bot
Copy link
Copy Markdown

👋 Hi! I see this is a draft PR.
I'll wait to assign reviewers until you mark it as ready for review.
Just convert it out of draft status when you're ready for review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants