
ROX-34502: read TLS certificates from disk on each gRPC connection attempt#788

Open
vladbologa wants to merge 2 commits into main from vb/hot-reload-tls-certs


Conversation

@vladbologa

@vladbologa vladbologa commented Jun 9, 2026

Contributor

Description

The gRPC client previously read mTLS certificates once per run() invocation and reused the same TLS connector for all reconnection attempts. After certificate rotation on disk, fact would keep using the old certs until a config change forced a full client restart.

Move get_connector() inside the reconnection loop so certificates are re-read from the configured directory (ca.pem, cert.pem, key.pem) on each connection attempt. Active streams are not dropped; new certs are used only when establishing the next connection (after a disconnect, stream error, or config reload).
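The change can be illustrated with a minimal, std-only sketch. It is not the code from `fact/src/output/grpc.rs`: `CertBundle`, `load_bundle`, `run_with_reload` and the `connect` callback are hypothetical stand-ins for the real TLS connector and gRPC channel setup. The only thing it tries to show is that the certificate files are read inside the retry loop, so a rotation on disk is visible to the next attempt.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for the TLS material behind `get_connector()`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CertBundle {
    pub ca: Vec<u8>,
    pub cert: Vec<u8>,
    pub key: Vec<u8>,
}

/// Read ca.pem, cert.pem and key.pem from the configured directory.
pub fn load_bundle(dir: &Path) -> io::Result<CertBundle> {
    Ok(CertBundle {
        ca: fs::read(dir.join("ca.pem"))?,
        cert: fs::read(dir.join("cert.pem"))?,
        key: fs::read(dir.join("key.pem"))?,
    })
}

/// Try to connect up to `max_attempts` times, re-reading the bundle
/// before every attempt. `connect` is a hypothetical callback that
/// returns `true` once a connection has been established.
/// Returns the 1-based attempt number that succeeded, if any.
pub fn run_with_reload<F>(
    dir: &Path,
    max_attempts: usize,
    mut connect: F,
) -> io::Result<Option<usize>>
where
    F: FnMut(&CertBundle) -> bool,
{
    for attempt in 1..=max_attempts {
        // Loading inside the loop (rather than once before it) is the
        // whole point: each attempt sees whatever is on disk right now.
        let bundle = load_bundle(dir)?;
        if connect(&bundle) {
            return Ok(Some(attempt));
        }
    }
    Ok(None)
}
```

Calling `run_with_reload(dir, 3, |bundle| try_connect(bundle))`, with `try_connect` being whatever establishes the channel, would pick up a rotated `cert.pem` on the second attempt if the files changed after the first one failed. An established connection is unaffected, because the bundle is only consulted when a new attempt starts.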

In Kubernetes, secret volume mounts update atomically via a symlink swap, so all three cert files change together. There should be no race condition where fact reads a mix of old and new certificates.
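For readers unfamiliar with the symlink-swap technique, here is a rough sketch of the general pattern on a Unix-like system. It is an illustration under assumptions, not the kubelet's code: `swap_symlink`, the `current` link name and the temporary link name are made up for the example. The idea is that each version of the files lives in its own directory, and a single `rename` repoints one symlink, so a reader going through the link sees either the complete old set or the complete new set.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::Path;

/// Repoint `link` at `target` in one step.
///
/// A temporary symlink is created next to `link` and then renamed over
/// it. On POSIX systems `rename` replaces the destination atomically,
/// so there is no moment at which `link` is missing or half-updated.
pub fn swap_symlink(link: &Path, target: &Path) -> io::Result<()> {
    // Hypothetical temporary name, placed in the same directory as `link`.
    let tmp = link.with_file_name(".current.tmp");

    // Clean up a leftover temporary link from an interrupted swap.
    match fs::remove_file(&tmp) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }

    symlink(target, &tmp)?;
    // `rename` acts on the symlink itself, not on what it points to.
    fs::rename(&tmp, link)
}
```

With versioned directories `v1/` and `v2/` that each hold `ca.pem`, `cert.pem` and `key.pem`, calling `swap_symlink(&base.join("current"), Path::new("v2"))` switches all three files at once for anyone reading through `current/`. Note that a reader opening the three files one after another could still straddle a swap unless it resolves the link once up front, so the guarantee depends on how the files are read.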

Checklist

  • Patch has a change log entry OR does not need one.
  • Investigated and inspected CI test results
  • Updated documentation accordingly

Automated testing

  • Added unit tests
  • Added integration tests
  • Added regression tests

If any of these don't apply, please comment below.

Testing Performed

Tested certificate hot-reloading end-to-end on an OpenShift cluster with StackRox installed via the operator.

Setup

  1. Enabled FACT in the SecuredCluster CR (spec.perNode.fileActivityMonitoring.mode: Enabled)
  2. Generated a new collector certificate signed by the same CA (from central-tls), with a distinct CN (COLLECTOR_SERVICE: regenerated-<timestamp>) and fingerprint to distinguish it from the original
  3. Scaled down the operator and real sensor
  4. Deployed a fake sensor, a minimal gRPC server that only implements FileActivityService.Communicate, so every logged connection is guaranteed to be from FACT (not collector or compliance)

Results

Baseline (mainline FACT, no hot-reload)

| Step | Observed cert | Result |
| --- | --- | --- |
| FACT connects to fake sensor | Original cert (03d391ae...) | OK |
| Replace tls-cert-collector secret, wait, kill fake-sensor to force reconnect | Original cert (03d391ae...) | No hot-reload - FACT kept using the cert loaded at startup |

PR build (quay.io/stackrox-io/fact:0.3.x-64-g94f0d82e55)

| Step | Observed cert | Result |
| --- | --- | --- |
| FACT starts with regenerated cert in secret | Regenerated cert (44e491aa...) | Loaded at startup |
| Swap secret to original, wait 2 minutes, kill fake-sensor | Original cert (03d391ae...) | Hot-reload works |
| Swap secret back to regenerated, wait 2 minutes, kill fake-sensor | Regenerated cert (44e491aa...) | Hot-reload works |

Summary by CodeRabbit

  • Bug Fixes
    • mTLS certificates are now reloaded on each gRPC connection attempt instead of being loaded once before retries. This ensures the latest certificate configuration is applied consistently across all connection attempts.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 3f7484f7-ec56-4ebb-96a3-a625f1950d93

📥 Commits

Reviewing files that changed from the base of the PR and between d292312 and 66cef7e.

📒 Files selected for processing (1)
  • CHANGELOG.md
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md

📝 Walkthrough

Moves TLS connector acquisition into the gRPC client's reconnect loop so mTLS certificates are reloaded on every connection attempt; CHANGELOG updated to note ROX-34502.

Changes

mTLS Certificate Reloading

| Layer / File(s) | Summary |
| --- | --- |
| mTLS Certificate Reloading Implementation and Documentation: fact/src/output/grpc.rs, CHANGELOG.md | The Client::run method moves connector fetching (get_connector().await?) inside the reconnect loop, ensuring fresh TLS certificates are obtained on each connection attempt. The changelog documents this change under ROX-34502. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I’m a rabbit in a tiny lab,
I hop for certs, I do the grab,
Each reconnect I nip and cheer,
Fresh TLS fetched — no stale one here,
Hooray for secure hops, my dear! 🐇🔐

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and specifically summarizes the main change: moving TLS certificate reading inside the gRPC reconnection loop for hot-reload capability. |
| Description check | ✅ Passed | The description is comprehensive, covering motivation, implementation details, testing methodology, and checklist items; all required template sections are addressed. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@codecov-commenter

codecov-commenter commented Jun 9, 2026


Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.96%. Comparing base (5310516) to head (66cef7e).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fact/src/output/grpc.rs | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #788   +/-   ##
=======================================
  Coverage   27.96%   27.96%           
=======================================
  Files          21       21           
  Lines        2596     2596           
  Branches     2596     2596           
=======================================
  Hits          726      726           
  Misses       1867     1867           
  Partials        3        3           


@vladbologa vladbologa force-pushed the vb/hot-reload-tls-certs branch from 0c35316 to d292312 Compare June 9, 2026 13:23
@vladbologa vladbologa marked this pull request as ready for review June 9, 2026 14:34
@vladbologa vladbologa requested a review from a team as a code owner June 9, 2026 14:34
@vladbologa vladbologa requested a review from Molter73 June 9, 2026 14:35

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
fact/src/output/grpc.rs (1)

123-136: 🏗️ Heavy lift

Add a regression test for cert rotation across reconnects.

This change is correct, but it currently relies on manual verification. Please add an automated integration test that rotates ca.pem/cert.pem/key.pem, forces reconnect, and asserts the next connection succeeds with the new cert set.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fact/src/output/grpc.rs` around lines 123 - 136, Add an automated integration
test that verifies certificate rotation is picked up across reconnects: create a
test that spins up a test gRPC server and a client using the run() loop (or
directly exercising get_connector() and create_channel()), write an initial cert
set (ca.pem/cert.pem/key.pem) in a temp dir, establish a successful connection,
then replace those files with a rotated cert set, trigger a reconnect (e.g.,
stop the server or drop the channel so run() retries), and assert that a
subsequent create_channel() / connection attempt succeeds using the new certs;
use temporary directories and deterministic waits/timeouts to avoid flakiness
and reference the run, get_connector, and create_channel functions to locate
where the reconnect behavior is exercised.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 07767001-17e4-4b20-a976-e969fdb661fa

📥 Commits

Reviewing files that changed from the base of the PR and between 5310516 and d292312.

📒 Files selected for processing (2)
  • CHANGELOG.md
  • fact/src/output/grpc.rs

@Molter73 Molter73 left a comment

Collaborator


With these changes, fact will keep active connections going indefinitely until it gets disconnected from sensor. Is there a chance this disconnect never happens and the certificates keep being used past their expiration date?

This is not for you to address BTW, but we can look into dropping active connections when a certificate change happens if this is required.

Comment thread CHANGELOG.md Outdated
Co-authored-by: Mauro Ezequiel Moltrasio <mmoltras@redhat.com>
@vladbologa

vladbologa commented Jun 9, 2026

Contributor Author

> With these changes, fact will keep active connections going indefinitely until it gets disconnected from sensor. Is there a chance this disconnect never happens and the certificates keep being used past their expiration date?
>
> This is not for you to address BTW, but we can look into dropping active connections when a certificate change happens if this is required.

It's not a problem; we do that everywhere else. Certificates have to be valid only at handshake time.

We also refresh certs before they reach 50% of their validity period, so most of the time this won't be an issue in practice. With short-lived certificates it can happen, but it works: I tested this over the weekend, and a Sensor <-> Central connection survived for days even though the cert was valid for only 2h.
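To make the 50% figure concrete, here is a small sketch of the arithmetic only. The actual refresh logic in StackRox is not shown in this thread, so `refresh_deadline` and `needs_refresh` are hypothetical helpers that assume the refresh point is simply the midpoint between the certificate's not-before and not-after times.

```rust
use std::time::SystemTime;

/// Midpoint of the validity window: the latest time by which a
/// "refresh before 50% of validity" policy would want a new cert.
/// Returns `None` if `not_after` is earlier than `not_before`.
pub fn refresh_deadline(not_before: SystemTime, not_after: SystemTime) -> Option<SystemTime> {
    let validity = not_after.duration_since(not_before).ok()?;
    Some(not_before + validity / 2)
}

/// True once `now` has reached the midpoint. An inverted validity
/// window is treated as needing a refresh.
pub fn needs_refresh(now: SystemTime, not_before: SystemTime, not_after: SystemTime) -> bool {
    match refresh_deadline(not_before, not_after) {
        Some(deadline) => now >= deadline,
        None => true,
    }
}
```

For the 2h certificate mentioned above, the midpoint falls 1h after issuance. A connection that stays up for days therefore outlives several refreshes as well as the expiry of the cert it was established with, which is fine as long as validity is only checked at handshake time.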


3 participants