ROX-34502: read TLS certificates from disk on each gRPC connection attempt#788
ROX-34502: read TLS certificates from disk on each gRPC connection attempt#788vladbologa wants to merge 2 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughMoves TLS connector acquisition into the gRPC client's reconnect loop so mTLS certificates are reloaded on every connection attempt; CHANGELOG updated to note ROX-34502. ChangesmTLS Certificate Reloading
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #788 +/- ##
=======================================
Coverage 27.96% 27.96%
=======================================
Files 21 21
Lines 2596 2596
Branches 2596 2596
=======================================
Hits 726 726
Misses 1867 1867
Partials 3 3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
0c35316 to
d292312
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
fact/src/output/grpc.rs (1)
123-136: 🏗️ Heavy liftAdd a regression test for cert rotation across reconnects.
This change is correct, but it currently relies on manual verification. Please add an automated integration test that rotates
ca.pem/cert.pem/key.pem, forces reconnect, and asserts the next connection succeeds with the new cert set.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fact/src/output/grpc.rs` around lines 123 - 136, Add an automated integration test that verifies certificate rotation is picked up across reconnects: create a test that spins up a test gRPC server and a client using the run() loop (or directly exercising get_connector() and create_channel()), write an initial cert set (ca.pem/cert.pem/key.pem) in a temp dir, establish a successful connection, then replace those files with a rotated cert set, trigger a reconnect (e.g., stop the server or drop the channel so run() retries), and assert that a subsequent create_channel() / connection attempt succeeds using the new certs; use temporary directories and deterministic waits/timeouts to avoid flakiness and reference the run, get_connector, and create_channel functions to locate where the reconnect behavior is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@fact/src/output/grpc.rs`:
- Around line 123-136: Add an automated integration test that verifies
certificate rotation is picked up across reconnects: create a test that spins up
a test gRPC server and a client using the run() loop (or directly exercising
get_connector() and create_channel()), write an initial cert set
(ca.pem/cert.pem/key.pem) in a temp dir, establish a successful connection, then
replace those files with a rotated cert set, trigger a reconnect (e.g., stop the
server or drop the channel so run() retries), and assert that a subsequent
create_channel() / connection attempt succeeds using the new certs; use
temporary directories and deterministic waits/timeouts to avoid flakiness and
reference the run, get_connector, and create_channel functions to locate where
the reconnect behavior is exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 07767001-17e4-4b20-a976-e969fdb661fa
📒 Files selected for processing (2)
CHANGELOG.mdfact/src/output/grpc.rs
There was a problem hiding this comment.
With these changes fact will keep active connections going indefinitely until it gets disconnected from sensor, is there a chance this disconnect doesn't happen and the certificates keep being used past their expiration date?
This is not for you to address BTW, but we can look into dropping active connections when a certificate change happens if this is required.
Co-authored-by: Mauro Ezequiel Moltrasio <mmoltras@redhat.com>
It's not a problem, we do that everywhere else. Certificates have to be valid only at handshake time. We also refresh certs before they reach 50% of validity time, so most of the time this won't be an issue in practice. With short lived certificates it can happen, but it works (in fact I just tested over this weekend, where a Sensor <-> Central connection survived for days even though the cert was valid for only 2h). |
Description
The gRPC client previously read mTLS certificates once per
run()invocation and reused the same TLS connector for all reconnection attempts. After certificate rotation on disk, fact would keep using the old certs until a config change forced a full client restart.Move
get_connector()inside the reconnection loop so certificates are re-read from the configured directory (ca.pem,cert.pem,key.pem) on each connection attempt. Active streams are not dropped; new certs are used only when establishing the next connection (after a disconnect, stream error, or config reload).In Kubernetes, secret volume mounts update atomically via a symlink swap, so all three cert files change together. There should be no race condition where fact reads a mix of old and new certificates.
Checklist
Automated testing
If any of these don't apply, please comment below.
Testing Performed
Tested certificate hot-reloading end-to-end on a OpenShift cluster with StackRox installed via the operator.
Setup
spec.perNode.fileActivityMonitoring.mode: Enabled)central-tls), with a distinct CN (COLLECTOR_SERVICE: regenerated-<timestamp>) and fingerprint to distinguish it from the originalFileActivityService.Communicate, so every logged connection is guaranteed to be from FACT (not collector or compliance)Results
Baseline (mainline FACT, no hot-reload)
03d391ae...)tls-cert-collectorsecret, wait, kill fake-sensor to force reconnect03d391ae...)PR build (
quay.io/stackrox-io/fact:0.3.x-64-g94f0d82e55)44e491aa...)03d391ae...)44e491aa...)Summary by CodeRabbit