[networking] Localising ingress request latency on ACP with curl timing and an LB-bypass probe#168
Conversation
WalkthroughA new troubleshooting guide is added to document an end-to-end procedure for localizing intermittent ingress latency by running parallel continuous probes through the external load balancer and directly to gateway pod IPs, with interpretation guidance and diagnostic fallback steps for cases where both paths appear healthy. ChangesIngress Latency Diagnosis Guide
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md (3)
118-119: 💤 Low valueCleanup wording is slightly contradictory.
The doc says:
- “Leaving them running indefinitely is fine operationally”
- but also “the loop runs forever—delete the namespace when finished”
These conflict a bit. Reword to clarify that leaving them running for the duration of your investigation is fine, and namespace deletion is recommended immediately after.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md` around lines 118 - 119, Update the contradictory cleanup wording in the paragraph that currently contains the phrases “Leaving them running indefinitely is fine operationally” and “the loop runs forever—delete the namespace when finished”: rephrase to say leaving probe pods running is acceptable only for the duration of your investigation and explicitly recommend deleting the namespace (or stopping the probes) immediately after the investigation completes to avoid ongoing load balancer costs.
66-71: ⚡ Quick winImprove
HOSTparsing for robustness (ports / unexpected URL forms).Step 4 derives
HOSTvia:
HOST=$(printf "%s" "'$URL'" | awk -F/ "{print \$3}")This assumes a simple
https://host/pathstructure with no explicit port. If the URL includes a port (or if the path structure is slightly different),HOSTcan becomehost:port, while--resolve "$HOST:443:$ip"would be semantically inconsistent for SNI/Host matching and can break the “same hostname via TLS SNI” intent.At minimum, document “URL must be https:/// (no explicit non-443 port)”. Better: strip any
:portfromHOSTbefore using it in--resolve.Suggested change (strip port from hostname)
- HOST=$(printf "%s" "'$URL'" | awk -F/ "{print \$3}") + HOST=$(printf "%s" "'$URL'" \ + | sed -E 's#^https?://([^/]+).*$#\1#' \ + | cut -d: -f1)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md` around lines 66 - 71, The HOST extraction can include an explicit port (e.g. host:1234) which breaks the intended SNI/--resolve mapping; update the script so after computing HOST you strip any ":port" suffix into a new variable (e.g. HOST_NO_PORT) and use that when calling --resolve and when passing the SNI/Host header, and add a short validation note that URL should be https://... or fail early if scheme is not https; locate the HOST assignment and all occurrences of --resolve "$HOST:443:$ip" (and any direct uses of HOST for TLS/SNI) and replace them to use the port-stripped hostname variable instead.
54-56: ⚡ Quick win
-k/--insecureshould be framed as troubleshooting-only.Both probes use
curl -sk.-ksuppresses TLS verification, which is sometimes acceptable for internal endpoints, but the doc reads like a general procedure; it’s easy for readers to copy/paste and accidentally normalize insecure TLS behavior.Consider adding a one-liner near the snippets, e.g. “Remove
-kif your CA is trusted; keep it only when certs are self-signed/untrusted.”Also applies to: 71-72
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md` around lines 54 - 56, The curl examples use the -k/--insecure flag (see the "curl -sk --noproxy \"*\" --connect-timeout 2 -w ..." snippet), which should be explicitly marked as troubleshooting-only; add a concise one-line note adjacent to both curl snippets advising to remove -k when the CA is trusted (or replace with a proper --cacert/CA-config) and to only use -k for self-signed/untrusted certs during troubleshooting so readers don’t normalize insecure TLS behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`:
- Around line 48-76: The curl probes (kubectl run curl-lb and curl-bypass) only
use --connect-timeout which limits TCP connect but not the whole request, so the
probe can hang during TLS handshake or slow reads; update both curl invocations
in the curl-lb and curl-bypass command strings to add a total timeout flag like
--max-time 10 (or -m 10) alongside --connect-timeout 2 to ensure each curl call
cannot block indefinitely and the probe loops continue to iterate.
---
Nitpick comments:
In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`:
- Around line 118-119: Update the contradictory cleanup wording in the paragraph
that currently contains the phrases “Leaving them running indefinitely is fine
operationally” and “the loop runs forever—delete the namespace when finished”:
rephrase to say leaving probe pods running is acceptable only for the duration
of your investigation and explicitly recommend deleting the namespace (or
stopping the probes) immediately after the investigation completes to avoid
ongoing load balancer costs.
- Around line 66-71: The HOST extraction can include an explicit port (e.g.
host:1234) which breaks the intended SNI/--resolve mapping; update the script so
after computing HOST you strip any ":port" suffix into a new variable (e.g.
HOST_NO_PORT) and use that when calling --resolve and when passing the SNI/Host
header, and add a short validation note that URL should be https://... or fail
early if scheme is not https; locate the HOST assignment and all occurrences of
--resolve "$HOST:443:$ip" (and any direct uses of HOST for TLS/SNI) and replace
them to use the port-stripped hostname variable instead.
- Around line 54-56: The curl examples use the -k/--insecure flag (see the "curl
-sk --noproxy \"*\" --connect-timeout 2 -w ..." snippet), which should be
explicitly marked as troubleshooting-only; add a concise one-line note adjacent
to both curl snippets advising to remove -k when the CA is trusted (or replace
with a proper --cacert/CA-config) and to only use -k for self-signed/untrusted
certs during troubleshooting so readers don’t normalize insecure TLS behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 34ccbf8d-4243-4a53-9b5f-4fb610aae498
📒 Files selected for processing (1)
docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md
| 3. **Run the LB-fronted probe** — this is what your users see. It records per-request latency and HTTP response code: | ||
|
|
||
| ```bash | ||
| kubectl -n ingress-latency-probe run curl-lb --restart=Never \ | ||
| --image=curlimages/curl:8.10.1 --command -- sh -ec ' | ||
| while :; do | ||
| curl -sk --noproxy "*" --connect-timeout 2 \ | ||
| -w "remote_ip=%{remote_ip} code=%{response_code} connect=%{time_connect} total=%{time_total}\n" \ | ||
| -o /dev/null "'$URL'" || true | ||
| sleep 5 | ||
| done' | ||
| ``` | ||
|
|
||
| 4. **Run the bypass probe** — same URL, but resolve the hostname to each gateway IP in turn via `curl --resolve`. TLS SNI still matches the certificate because the Host header and SNI host stay the same. | ||
|
|
||
| ```bash | ||
| kubectl -n ingress-latency-probe run curl-bypass --restart=Never \ | ||
| --image=curlimages/curl:8.10.1 --command -- sh -ec ' | ||
| HOST=$(printf "%s" "'$URL'" | awk -F/ "{print \$3}") | ||
| while :; do | ||
| for ip in '"$(echo $GATEWAY_IPS | tr "\n" " ")"'; do | ||
| curl -sk --noproxy "*" --connect-timeout 2 \ | ||
| --resolve "$HOST:443:$ip" \ | ||
| -w "remote_ip=%{remote_ip} code=%{response_code} connect=%{time_connect} total=%{time_total}\n" \ | ||
| -o /dev/null "'$URL'" || true | ||
| done | ||
| sleep 5 | ||
| done' | ||
| ``` |
There was a problem hiding this comment.
--connect-timeout isn’t enough: add --max-time (or -m) to prevent curl hangs.
Both probe loops run forever and each iteration executes curl ... --connect-timeout 2 .... That only caps connection establishment. TLS handshake, waiting for first byte, or slow reads can still make the curl command block well beyond the intended time window, which can (a) make your “fast/slow” classification misleading and (b) stall the probe loop entirely.
Add a total timeout, e.g. --max-time 10 (or similar), to both curl invocations:
Suggested change
- curl -sk --noproxy "*" --connect-timeout 2 \
+ curl -sk --noproxy "*" --connect-timeout 2 --max-time 10 \
-w "remote_ip=%{remote_ip} code=%{response_code} connect=%{time_connect} total=%{time_total}\n" \
-o /dev/null "'$URL'" || trueand
- curl -sk --noproxy "*" --connect-timeout 2 \
+ curl -sk --noproxy "*" --connect-timeout 2 --max-time 10 \
--resolve "$HOST:443:$ip" \
-w "remote_ip=%{remote_ip} code=%{response_code} connect=%{time_connect} total=%{time_total}\n" \
-o /dev/null "'$URL'" || true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`
around lines 48 - 76, The curl probes (kubectl run curl-lb and curl-bypass) only
use --connect-timeout which limits TCP connect but not the whole request, so the
probe can hang during TLS handshake or slow reads; update both curl invocations
in the curl-lb and curl-bypass command strings to add a total timeout flag like
--max-time 10 (or -m 10) alongside --connect-timeout 2 to ensure each curl call
cannot block indefinitely and the probe loops continue to iterate.
…ng and an LB-bypass probe
fd3176a to
8f2b7e7
Compare
新增一篇 ACP KB 文章。