Skip to content

[networking] Localising ingress request latency on ACP with curl timing and an LB-bypass probe#168

Merged
jing2uo merged 1 commit into
mainfrom
kb/2026-04/localising-ingress-latency-is-it-the-ext
May 27, 2026
Merged

[networking] Localising ingress request latency on ACP with curl timing and an LB-bypass probe#168
jing2uo merged 1 commit into
mainfrom
kb/2026-04/localising-ingress-latency-is-it-the-ext

Conversation

@jing2uo
Copy link
Copy Markdown
Collaborator

@jing2uo jing2uo commented Apr 22, 2026

新增一篇 ACP KB 文章。

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 22, 2026

Walkthrough

A new troubleshooting guide is added to document an end-to-end procedure for localizing intermittent ingress latency by running parallel continuous probes through the external load balancer and directly to gateway pod IPs, with interpretation guidance and diagnostic fallback steps for cases where both paths appear healthy.

Changes

Ingress Latency Diagnosis Guide

Layer / File(s) Summary
Overview & Main Procedure
docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md (lines 1–99)
Step-by-step workflow to create a diagnostic namespace, export probe configuration, gather gateway pod IPs, run two long-running curl probes (one through the load balancer, one bypassed), capture per-request latency and HTTP codes, and interpret results via a decision table to pinpoint bottleneck location.
Diagnostic Fallback Steps
docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md (lines 100–118)
Guidance for cases where both probes appear healthy, covering DNS resolution verification, external network tracing, packet capture coordination, gateway metrics interpretation, and operational notes on namespace cleanup and probe lifecycle management.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • oilbeater
  • fanzy618

Poem

🐰 Latency hides in network lanes,
Two probes dance through load-balancer chains,
Load balancer path, then gateway bypass—
Compare the metrics, find what's amiss!
Troubleshoot with grace, your ingress now flies.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding documentation for localizing ingress latency using curl timing and LB-bypass probes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kb/2026-04/localising-ingress-latency-is-it-the-ext

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@danielfbm danielfbm removed their request for review April 24, 2026 02:29
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md (3)

118-119: 💤 Low value

Cleanup wording is slightly contradictory.

The doc says:

  • “Leaving them running indefinitely is fine operationally”
  • but also “the loop runs forever—delete the namespace when finished”

These conflict a bit. Reword to clarify that leaving them running for the duration of your investigation is fine, and namespace deletion is recommended immediately after.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`
around lines 118 - 119, Update the contradictory cleanup wording in the
paragraph that currently contains the phrases “Leaving them running indefinitely
is fine operationally” and “the loop runs forever—delete the namespace when
finished”: rephrase to say leaving probe pods running is acceptable only for the
duration of your investigation and explicitly recommend deleting the namespace
(or stopping the probes) immediately after the investigation completes to avoid
ongoing load balancer costs.

66-71: ⚡ Quick win

Improve HOST parsing for robustness (ports / unexpected URL forms).

Step 4 derives HOST via:

  • HOST=$(printf "%s" "'$URL'" | awk -F/ "{print \$3}")

This assumes a simple https://host/path structure with no explicit port. If the URL includes a port (or if the path structure is slightly different), HOST can become host:port, while --resolve "$HOST:443:$ip" would be semantically inconsistent for SNI/Host matching and can break the “same hostname via TLS SNI” intent.

At minimum, document “URL must be https:/// (no explicit non-443 port)”. Better: strip any :port from HOST before using it in --resolve.

Suggested change (strip port from hostname)
-  HOST=$(printf "%s" "'$URL'" | awk -F/ "{print \$3}")
+  HOST=$(printf "%s" "'$URL'" \
+    | sed -E 's#^https?://([^/]+).*$#\1#' \
+    | cut -d: -f1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`
around lines 66 - 71, The HOST extraction can include an explicit port (e.g.
host:1234) which breaks the intended SNI/--resolve mapping; update the script so
after computing HOST you strip any ":port" suffix into a new variable (e.g.
HOST_NO_PORT) and use that when calling --resolve and when passing the SNI/Host
header, and add a short validation note that URL should be https://... or fail
early if scheme is not https; locate the HOST assignment and all occurrences of
--resolve "$HOST:443:$ip" (and any direct uses of HOST for TLS/SNI) and replace
them to use the port-stripped hostname variable instead.

54-56: ⚡ Quick win

-k/--insecure should be framed as troubleshooting-only.

Both probes use curl -sk. -k suppresses TLS verification, which is sometimes acceptable for internal endpoints, but the doc reads like a general procedure; it’s easy for readers to copy/paste and accidentally normalize insecure TLS behavior.

Consider adding a one-liner near the snippets, e.g. “Remove -k if your CA is trusted; keep it only when certs are self-signed/untrusted.”

Also applies to: 71-72

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`
around lines 54 - 56, The curl examples use the -k/--insecure flag (see the
"curl -sk --noproxy \"*\" --connect-timeout 2 -w ..." snippet), which should be
explicitly marked as troubleshooting-only; add a concise one-line note adjacent
to both curl snippets advising to remove -k when the CA is trusted (or replace
with a proper --cacert/CA-config) and to only use -k for self-signed/untrusted
certs during troubleshooting so readers don’t normalize insecure TLS behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`:
- Around line 48-76: The curl probes (kubectl run curl-lb and curl-bypass) only
use --connect-timeout which limits TCP connect but not the whole request, so the
probe can hang during TLS handshake or slow reads; update both curl invocations
in the curl-lb and curl-bypass command strings to add a total timeout flag like
--max-time 10 (or -m 10) alongside --connect-timeout 2 to ensure each curl call
cannot block indefinitely and the probe loops continue to iterate.

---

Nitpick comments:
In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`:
- Around line 118-119: Update the contradictory cleanup wording in the paragraph
that currently contains the phrases “Leaving them running indefinitely is fine
operationally” and “the loop runs forever—delete the namespace when finished”:
rephrase to say leaving probe pods running is acceptable only for the duration
of your investigation and explicitly recommend deleting the namespace (or
stopping the probes) immediately after the investigation completes to avoid
ongoing load balancer costs.
- Around line 66-71: The HOST extraction can include an explicit port (e.g.
host:1234) which breaks the intended SNI/--resolve mapping; update the script so
after computing HOST you strip any ":port" suffix into a new variable (e.g.
HOST_NO_PORT) and use that when calling --resolve and when passing the SNI/Host
header, and add a short validation note that URL should be https://... or fail
early if scheme is not https; locate the HOST assignment and all occurrences of
--resolve "$HOST:443:$ip" (and any direct uses of HOST for TLS/SNI) and replace
them to use the port-stripped hostname variable instead.
- Around line 54-56: The curl examples use the -k/--insecure flag (see the "curl
-sk --noproxy \"*\" --connect-timeout 2 -w ..." snippet), which should be
explicitly marked as troubleshooting-only; add a concise one-line note adjacent
to both curl snippets advising to remove -k when the CA is trusted (or replace
with a proper --cacert/CA-config) and to only use -k for self-signed/untrusted
certs during troubleshooting so readers don’t normalize insecure TLS behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 34ccbf8d-4243-4a53-9b5f-4fb610aae498

📥 Commits

Reviewing files that changed from the base of the PR and between c8b7b50 and 4d36b3b.

📒 Files selected for processing (1)
  • docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md

Comment on lines +48 to +76
3. **Run the LB-fronted probe** — this is what your users see. It records per-request latency and HTTP response code:

```bash
kubectl -n ingress-latency-probe run curl-lb --restart=Never \
--image=curlimages/curl:8.10.1 --command -- sh -ec '
while :; do
curl -sk --noproxy "*" --connect-timeout 2 \
-w "remote_ip=%{remote_ip} code=%{response_code} connect=%{time_connect} total=%{time_total}\n" \
-o /dev/null "'$URL'" || true
sleep 5
done'
```

4. **Run the bypass probe** — same URL, but resolve the hostname to each gateway IP in turn via `curl --resolve`. TLS SNI still matches the certificate because the Host header and SNI host stay the same.

```bash
kubectl -n ingress-latency-probe run curl-bypass --restart=Never \
--image=curlimages/curl:8.10.1 --command -- sh -ec '
HOST=$(printf "%s" "'$URL'" | awk -F/ "{print \$3}")
while :; do
for ip in '"$(echo $GATEWAY_IPS | tr "\n" " ")"'; do
curl -sk --noproxy "*" --connect-timeout 2 \
--resolve "$HOST:443:$ip" \
-w "remote_ip=%{remote_ip} code=%{response_code} connect=%{time_connect} total=%{time_total}\n" \
-o /dev/null "'$URL'" || true
done
sleep 5
done'
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

--connect-timeout isn’t enough: add --max-time (or -m) to prevent curl hangs.

Both probe loops run forever and each iteration executes curl ... --connect-timeout 2 .... That only caps connection establishment. TLS handshake, waiting for first byte, or slow reads can still make the curl command block well beyond the intended time window, which can (a) make your “fast/slow” classification misleading and (b) stall the probe loop entirely.

Add a total timeout, e.g. --max-time 10 (or similar), to both curl invocations:

Suggested change
-  curl -sk --noproxy "*" --connect-timeout 2 \
+  curl -sk --noproxy "*" --connect-timeout 2 --max-time 10 \
     -w "remote_ip=%{remote_ip} code=%{response_code} connect=%{time_connect} total=%{time_total}\n" \
     -o /dev/null "'$URL'" || true

and

-  curl -sk --noproxy "*" --connect-timeout 2 \
+  curl -sk --noproxy "*" --connect-timeout 2 --max-time 10 \
     --resolve "$HOST:443:$ip" \
     -w "remote_ip=%{remote_ip} code=%{response_code} connect=%{time_connect} total=%{time_total}\n" \
     -o /dev/null "'$URL'" || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`
around lines 48 - 76, The curl probes (kubectl run curl-lb and curl-bypass) only
use --connect-timeout which limits TCP connect but not the whole request, so the
probe can hang during TLS handshake or slow reads; update both curl invocations
in the curl-lb and curl-bypass command strings to add a total timeout flag like
--max-time 10 (or -m 10) alongside --connect-timeout 2 to ensure each curl call
cannot block indefinitely and the probe loops continue to iterate.

@jing2uo jing2uo force-pushed the kb/2026-04/localising-ingress-latency-is-it-the-ext branch from fd3176a to 8f2b7e7 Compare May 27, 2026 03:22
@jing2uo jing2uo changed the title [networking] Localising Ingress Latency — Is It the External Load Balancer or the Gateway Pods? [networking] Localising ingress request latency on ACP with curl timing and an LB-bypass probe May 27, 2026
@jing2uo jing2uo merged commit e3969b4 into main May 27, 2026
2 checks passed
@jing2uo jing2uo deleted the kb/2026-04/localising-ingress-latency-is-it-the-ext branch May 27, 2026 03:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant