[networking] Localising ingress request latency on ACP with curl timing and an LB-bypass probe by jing2uo · Pull Request #168 · alauda/knowledge

jing2uo · 2026-04-22T15:34:50Z

新增一篇 ACP KB 文章。

coderabbitai · 2026-04-22T15:36:57Z

Walkthrough

A new troubleshooting guide is added to document an end-to-end procedure for localizing intermittent ingress latency by running parallel continuous probes through the external load balancer and directly to gateway pod IPs, with interpretation guidance and diagnostic fallback steps for cases where both paths appear healthy.

Changes

Ingress Latency Diagnosis Guide

Layer / File(s)	Summary
Overview & Main Procedure `docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md` (lines 1–99)	Step-by-step workflow to create a diagnostic namespace, export probe configuration, gather gateway pod IPs, run two long-running curl probes (one through the load balancer, one bypassed), capture per-request latency and HTTP codes, and interpret results via a decision table to pinpoint bottleneck location.
Diagnostic Fallback Steps `docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md` (lines 100–118)	Guidance for cases where both probes appear healthy, covering DNS resolution verification, external network tracing, packet capture coordination, gateway metrics interpretation, and operational notes on namespace cleanup and probe lifecycle management.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

oilbeater
fanzy618

Poem

🐰 Latency hides in network lanes,
Two probes dance through load-balancer chains,
Load balancer path, then gateway bypass—
Compare the metrics, find what's amiss!
Troubleshoot with grace, your ingress now flies. ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: adding documentation for localizing ingress latency using curl timing and LB-bypass probes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch kb/2026-04/localising-ingress-latency-is-it-the-ext

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md (3)
118-119: 💤 Low value

Cleanup wording is slightly contradictory.

The doc says:

“Leaving them running indefinitely is fine operationally”

but also “the loop runs forever—delete the namespace when finished”

These conflict a bit. Reword to clarify that leaving them running for the duration of your investigation is fine, and namespace deletion is recommended immediately after.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`
around lines 118 - 119, Update the contradictory cleanup wording in the
paragraph that currently contains the phrases “Leaving them running indefinitely
is fine operationally” and “the loop runs forever—delete the namespace when
finished”: rephrase to say leaving probe pods running is acceptable only for the
duration of your investigation and explicitly recommend deleting the namespace
(or stopping the probes) immediately after the investigation completes to avoid
ongoing load balancer costs.
66-71: ⚡ Quick win

Improve HOST parsing for robustness (ports / unexpected URL forms).

Step 4 derives HOST via:

HOST=$(printf "%s" "'$URL'" | awk -F/ "{print \$3}")

This assumes a simple https://host/path structure with no explicit port. If the URL includes a port (or if the path structure is slightly different), HOST can become host:port, while --resolve "$HOST:443:$ip" would be semantically inconsistent for SNI/Host matching and can break the “same hostname via TLS SNI” intent.

At minimum, document “URL must be https:/// (no explicit non-443 port)”. Better: strip any :port from HOST before using it in --resolve.
Suggested change (strip port from hostname)
-  HOST=$(printf "%s" "'$URL'" | awk -F/ "{print \$3}")
+  HOST=$(printf "%s" "'$URL'" \
+    | sed -E 's#^https?://([^/]+).*$#\1#' \
+    | cut -d: -f1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`
around lines 66 - 71, The HOST extraction can include an explicit port (e.g.
host:1234) which breaks the intended SNI/--resolve mapping; update the script so
after computing HOST you strip any ":port" suffix into a new variable (e.g.
HOST_NO_PORT) and use that when calling --resolve and when passing the SNI/Host
header, and add a short validation note that URL should be https://... or fail
early if scheme is not https; locate the HOST assignment and all occurrences of
--resolve "$HOST:443:$ip" (and any direct uses of HOST for TLS/SNI) and replace
them to use the port-stripped hostname variable instead.
54-56: ⚡ Quick win

-k/--insecure should be framed as troubleshooting-only.

Both probes use curl -sk. -k suppresses TLS verification, which is sometimes acceptable for internal endpoints, but the doc reads like a general procedure; it’s easy for readers to copy/paste and accidentally normalize insecure TLS behavior.

Consider adding a one-liner near the snippets, e.g. “Remove -k if your CA is trusted; keep it only when certs are self-signed/untrusted.”

Also applies to: 71-72
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`
around lines 54 - 56, The curl examples use the -k/--insecure flag (see the
"curl -sk --noproxy \"*\" --connect-timeout 2 -w ..." snippet), which should be
explicitly marked as troubleshooting-only; add a concise one-line note adjacent
to both curl snippets advising to remove -k when the CA is trusted (or replace
with a proper --cacert/CA-config) and to only use -k for self-signed/untrusted
certs during troubleshooting so readers don’t normalize insecure TLS behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`:
- Around line 48-76: The curl probes (kubectl run curl-lb and curl-bypass) only
use --connect-timeout which limits TCP connect but not the whole request, so the
probe can hang during TLS handshake or slow reads; update both curl invocations
in the curl-lb and curl-bypass command strings to add a total timeout flag like
--max-time 10 (or -m 10) alongside --connect-timeout 2 to ensure each curl call
cannot block indefinitely and the probe loops continue to iterate.

---

Nitpick comments:
In
`@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md`:
- Around line 118-119: Update the contradictory cleanup wording in the paragraph
that currently contains the phrases “Leaving them running indefinitely is fine
operationally” and “the loop runs forever—delete the namespace when finished”:
rephrase to say leaving probe pods running is acceptable only for the duration
of your investigation and explicitly recommend deleting the namespace (or
stopping the probes) immediately after the investigation completes to avoid
ongoing load balancer costs.
- Around line 66-71: The HOST extraction can include an explicit port (e.g.
host:1234) which breaks the intended SNI/--resolve mapping; update the script so
after computing HOST you strip any ":port" suffix into a new variable (e.g.
HOST_NO_PORT) and use that when calling --resolve and when passing the SNI/Host
header, and add a short validation note that URL should be https://... or fail
early if scheme is not https; locate the HOST assignment and all occurrences of
--resolve "$HOST:443:$ip" (and any direct uses of HOST for TLS/SNI) and replace
them to use the port-stripped hostname variable instead.
- Around line 54-56: The curl examples use the -k/--insecure flag (see the "curl
-sk --noproxy \"*\" --connect-timeout 2 -w ..." snippet), which should be
explicitly marked as troubleshooting-only; add a concise one-line note adjacent
to both curl snippets advising to remove -k when the CA is trusted (or replace
with a proper --cacert/CA-config) and to only use -k for self-signed/untrusted
certs during troubleshooting so readers don’t normalize insecure TLS behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 34ccbf8d-4243-4a53-9b5f-4fb610aae498

📥 Commits

Reviewing files that changed from the base of the PR and between c8b7b50 and 4d36b3b.

📒 Files selected for processing (1)

docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md

coderabbitai · 2026-05-02T10:26:52Z

+3. **Run the LB-fronted probe** — this is what your users see. It records per-request latency and HTTP response code:
+
+   ```bash
+   kubectl -n ingress-latency-probe run curl-lb --restart=Never \
+     --image=curlimages/curl:8.10.1 --command -- sh -ec '
+       while :; do
+         curl -sk --noproxy "*" --connect-timeout 2 \
+           -w "remote_ip=%{remote_ip} code=%{response_code} connect=%{time_connect} total=%{time_total}\n" \
+           -o /dev/null "'$URL'" || true
+         sleep 5
+       done'
+   ```
+
+4. **Run the bypass probe** — same URL, but resolve the hostname to each gateway IP in turn via `curl --resolve`. TLS SNI still matches the certificate because the Host header and SNI host stay the same.
+
+   ```bash
+   kubectl -n ingress-latency-probe run curl-bypass --restart=Never \
+     --image=curlimages/curl:8.10.1 --command -- sh -ec '
+       HOST=$(printf "%s" "'$URL'" | awk -F/ "{print \$3}")
+       while :; do
+         for ip in '"$(echo $GATEWAY_IPS | tr "\n" " ")"'; do
+           curl -sk --noproxy "*" --connect-timeout 2 \
+             --resolve "$HOST:443:$ip" \
+             -w "remote_ip=%{remote_ip} code=%{response_code} connect=%{time_connect} total=%{time_total}\n" \
+             -o /dev/null "'$URL'" || true
+         done
+         sleep 5
+       done'
+   ```


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

--connect-timeout isn’t enough: add --max-time (or -m) to prevent curl hangs.

Both probe loops run forever and each iteration executes curl ... --connect-timeout 2 .... That only caps connection establishment. TLS handshake, waiting for first byte, or slow reads can still make the curl command block well beyond the intended time window, which can (a) make your “fast/slow” classification misleading and (b) stall the probe loop entirely.

Add a total timeout, e.g. --max-time 10 (or similar), to both curl invocations:

Suggested change

- curl -sk --noproxy "*" --connect-timeout 2 \ + curl -sk --noproxy "*" --connect-timeout 2 --max-time 10 \ -w "remote_ip=%{remote_ip} code=%{response_code} connect=%{time_connect} total=%{time_total}\n" \ -o /dev/null "'$URL'" || true

and

- curl -sk --noproxy "*" --connect-timeout 2 \ + curl -sk --noproxy "*" --connect-timeout 2 --max-time 10 \ --resolve "$HOST:443:$ip" \ -w "remote_ip=%{remote_ip} code=%{response_code} connect=%{time_connect} total=%{time_total}\n" \ -o /dev/null "'$URL'" || true

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/en/solutions/Localising_Ingress_Latency_Is_It_the_External_Load_Balancer_or_the_Gateway_Pods.md` around lines 48 - 76, The curl probes (kubectl run curl-lb and curl-bypass) only use --connect-timeout which limits TCP connect but not the whole request, so the probe can hang during TLS handshake or slow reads; update both curl invocations in the curl-lb and curl-bypass command strings to add a total timeout flag like --max-time 10 (or -m 10) alongside --connect-timeout 2 to ensure each curl call cannot block indefinitely and the probe loops continue to iterate.

…ng and an LB-bypass probe

jing2uo requested a review from oilbeater April 22, 2026 15:34

jing2uo temporarily deployed to translate April 22, 2026 15:34 — with GitHub Actions Inactive

jing2uo requested review from black89757, danielfbm and fanzy618 April 22, 2026 15:34

danielfbm removed their request for review April 24, 2026 02:29

jing2uo temporarily deployed to translate May 2, 2026 10:24 — with GitHub Actions Inactive

coderabbitai Bot reviewed May 2, 2026

View reviewed changes

jing2uo temporarily deployed to translate May 2, 2026 13:40 — with GitHub Actions Inactive

[networking] Localising ingress request latency on ACP with curl timi…

8f2b7e7

…ng and an LB-bypass probe

jing2uo force-pushed the kb/2026-04/localising-ingress-latency-is-it-the-ext branch from fd3176a to 8f2b7e7 Compare May 27, 2026 03:22

jing2uo changed the title ~~[networking] Localising Ingress Latency — Is It the External Load Balancer or the Gateway Pods?~~ [networking] Localising ingress request latency on ACP with curl timing and an LB-bypass probe May 27, 2026

jing2uo temporarily deployed to translate May 27, 2026 03:22 — with GitHub Actions Inactive

jing2uo merged commit e3969b4 into main May 27, 2026
2 checks passed

jing2uo deleted the kb/2026-04/localising-ingress-latency-is-it-the-ext branch May 27, 2026 03:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[networking] Localising ingress request latency on ACP with curl timing and an LB-bypass probe#168

[networking] Localising ingress request latency on ACP with curl timing and an LB-bypass probe#168
jing2uo merged 1 commit into
mainfrom
kb/2026-04/localising-ingress-latency-is-it-the-ext

jing2uo commented Apr 22, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Apr 22, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jing2uo commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jing2uo commented Apr 22, 2026 •

edited

Loading

coderabbitai Bot commented Apr 22, 2026 •

edited

Loading