perf: set TCP_NODELAY, cheap allocs, socket tuning, postprocess deferral on engine_* path#27
perf: set TCP_NODELAY, cheap allocs, socket tuning, postprocess deferral on engine_* path#27KA-ROM wants to merge 13 commits into
Conversation
| // Per-worker cache of (user_agent -> Counter) so that we only pay the | ||
| // `Family::get_or_create` cost the first time a UA is seen on this | ||
| // worker. Lookup is `&str`-keyed (no allocation on hit). | ||
| client_info_cache: HashMap<String, Counter>, |
There was a problem hiding this comment.
Can't the user agent just be anything the client sets? Do you need a metric that uses that as a label?
There was a problem hiding this comment.
generally, yes, it can.
practically, it's what rollup-boost sends (which is its version that doesn't change often)
| .conn_keep_alive(2 * timeout) | ||
| // 100ms grace is fine for loopback (op-rbuilder); raise if backend | ||
| // is ever non-loopback (would risk truncating responses on shutdown). | ||
| .disconnect_timeout(Duration::from_millis(100)) |
There was a problem hiding this comment.
Should this be configurable?
| // Per-worker cached metric handles. These are cheap to clone (each | ||
| // wraps an `Arc<AtomicU64>` internally) and bypass the per-request | ||
| // `Family::get_or_create` lookup on the hot path. | ||
| in_flight_client: Gauge, | ||
| in_flight_backend: Gauge, | ||
| proxy_failure_count: Counter, |
There was a problem hiding this comment.
suggestion:
perhaps better to create ProxyHttpMetrics type, intantiate it from shared.metrics, and make it a member of ProxyHttp => this way adding new metrics will require less boiler-plate.
| matches!( | ||
| name.as_str().to_ascii_lowercase().as_str(), | ||
| "connection" | "host" | "keep-alive" | "transfer-encoding" | ||
| ) | ||
| name, | ||
| &header::CONNECTION | | ||
| &header::HOST | | ||
| &header::TRANSFER_ENCODING | ||
| ) || name.as_str().eq_ignore_ascii_case("keep-alive") |
There was a problem hiding this comment.
question:
HTTP headers are case-insensitive. (e.g. CONNECTION, connection, and CoNnEcTiOn are all valid and all refer to the same header).
does this change respect that?
| // Hot path: read-only lookup keyed by &str (no allocation). | ||
| // Cold path: allocate the owned String key and resolve the | ||
| // counter from the metrics family exactly once per worker per | ||
| // distinct UA. | ||
| if let Some(entry) = this.client_info_cache.read_sync(user_agent, |_, c| c.clone()) { | ||
| entry.inc(); | ||
| } else { | ||
| let counter = metrics | ||
| .client_info | ||
| .get_or_create(&LabelsProxyClientInfo { | ||
| proxy: P::name(), | ||
| user_agent: user_agent.to_string(), | ||
| }) | ||
| .clone(); | ||
| counter.inc(); | ||
| // soft cap to prevent unbounded growth from hostile UA values | ||
| if this.client_info_cache.len() < 100 { | ||
| // Best-effort insert; races with another task on the same | ||
| // worker thread shouldn't happen given !Send actix workers, | ||
| // but tolerate insert errors regardless. | ||
| let _ = this.client_info_cache.insert_sync(user_agent.to_string(), counter); | ||
| } | ||
| } |
There was a problem hiding this comment.
suggestion:
this would be nice to encapsulate in ProxyHttpMetrics too
| let tcp_nodelay = actix_tls::connect::Connector::new( | ||
| actix_tls::connect::Resolver::default(), | ||
| ) | ||
| .service() | ||
| .map(|conn: actix_tls::connect::Connection<awc::http::Uri, tokio::net::TcpStream>| { | ||
| let _ = conn.io_ref().set_nodelay(true); | ||
| let _ = socket2::SockRef::from(conn.io_ref()).set_tcp_keepalive( | ||
| &socket2::TcpKeepalive::new().with_time(Duration::from_secs(60)), | ||
| ); | ||
| conn | ||
| }); |
There was a problem hiding this comment.
question:
w.d.y.t. about putting this behind CLI-switch? (so that the user could choose on case-by-case).
(TCP_NODELAY could still be the default).
|
|
||
| // Build the inner TCP connector ourselves so we can flip | ||
| // TCP_NODELAY on after each connect | ||
| use actix_service::ServiceExt as _; |
There was a problem hiding this comment.
suggestion:
imports should go to the top
| .connector(Connector::new().conn_keep_alive(2 * timeout).limit(connections_limit)) | ||
| .connector( | ||
| Connector::new() | ||
| .connector(tcp_nodelay) |
There was a problem hiding this comment.
question:
what about client side? originally I thought about setting TCP_NODELAY up there
| // Initiate the response stream first so the client sees no extra | ||
| // bookkeeping latency on the critical path, then file the request | ||
| // into the in-flight map for the eventual response postprocessor. | ||
| // The insert must complete before the response body stream finishes | ||
| // (which is when `postprocess_backend_response` calls `remove_sync`); | ||
| // this is guaranteed because actix won't begin polling the streaming | ||
| // body until after this synchronous handler returns. | ||
| let this_clone = this.clone(); | ||
| let res = Self::stream_to_client(this, req_id, conn_id, bknd_res); | ||
| this_clone.postprocess_client_request(req); | ||
| res |
There was a problem hiding this comment.
issue:
client request is post-processed (i.e. inserted into in-flight requests hashmap) first (i.e. before streaming the request to the backend) for a reason.
otherwise, there is a potential for race-condition when the response arrives while the request info is still not inserted.
| // Fast path: when nobody is logging the proxied request/response and | ||
| // there are no mirror peers configured, skip the (expensive) response | ||
| // decompress + parse and the (always-evaluated) `info!` formatting in | ||
| // `maybe_log_proxied_request_and_response`. We still parse the request | ||
| // for the jrpc method label so metrics keep their correct dimensions. | ||
| let log_off = | ||
| !inner.config().log_proxied_requests() && !inner.config().log_proxied_responses(); | ||
| let no_mirror = mirroring_peers.is_empty(); | ||
| if log_off && no_mirror { | ||
| // If the request body has already been parsed (intercept path) | ||
| // we can avoid a second decompress + parse. | ||
| if let Some(jrpc) = clnt_req.jrpc_meta.clone() { | ||
| Self::emit_metrics_on_proxy_success(&jrpc, &clnt_req, &bknd_res, metrics); | ||
| return; | ||
| } | ||
| if clnt_req.decompressed_size < clnt_req.size { | ||
| (clnt_req.decompressed_body, clnt_req.decompressed_size) = decompress( | ||
| clnt_req.body.clone(), | ||
| clnt_req.size, | ||
| clnt_req.info.content_encoding(), | ||
| ); | ||
| } | ||
| match serde_json::from_slice::<JrpcRequestMetaMaybeBatch>(&clnt_req.decompressed_body) { | ||
| Ok(jrpc) => { | ||
| Self::emit_metrics_on_proxy_success(&jrpc, &clnt_req, &bknd_res, metrics); | ||
| } | ||
| Err(err) => { | ||
| warn!( | ||
| proxy = P::name(), | ||
| request_id = %clnt_req.info.req_id, | ||
| connection_id = %clnt_req.info.conn_id, | ||
| worker_id = %worker_id, | ||
| error = ?err, | ||
| "Failed to parse json-rpc request", | ||
| ); | ||
| } | ||
| } | ||
| return; | ||
| } | ||
|
|
||
| if clnt_req.decompressed_size < clnt_req.size { |
There was a problem hiding this comment.
question/issue:
finalise_proxying happens off the hot-path (request received/forwarded, response received/forwarded).
why do we need to optimise things here? (it makes things more complicated, but what do we win with this?)
| // Stash the parsed jrpc so `finalise_proxying` can avoid | ||
| // re-parsing the same body. | ||
| Some(Arc::new(jrpc)) |
There was a problem hiding this comment.
question:
how often is this code-path hit in practice?
| let mut clnt_res = Self::to_client_response(&bknd_res); | ||
|
|
||
| let preallocate = this.shared.config().prealloacated_response_buffer_size(); | ||
| let content_encoding = bknd_res.headers().get(header::CONTENT_ENCODING).cloned(); |
There was a problem hiding this comment.
question:
are we sure content-encoding is the only header we want to send back to the client?
Summary
Hot-path tuning for the
engine_*proxy path: removes per-request allocations and redundant lookups, tunes backend TCP socket options, and defers/skips postprocess when not needed. One file changed (proxy.rs) plus a small helper. No flag or config surface added.Removed from the hot path
is_hop_by_hop_headermatchesHeaderNameconstants directly; per-call lowercasedStringis gone.Gauge/Counter+proxy_failure_counthandles resolved once per worker, cached onProxyHttp.scc::HashMap<String, Counter>keyed by&str, soft-capped at 100 entries to bound memory under hostile UAs.connection_info()bound once/request instead of 5x.headers_mut().append(...)skips reparsing names through awc's builder.HeaderName::from_str(backend names already valid).HeaderMapclone: capture onlycontent-encoding(the one field read downstream).JrpcRequestMetaMaybeBatchstashed onProxiedHttpRequest, reused byfinalise_proxying.Behavior changes (observable on wire / in ops)
TCP_NODELAYon backend conns: disables Nagle; cuts up to ~40ms tail on smallengine_*pairs (Nagle + LinuxTCP_DELACK_MIN=40ms).TCP_KEEPALIVEon backend conns (60s): catches silently-dead backends so the idle pool stops handing out broken sockets.disconnect_timeouttightened to 100ms: faster idle-conn reclamation. Safe for loopback (op-rbuilder); code comment flags it if backend moves off-host. Per awc 3.8.1, this only affects the idle pool, not in-flight requests.stream_to_client(intra-handler reorder, not a spawn; relies on actix not polling response bodies until the handler returns). Response parse is skipped entirely when neither logging nor mirroring is enabled; metrics path preserved.Unchanged
History
Squashed three explore-then-revert commit pairs (17=>13) for review clarity.