WaffleBits/README.md

Cleared U.S. cyber operations specialist building secure AI infrastructure, inference reliability tooling, mission decision systems, and performance-sensitive engineering projects.

I am strongest where backend/platform engineering meets high-stakes operations: securing model-serving paths, measuring inference reliability, translating ambiguous workflows into operational software, and building deterministic systems that can be tested under pressure.

Resume website: wafflebits.github.io/WaffleBits

Technical Focus

  • AI compute and inference infrastructure: custom Triton kernels, cache-controlled GPU benchmarks, model-serving gateways, inference benchmarks, Docker- and Kubernetes-oriented deployment design, Prometheus-compatible artifacts, latency/throughput regression checks, token and GPU-hour capacity modeling, and production ML reliability.
  • Inference runtime performance: Rust continuous batching, paged KV-cache admission, deterministic replay, canary/shadow release gates, fused Triton kernels, PyTorch and torch.compile baselines, workload profiling, tail-latency analysis, and regression harnesses.
  • Hardware/software co-design: ONNX graph ingestion, mixed analog-digital partitioning, explicit accelerator and memory cost models, SystemVerilog control/datapath blocks, FPGA synthesis flows, and model-level quality evaluation with clear simulation-versus-measurement boundaries.
  • Infrastructure security: authentication, authorization, RBAC, rate limits, audit trails, policy enforcement, threat modeling, secure service boundaries, and production extension paths such as OIDC, mTLS, external policy engines, and key management.
  • Forward-deployed / mission engineering: turning ambiguous operational data into working tools for root-cause analysis, what-if planning, observable delivery, and decision support.
  • Quantitative systems engineering: deterministic matching, market microstructure, C++20, latency distributions, cross-language correctness gates, and Linux fundamentals.

Current Stack

  • Languages: Rust, Python, C++20, SystemVerilog, TypeScript, Java; building toward deeper CUDA systems work.
  • Backend/platform: FastAPI, REST APIs, Docker, Linux, CI, service boundaries, testable architecture, and Kubernetes deployment shapes.
  • AI infrastructure: Triton GPU kernels, FP32-accumulating correctness oracles, launch autotuning, cache-cold CUDA-event benchmarking, PyTorch compile comparisons, latency percentiles, token throughput, GPU-hour capacity, cost-to-serve estimates, failure accounting, exact-output batch-invariance checks, Prometheus output, and GPU-aware reliability.
  • Accelerator co-design: ONNX, analytical performance/energy models, analog non-ideality simulation, banked-SRAM traffic analysis, SystemVerilog, Icarus Verilog, Verilator, Yosys, OpenLane configuration, and FPGA schedule execution.
  • Security: access control, policy enforcement, audit logging, rate limiting, public-safe threat modeling, incident response, and secure service design.
  • Product judgment: synthetic operational data modeling, command-facing workflows, explainable recommendations, reviewer-friendly docs, stakeholder translation, and public-safe portfolio discipline.

Role Alignment

  • AI compute and inference infrastructure teams: distributed services, model-serving reliability, Kubernetes-oriented operations, observability, inference benchmarking, performance regression tracking, and hardware-aware debugging.
  • Inference runtime and performance teams: Rust scheduling, paged KV admission, deterministic replay, canary/shadow/rollback validation, Triton kernel development, cache-state control, tail-latency investigation, transparent cost modeling, and native C++ performance measurement.
  • Accelerator and hardware/software co-design teams: compiler partitioning, cost-model assumptions, memory hierarchy analysis, low-precision datapaths, RTL verification, FPGA synthesis, and model-level accuracy tradeoff analysis.
  • Infrastructure security teams: secure access paths, service boundaries, policy enforcement, audit evidence, threat models, incident runbooks, and controls around AI workloads.
  • Forward-deployed AI / government engineering teams: cleared mission context, stakeholder translation, full-stack prototypes, data-backed workflows, observable systems, and delivery under ambiguous requirements.
  • Quantitative systems teams: deterministic execution, market mechanics, Linux fundamentals, C++20, latency measurement, oracle testing, and strong CS fundamentals.

Evidence Map

  • GPU kernel performance: Triton Kernel Lab shows fused RMSNorm and SwiGLU kernels, FP32 oracle validation, shape-aware launch autotuning, cache-cold and cache-hot modes, raw timing samples, p50/p95/p99 tails, torch.compile comparison, and a machine-readable regression gate measured on an RTX 5070 Ti.
  • Hardware/software co-design: HeteroCore connects ONNX compilation, analog non-ideality simulation, SRAM/DRAM traffic modeling, synthesizable SystemVerilog, and FPGA schedule execution through a versioned execution plan. Projected and simulated results are labeled separately from synthesis outputs and physical measurements.
  • Compute / inference infrastructure: the triton-inference-benchmark toolkit shows concurrency control, latency percentiles, token throughput, requests per GPU-hour, normalized cost-to-serve estimates, retry/failure accounting, exact-output checks across isolated and concurrent execution, Prometheus output, baseline/candidate regression reports, and a Kubernetes Job shape for cluster-local runs.
  • Inference runtime engineering: Rust Inference Runtime implements stable priority admission, bounded prefill work, conservative paged KV reservations, round-robin decode scheduling, deterministic trace fingerprints, and baseline/candidate promote, hold, and rollback release decisions.
  • Secure AI platform engineering: Secure GPU Inference Gateway shows authenticated model access, RBAC, reason-for-access policy, audit trails, metrics, SLO notes, incident runbooks, and extension points for OIDC, mTLS, KMS, GPU telemetry, and external policy engines.
  • Forward-deployed mission software: Readiness Control Tower shows public-safe operational data modeling, root-cause scoring, what-if analysis, recommendations, full-stack workflow design, Docker, and tests.
  • Systems / quant fundamentals: Market Microstructure Engine pairs a Python correctness oracle with a dependency-free C++20 core, deterministic parity checks, latency distributions, and measured native throughput.

Featured Work

Triton Kernel Lab

Correctness-first GPU kernel lab with fused Triton RMSNorm and SwiGLU implementations, FP32 oracles, and controlled comparison with PyTorch eager and torch.compile.

The June 14 RTX 5070 Ti report records 100 warmups and 500 cache-cold samples per case, correctness errors, p50/p95/p99/max latency, environment metadata, and FP16 p50 speedups of 1.25x-2.21x for RMSNorm and 1.07x-2.07x for SwiGLU over the compiled baselines.
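The oracle-validation idea can be illustrated without a GPU. The following is a minimal sketch, not the repository's code: it uses only the Python standard library, a Python float (FP64) reference stands in for the FP32 oracle, and the "candidate" is simply the same math rounded to FP16. The function names, input values, and the `1e-3` tolerance are all assumptions chosen for illustration.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def rmsnorm_reference(x, weight, eps=1e-6):
    """High-precision reference: y_i = x_i / sqrt(mean(x^2) + eps) * w_i."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rmsnorm_fp16_candidate(x, weight, eps=1e-6):
    """Hypothetical stand-in for a kernel under test: output rounded to FP16."""
    return [to_fp16(v) for v in rmsnorm_reference(x, weight, eps)]


def max_abs_error(candidate, reference):
    """Largest elementwise absolute difference between candidate and oracle."""
    return max(abs(c - r) for c, r in zip(candidate, reference))


# Illustrative inputs: 64 FP16-representable values and unit weights.
x = [to_fp16(0.01 * (i - 32)) for i in range(64)]
w = [1.0] * 64

ref = rmsnorm_reference(x, w)
out = rmsnorm_fp16_candidate(x, w)
err = max_abs_error(out, ref)

# The tolerance is an assumption for this sketch, not a reported threshold.
assert err < 1e-3, f"candidate drifted from the oracle: {err}"
```

A real kernel test would swap the stand-in candidate for the fused Triton output and keep the oracle and tolerance check unchanged.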

Rust Inference Runtime

Deterministic, accelerator-agnostic runtime core for continuous batching, paged KV-cache admission, replayable scheduling traces, and canary/shadow release validation.

The checked workload completes four synthetic requests in 11 scheduler ticks, peaks at 12 of 20 KV pages, returns all reservations on completion, and emits a stable trace fingerprint. Separate fixtures exercise promote and rollback policy paths through exact output, error-rate, coverage, and p95 latency checks.
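To make conservative paged admission and trace fingerprinting concrete, here is a small Python sketch. It is not the Rust implementation and does not reproduce the workload figures above. The page size, the request shapes, the one-token-per-tick decode rule, and the SHA-256 over a JSON trace are all assumptions made for illustration.

```python
import hashlib
import json
from collections import deque
from dataclasses import dataclass

PAGE_TOKENS = 16  # hypothetical page size, not taken from the repository


@dataclass
class Request:
    name: str
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0

    def pages_needed(self) -> int:
        # Conservative: reserve for the prompt plus the full decode budget.
        total = self.prompt_tokens + self.max_new_tokens
        return -(-total // PAGE_TOKENS)  # ceiling division


def run(requests, total_pages):
    """Simulate admission and decode.

    Returns (ticks, peak_pages, free_pages, fingerprint).
    """
    waiting, running = deque(requests), []
    reserved, trace = {}, []
    ticks = peak = 0
    while waiting or running:
        ticks += 1
        # Admit in arrival order; stop at the first request that does not fit.
        while waiting:
            need = waiting[0].pages_needed()
            if need > total_pages - sum(reserved.values()):
                break
            req = waiting.popleft()
            reserved[req.name] = need
            running.append(req)
            trace.append([ticks, "admit", req.name, need])
        if not running:
            raise RuntimeError("head-of-line request can never fit in the pool")
        peak = max(peak, sum(reserved.values()))
        # Each running request decodes one token per tick.
        for req in list(running):
            req.generated += 1
            if req.generated >= req.max_new_tokens:
                running.remove(req)
                # Finishing returns the whole reservation to the pool.
                trace.append([ticks, "finish", req.name, reserved.pop(req.name)])
    digest = hashlib.sha256(json.dumps(trace).encode()).hexdigest()
    return ticks, peak, total_pages - sum(reserved.values()), digest


def workload():
    """Four synthetic requests with made-up sizes."""
    return [Request("a", 30, 2), Request("b", 20, 4),
            Request("c", 40, 3), Request("d", 10, 1)]


ticks, peak, free, fingerprint = run(workload(), total_pages=5)
assert free == 5  # every reservation was returned
```

Because the simulation has no randomness and no wall-clock input, replaying the same workload yields the same fingerprint, which is the property a replay check relies on.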

HeteroCore

Compiler and analytical cost model for mixed analog-digital AI inference, linked to separate analog simulation, memory hierarchy, RTL, and FPGA repositories through a versioned execution plan.

Covers ONNX import, explainable operator placement, peripheral-aware energy sensitivity, model-level quality evaluation, banked-SRAM traffic analysis, synthesizable INT8 datapaths, self-checking simulation, and FPGA synthesis. Analytical projections, simulations, synthesis outputs, and future board measurements are explicitly distinguished.

Readiness Control Tower

Synthetic mission readiness platform that fuses sortie, maintenance, supply, personnel, and outage data into a command-facing decision surface.

Covers operational data modeling, FastAPI service design, React/TypeScript workflow design, root-cause scoring, what-if analysis, Docker, tests, and public-safe mission framing.

Triton Inference Benchmark

Distributed inference benchmarking toolkit for Triton-compatible model-serving workflows.

Covers Python load generation, configurable concurrency, retry-aware execution, p50/p95/p99 latency, throughput, success-rate reporting, JSON outputs, and a clean path from mock CI to live inference testing.
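As a rough illustration of the reporting step, the sketch below summarizes a list of request results into success rate, throughput, and nearest-rank latency percentiles. The result shape `(latency_ms, ok)`, the choice to measure latency over successful requests only, and the output key names are assumptions, not the toolkit's actual schema.

```python
import json


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]


def summarize(results, wall_seconds):
    """Summarize (latency_ms, ok) pairs collected over wall_seconds."""
    ok_latencies = [latency for latency, ok in results if ok]
    return {
        "requests": len(results),
        "success_rate": len(ok_latencies) / len(results),
        "throughput_rps": len(ok_latencies) / wall_seconds,
        "p50_ms": percentile(ok_latencies, 50),
        "p95_ms": percentile(ok_latencies, 95),
        "p99_ms": percentile(ok_latencies, 99),
    }


# Synthetic run: 98 successes with latencies 1..98 ms, plus 2 failures.
results = [(float(i), True) for i in range(1, 99)] + [(250.0, False)] * 2
summary = summarize(results, wall_seconds=10.0)
print(json.dumps(summary, indent=2))
```

Nearest-rank is used here because it always returns an observed sample; an interpolating method would give slightly different tail values.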

Includes Prometheus text export, baseline-versus-candidate regression reporting, batch-invariance probes under concurrent noise traffic, token-throughput and GPU-capacity metrics, explicit accelerator/energy cost assumptions, normalized cost-to-serve estimates, operations notes, and a Kubernetes Job shape for cluster-local benchmark runs.
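A baseline-versus-candidate gate and a normalized cost estimate can both be expressed in a few lines. The sketch below is illustrative only: the metric names, the 10% tolerance, the GPU hourly price, and the throughput figure are hypothetical inputs, not values from the project.

```python
def regression_report(baseline, candidate, max_increase=0.10):
    """Flag metrics where the candidate exceeds baseline by more than max_increase.

    Assumes every metric is "lower is better" (for example, latency in ms).
    """
    failures = {}
    for name, base in baseline.items():
        limit = base * (1.0 + max_increase)
        if candidate[name] > limit:
            failures[name] = {"baseline": base, "candidate": candidate[name],
                              "limit": limit}
    return failures


def cost_per_million_tokens(tokens_per_second, gpu_hourly_cost):
    """Normalize an assumed hourly GPU price by measured token throughput."""
    tokens_per_gpu_hour = tokens_per_second * 3600
    return gpu_hourly_cost / tokens_per_gpu_hour * 1_000_000


# Hypothetical numbers for illustration.
baseline = {"p50_ms": 50.0, "p99_ms": 99.0}
candidate = {"p50_ms": 52.0, "p99_ms": 130.0}

failures = regression_report(baseline, candidate)
gate_passed = not failures

cost = cost_per_million_tokens(tokens_per_second=2500.0, gpu_hourly_cost=2.0)
print(f"gate_passed={gate_passed} failing={sorted(failures)} "
      f"cost_per_1M_tokens={cost:.4f}")
```

Keeping the price and tolerance as explicit arguments makes the assumptions behind each estimate visible in the report rather than buried in the code.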

Secure GPU Inference Gateway

Security-focused AI infrastructure project for authenticated model access, RBAC, rate limiting, audit logs, policy checks, and observability.

Covers per-model authorization, reason-for-access enforcement, structured audit logs, Prometheus-compatible metrics, Kubernetes health/scrape posture, SLO notes, incident runbooks, tests, and production extension points such as OIDC, mTLS, KMS, GPU telemetry, and external policy engines.
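The way these controls compose can be sketched as one decision function: a role check, a reason-for-access check, a token-bucket rate limit, and an audit record for every decision. This is a simplified illustration, not the gateway's code. The role table, model names, decision strings, and audit fields are hypothetical, and time is passed in explicitly so the behavior stays deterministic.

```python
# Hypothetical role-to-model grants for illustration.
ROLE_MODELS = {
    "analyst": {"summarizer"},
    "admin": {"summarizer", "classifier"},
}


class TokenBucket:
    """Simple token bucket; the caller supplies the current time in seconds."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.updated = 0.0

    def allow(self, now):
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def authorize(role, model, reason, bucket, now, audit_log):
    """Return True if the request may proceed; always append an audit record."""
    if model not in ROLE_MODELS.get(role, set()):
        decision = "deny:role"
    elif not reason.strip():
        decision = "deny:missing_reason"
    elif not bucket.allow(now):
        decision = "deny:rate_limited"
    else:
        decision = "allow"
    audit_log.append({"ts": now, "role": role, "model": model,
                      "reason": reason, "decision": decision})
    return decision == "allow"


audit = []
bucket = TokenBucket(capacity=2, refill_per_second=1.0)
decisions = [
    authorize("analyst", "summarizer", "ticket-123", bucket, 0.0, audit),
    authorize("analyst", "summarizer", "ticket-123", bucket, 0.0, audit),
    authorize("analyst", "summarizer", "ticket-123", bucket, 0.0, audit),
    authorize("analyst", "summarizer", "ticket-123", bucket, 1.0, audit),
    authorize("analyst", "classifier", "ticket-123", bucket, 1.0, audit),
    authorize("admin", "classifier", "  ", bucket, 1.0, audit),
]
```

Checks run cheapest and most specific first, so a request denied for role or missing reason does not consume a rate-limit token.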

Market Microstructure Engine

Low-level matching engine and backtesting project for limit-order-book mechanics, deterministic execution, latency measurement, and market simulation.

Covers price-time priority, integer tick prices, partial fills, market orders, cancellations, deterministic snapshots, Python/C++20 parity checks, native edge-case tests, and p50/p95/p99/max latency reporting.
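Price-time priority with integer tick prices and partial fills can be shown with two heaps. The sketch below is a deliberately small illustration in the spirit of a correctness oracle, not the project's implementation: it handles limit orders only, omits cancellations and market orders, and its class and method names are assumptions.

```python
import heapq
from itertools import count


class OrderBook:
    """Limit-order book with price-time priority and integer tick prices."""

    def __init__(self):
        self._bids = []  # heap of (-price, seq, order): highest price first
        self._asks = []  # heap of (price, seq, order): lowest price first
        self._seq = count()  # arrival counter breaks ties at equal price

    def submit(self, side, price, qty):
        """Match a limit order; return fills as (price, qty) and rest any remainder."""
        is_buy = side == "buy"
        own, opposite = (self._bids, self._asks) if is_buy else (self._asks, self._bids)
        fills = []
        while qty > 0 and opposite:
            key, _, resting = opposite[0]
            resting_price = key if is_buy else -key
            crosses = price >= resting_price if is_buy else price <= resting_price
            if not crosses:
                break
            traded = min(qty, resting["qty"])
            fills.append((resting_price, traded))  # trade at the resting price
            qty -= traded
            resting["qty"] -= traded
            if resting["qty"] == 0:
                heapq.heappop(opposite)
        if qty > 0:
            key = -price if is_buy else price
            heapq.heappush(own, (key, next(self._seq), {"qty": qty}))
        return fills


book = OrderBook()
book.submit("sell", 101, 5)
book.submit("sell", 100, 3)  # earlier order at the best ask price
book.submit("sell", 100, 4)  # later order at the same price

first = book.submit("buy", 100, 5)    # fills the earlier 100 ask, then part of the later one
second = book.submit("buy", 101, 10)  # sweeps the remaining asks and rests the remainder
third = book.submit("sell", 101, 1)   # hits the resting bid
```

The arrival counter in each heap entry is what enforces time priority: two orders at the same price are compared by sequence number, so the order dictionaries themselves are never compared.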

Next Build Priorities

  1. Extend the Triton kernel lab with Nsight Compute counters, roofline analysis, and controlled hardware-counter reports.
  2. Connect the Rust runtime core to a vLLM/SGLang-compatible backend adapter and mirrored-observation format for streaming and tail-latency validation.
  3. Load compiler-generated HeteroCore tiles through a host interface and record physical FPGA timing, utilization, and wall-power measurements.
  4. Add distributed rate limiting, OpenTelemetry export, and Grafana dashboard screenshots to the secure GPU inference gateway.
  5. Extend the Kubernetes, metrics, SLO, rollback, and runbook pattern to the Readiness Control Tower repository.
  6. Add Linux performance-counter capture, cache-aware data-structure comparisons, and replay-style market data ingestion to the C++20 matching engine.

Public-Safe Portfolio Note

All public repositories use synthetic data, mock integrations, or open tooling. I do not publish operational, classified, proprietary, government-furnished, or sensitive customer data.

Pinned

  1. triton-inference-benchmark (Python): Triton inference benchmark with telemetry, correctness gates, and cost-to-serve modeling.
  2. market-microstructure-engine (C++): Deterministic Python and C++20 matching engine with parity and latency benchmarks.
  3. readiness-control-tower (TypeScript): Synthetic mission readiness analytics dashboard with FastAPI, React, and public-safe operational data.
  4. secure-gpu-inference-gateway (Python): Security-focused AI inference gateway with RBAC, rate limits, audit logs, and mock GPU backend.
  5. triton-kernel-lab (Python): Correctness-first Triton RMSNorm and SwiGLU kernels with cache-controlled GPU benchmarks and regression gates.
  6. rust-inference-runtime (Rust): Deterministic Rust inference scheduler with paged KV admission and release gates.