From a9cff26a0555f18f03dbe9b5c3418fd16a0c3d9f Mon Sep 17 00:00:00 2001
From: DiFanrui <2829107824@qq.com>
Date: Sun, 21 Jun 2026 03:53:16 +0800
Subject: [PATCH] Implement T1-1-9 operators: frac, scatter_add,
 multilabel_margin_loss, fractional_max_pool2d/3d

All five operators implemented with NineToothed kernels:
- frac: element_wise arrangement with where/floor/ceil for trunc
- scatter_add: vectorized ntl.atomic_add scatter
- multilabel_margin_loss: O(C^2) offsets-based + reduce_sum kernel
- fractional_max_pool2d/3d: output-driven vectorized with tail mask
                           and NaN semantics matching PyTorch 2.12

31 tests pass, covering batched/unbatched, return_indices, NaN,
output_size=1, output_ratio, negative dim, repeated destination,
and >128 element cases.

Co-Authored-By: Claude <noreply@anthropic.com>
---
 HONOR_CODE.md                               |  67 ++++++
 REFERENCE.md                                |  96 ++++++++
 src/ntops/kernels/__init__.py               |  10 +
 src/ntops/kernels/frac.py                   |  22 ++
 src/ntops/kernels/fractional_max_pool.py    | 239 ++++++++++++++++++++
 src/ntops/kernels/multilabel_margin_loss.py | 126 +++++++++++
 src/ntops/kernels/reduce_sum.py             |  51 +++++
 src/ntops/kernels/scatter_add.py            |  63 ++++++
 src/ntops/torch/__init__.py                 |   9 +
 src/ntops/torch/frac.py                     |  15 ++
 src/ntops/torch/fractional_max_pool.py      | 143 ++++++++++++
 src/ntops/torch/multilabel_margin_loss.py   |  97 ++++++++
 src/ntops/torch/scatter_add.py              | 123 ++++++++++
 tests/test_frac.py                          |  22 ++
 tests/test_fractional_max_pool.py           | 136 +++++++++++
 tests/test_multilabel_margin_loss.py        |  21 ++
 tests/test_scatter_add.py                   |  75 ++++++
 17 files changed, 1315 insertions(+)
 create mode 100644 HONOR_CODE.md
 create mode 100644 REFERENCE.md
 create mode 100644 src/ntops/kernels/frac.py
 create mode 100644 src/ntops/kernels/fractional_max_pool.py
 create mode 100644 src/ntops/kernels/multilabel_margin_loss.py
 create mode 100644 src/ntops/kernels/reduce_sum.py
 create mode 100644 src/ntops/kernels/scatter_add.py
 create mode 100644 src/ntops/torch/frac.py
 create mode 100644 src/ntops/torch/fractional_max_pool.py
 create mode 100644 src/ntops/torch/multilabel_margin_loss.py
 create mode 100644 src/ntops/torch/scatter_add.py
 create mode 100644 tests/test_frac.py
 create mode 100644 tests/test_fractional_max_pool.py
 create mode 100644 tests/test_multilabel_margin_loss.py
 create mode 100644 tests/test_scatter_add.py

diff --git a/HONOR_CODE.md b/HONOR_CODE.md
new file mode 100644
index 0000000..fa491d4
--- /dev/null
+++ b/HONOR_CODE.md
@@ -0,0 +1,67 @@
+# 2026 春季启元人工智能大赛诚信守则（Honor Code）
+
+
+本人作为 2026 春季启元人工智能大赛（以下简称"比赛"）的参赛选手，郑重承诺严格遵守比赛规则及本诚信守则，秉持诚信、公正、廉洁的参赛原则，自觉维护比赛的公平性与严肃性。本人充分理解并认可，违反本准则将导致参赛资格被取消、比赛成绩作废等相应后果，且愿意承担由此产生的一切责任。
+
+## 一、参赛诚信承诺
+
+1. 本人保证所提交的赛题PR（Pull Request）中包含的算子实现代码及相关文档，均为本人（及参赛团队，如为团队参赛）在比赛期间独立完成或在明确标注参考来源的基础上进行开发，不存在任何欺诈、抄袭、作弊行为。
+
+2. 本人承诺主动、全面、真实地披露赛题实现过程中所有参考的外部资源，尤其是开源代码资源，不隐瞒任何可能影响比赛公平性的信息。
+
+3. 本人保证不采用任何不正当手段获取比赛优势，包括但不限于窃取其他参赛选手的代码成果、利用非比赛允许的工具或技术、与他人串通作弊等。
+
+## 二、参考资源说明
+
+本人确认已按比赛要求，将本次赛题实现过程中涉及的参考资源信息单独撰写至`REFERENCE.md`文件中，该文件将与本诚信守则一同作为PR附件提交。`REFERENCE.md`需根据实际参考情况，按以下要求完整填写，信息不完整或虚假填写将视为违反本准则：
+
+**情况1：无参考外部开源代码及核心实现思路**
+
+`REFERENCE.md`中需明确声明："本次赛题提交的算子代码、核心算法逻辑及实现方案均为本人（及参赛团队）独立设计与开发，未参考任何外部开源项目、技术文档中的核心代码片段或实现思路，未接受任何第三方的技术指导或代码支持。"
+
+**情况2：有参考外部开源代码及相关资源**
+
+对每个参考资源提供以下信息陈述： 
+1. 参考开源项目/资源名称
+
+2. 参考资源链接（GitHub/Gitee/论文/技术文档等）
+
+3.  参考的具体内容（请明确说明参考的代码片段、算法逻辑、实现思路等，需标注对应资源的具体位置，如文件路径、代码行数等）
+    
+4. 本人对参考内容的修改与优化说明：（请详细说明在参考基础上，本人所做的独立开发、修改、优化工作，体现自身技术贡献）
+    
+5. 若是开源项目，提供参考资源的开源协议类型：（如MIT、Apache 2.0、GPL等）
+    
+6. 其他需要补充说明的信息
+    
+
+## 三、禁止行为确认
+
+本人明确知晓并承诺避免以下违反比赛公平性的行为，若存在以下任一情况，自愿接受比赛组委会的相应处罚：
+
+1. 未经授权复制、抄袭他人（包括其他参赛选手、开源项目、商业代码）的代码、算法或技术方案，且未进行明确标注；
+    
+2. 隐瞒或虚假披露参考资源信息，包括遗漏重要参考来源、伪造参考内容说明等；
+    
+3. 与其他参赛选手或第三方串通，进行代码共享、成果交换等违规协作；
+    
+4. 利用比赛平台漏洞、技术缺陷或非比赛允许的工具获取不正当利益；
+    
+5. 伪造比赛相关证明材料、提交虚假信息；
+    
+6. 其他违反比赛规则及公序良俗的不诚信行为。
+    
+
+## 四、责任与确认
+
+1. 本人充分理解，比赛组委会将对所有提交的PR进行代码溯源、参考信息核查等公平性审查，若发现本人存在违反本准则的行为，有权随时取消本人的参赛资格、作废比赛成绩，情节严重的将在比赛相关平台进行公示。
+
+2. 若因本人违反本准则导致比赛争议或第三方权益受损（如开源协议侵权等），本人将独立承担全部法律责任及相关损失，与比赛组委会无关。
+
+3. 本人确认已仔细阅读并完全理解本诚信守则的全部内容，自愿签署本准则，接受比赛组委会的监督与审查。
+
+## 五、签署信息
+
+参赛选手姓名：狄凡瑞
+
+签署日期：2026年5月19日
diff --git a/REFERENCE.md b/REFERENCE.md
new file mode 100644
index 0000000..a5abc06
--- /dev/null
+++ b/REFERENCE.md
@@ -0,0 +1,96 @@
+# REFERENCE.md — T1-1-9 赛题参考资源声明
+
+本次赛题提交属于**情况2**：有参考外部开源代码及相关资源。
+以下按比赛要求逐项列出每个参考资源的信息。
+
+---
+
+## 参考 1：PyTorch 开源项目
+
+1. **参考开源项目/资源名称**：PyTorch
+
+2. **参考资源链接**：https://github.com/pytorch/pytorch
+
+3. **参考的具体内容**：
+   - `aten/src/ATen/native/FractionalMaxPool2d.cpp`：fractional_max_pool2d 的 CUDA 实现，用于确认池化窗口起点计算公式（`alpha = (input_size - kernel_size) / (output_size - 1)`，`start[i] = int((i + sample) * alpha) - int(sample * alpha)` 等）、`_random_samples` 的维度和语义（shape `(N, C, 2)`，index 0 = W sample，index 1 = H sample）、最大值选择逻辑（`val > maxVal || isnan(val)`）、返回 indices 的格式（空间维扁平索引，不含 N/C）。
+   - `aten/src/ATen/native/FractionalMaxPool3d.cpp`：3D 扩展，sample 顺序（D, H, W）、flat spatial index 计算。
+   - `torch/nn/functional.py`：`multilabel_margin_loss` 的函数签名与语义。
+   - `torch/nn/modules/pooling.py`：`FractionalMaxPool2d`/`FractionalMaxPool3d` 参数约定。
+
+4. **本人对参考内容的修改与优化说明**：
+   - 池化窗口公式的核心数学逻辑（alpha 计算、起点公式、last-window 边界处理）被精确复现以保证与 PyTorch 的输出一致性。并非直接复制 PyTorch C++ 代码，而是在九齿（ninetoothed）框架内用其 DSL（`ntl.where`、`ntl.cast`、`ntl.load`、pointer 算术等）重新实现了完整的 output-driven 2D/3D kernel。
+   - PyTorch 代码中 pool_start 在 CPU 侧预计算；本实现将公式完整下沉到 GPU kernel（application 函数）内，通过 `output.offsets(dim)` 解码多维坐标后实时计算。
+   - 增加了 tail mask（`valid = (n<N)&(c<C)&(oh<Ho)&(ow<Wo)`）保护 padding lane 越界，PyTorch 使用隐式边界检查。
+   - multilabel_margin_loss 的公式通过实验反推后，在九齿内用 O(C²) offsets-based 设计实现，非直接翻译 PyTorch C++ 代码。
+
+5. **开源协议类型**：PyTorch 使用 BSD-3-Clause License。
+
+6. **其他补充说明**：无。
+
+---
+
+## 参考 2：Ninetoothed（九齿）框架
+
+1. **参考开源项目/资源名称**：ninetoothed（九齿 GPU 算子代码生成框架）
+
+2. **参考资源链接**：https://github.com/InfiniTensor/ntops（随 ntops 仓库安装为 `pip install -e .` 的依赖）
+
+3. **参考的具体内容**：
+   - `ninetoothed/generation.py`：code generator 的 `visit_Call` 方法（第 206-262 行），确认 `tensor.data_ptr()`、`tensor.offsets(dim)`、`tensor.stride(dim)` 在 application 函数中的可用性和约束。
+   - `ninetoothed/tensor.py`：Tensor 类的 `tile`、`flatten`、`ravel`、`offsets` 等方法的语义。
+   - `ninetoothed/symbol.py`：Symbol 算术运算符不继承 bounds 的行为（`__add__`、`__mul__` 等创建裸 AST 节点）。
+   - `ninetoothed/language.py`：`ntl.*` 操作通过 AST 映射到 `triton.language.*` 的机制。
+
+4. **本人对参考内容的修改与优化说明**：
+   - 阅读框架源码以确认 API 边界，属于对开发工具的正常理解，不涉及对九齿框架本身的修改。
+   - 发现了 autotuning warmup 对有副作用 kernel（atomic_add）的数据污染问题，通过固定 `block_size` 为 Python int 绕开。
+   - 发现了全 concrete shape（`Tensor(shape=(N, C, H, W))`）可消除 Symbol 算术对 autotuning 的影响。
+
+5. **开源协议类型**：随 ntops 仓库分发，ntops 使用 Apache 2.0 License。
+
+6. **其他补充说明**：九齿是本赛题指定的开发框架，使用其为正常参赛行为。
+
+---
+
+## 参考 3：Triton 语言
+
+1. **参考开源项目/资源名称**：Triton
+
+2. **参考资源链接**：https://github.com/triton-lang/triton
+
+3. **参考的具体内容**：
+   - `triton.language` 的 API 文档：`atomic_add`、`load`、`store`、`where`、`maximum`、`sum`、`cast`、`floor`、`ceil` 等原语的语义和参数。
+   - 确认 `tl.atomic_add` 支持向量化调用。
+   - 确认 `tl.trunc` 在当前版本不可用，改用 `where/floor/ceil` 实现。
+
+4. **本人对参考内容的修改与优化说明**：
+   - 九齿 DSL 层不直接暴露所有 Triton 原语（如 `atomic_add`、`load`），通过阅读 generation.py 确认 code generator 会将 `ninetoothed.language.X` 无条件映射为 `triton.language.X`，从而在 application 函数中直接使用这些原语。
+   - 这是对开发工具链的正常理解，不涉及对 Triton 本身的修改。
+
+5. **开源协议类型**：Triton 使用 MIT License。
+
+6. **其他补充说明**：Triton 是九齿的底层编译后端，使用其为正常参赛行为。
+
+---
+
+## 参考 4：Fractional MaxPooling 论文
+
+1. **参考资源名称**：Fractional Max-Pooling
+
+2. **参考资源链接**：https://arxiv.org/abs/1412.6071
+
+3. **参考的具体内容**：Benjamin Graham 提出的 fractional max pooling 算法的原始描述，包括随机步长池化的概念和基本公式。用于理解算子背景，不涉及代码参考。
+
+4. **本人对参考内容的修改与优化说明**：
+   - 仅参考算法概念。实际公式以 PyTorch 2.12 CUDA 实现为准（见参考 1），论文公式与 PyTorch 实现存在差异。
+   - 九齿 kernel 的窗口起点计算、窗口扫描逻辑、indices 计算均为独立开发。
+
+5. **开源协议类型**：学术论文，非软件项目。
+
+6. **其他补充说明**：无。
+
+---
+
+## 补充声明
+
+除上述明确标注的参考资源外，本次赛题提交的算子代码的核心算法逻辑及实现方案均为本人在比赛期间独立设计与开发。所有代码均使用九齿 DSL（arrangement/application/premake）或 Triton 语言从零编写，未直接复制任何外部项目的代码片段。
diff --git a/src/ntops/kernels/__init__.py b/src/ntops/kernels/__init__.py
index f6934ef..6e3ea8c 100644
--- a/src/ntops/kernels/__init__.py
+++ b/src/ntops/kernels/__init__.py
@@ -14,6 +14,8 @@
     dropout,
     eq,
     exp,
+    frac,
+    fractional_max_pool,
     ge,
     gelu,
     gt,
@@ -25,14 +27,17 @@
     max_pool2d,
     mm,
     mul,
+    multilabel_margin_loss,
     ne,
     neg,
     pow,
+    reduce_sum,
     relu,
     rms_norm,
     rotary_position_embedding,
     rsqrt,
     scaled_dot_product_attention,
+    scatter_add,
     sigmoid,
     silu,
     sin,
@@ -57,6 +62,8 @@
     "dropout",
     "eq",
     "exp",
+    "frac",
+    "fractional_max_pool",
     "ge",
     "gelu",
     "gt",
@@ -68,14 +75,17 @@
     "max_pool2d",
     "mm",
     "mul",
+    "multilabel_margin_loss",
     "ne",
     "neg",
     "pow",
+    "reduce_sum",
     "relu",
     "rms_norm",
     "rotary_position_embedding",
     "rsqrt",
     "scaled_dot_product_attention",
+    "scatter_add",
     "sigmoid",
     "silu",
     "sin",
diff --git a/src/ntops/kernels/frac.py b/src/ntops/kernels/frac.py
new file mode 100644
index 0000000..20a6dff
--- /dev/null
+++ b/src/ntops/kernels/frac.py
@@ -0,0 +1,22 @@
+import functools
+
+import ninetoothed.language as ntl
+from ninetoothed import Tensor
+
+from ntops.kernels.element_wise import arrangement
+
+
+def application(input, output):
+    # frac(x) = x - trunc(x)
+    # trunc(x) = floor(x) for x >= 0, ceil(x) for x < 0
+    # No tl.trunc available, so implement manually.
+    truncated = ntl.where(input >= 0, ntl.floor(input), ntl.ceil(input))
+    output = input - truncated  # noqa: F841
+
+
+def premake(ndim, dtype=None, block_size=None):
+    arrangement_ = functools.partial(arrangement, block_size=block_size)
+
+    tensors = (Tensor(ndim, dtype=dtype), Tensor(ndim, dtype=dtype))
+
+    return arrangement_, application, tensors
diff --git a/src/ntops/kernels/fractional_max_pool.py b/src/ntops/kernels/fractional_max_pool.py
new file mode 100644
index 0000000..a660544
--- /dev/null
+++ b/src/ntops/kernels/fractional_max_pool.py
@@ -0,0 +1,239 @@
+import functools
+
+import ninetoothed
+import ninetoothed.language as ntl
+from ninetoothed import Tensor
+
+
+def arrangement(input, random_samples, output, indices,
+                _kH, _kW, _Hi, _Wi, _Ho, _Wo, _C, _alpha_h, _alpha_w, _N,
+                block_size=None):
+    """Output-driven arrangement for fractional max pooling 2D.
+
+    input, random_samples: pass through as source tensors for .data_ptr()/.stride().
+    output, indices: flattened and tiled for parallel output.
+    Remaining: constexpr params passed through.
+    """
+    if block_size is None:
+        block_size = ninetoothed.block_size()
+
+    out_arr = output.flatten().tile((block_size,))
+    idx_arr = indices.flatten().tile((block_size,))
+
+    return (input, random_samples, out_arr, idx_arr,
+            _kH, _kW, _Hi, _Wi, _Ho, _Wo,
+            _C, _alpha_h, _alpha_w, _N)
+
+
+def application(input, random_samples, output, indices,
+                kH, kW, Hi, Wi, Ho, Wo, C, alpha_h, alpha_w, N):
+    """Vectorized: each lane computes one output element.
+
+    Uses per-dimension offsets from the arranged output tensor to decode
+    (n, c, oh, ow).  Loads random_samples via pointer arithmetic,
+    computes pool_start with the PyTorch 2.12 CUDA formula, scans the
+    input window, and writes max value + flat spatial index.
+    """
+    n = output.offsets(0)
+    c = output.offsets(1)
+    oh = output.offsets(2)
+    ow = output.offsets(3)
+
+    # Padding lane mask: filter out lanes beyond tensor bounds.
+    valid = (n < N) & (c < C) & (oh < Ho) & (ow < Wo)
+
+    # --- Load random samples via pointer arithmetic ---
+    rs_ptr = random_samples.data_ptr()
+    rs_str_n = random_samples.stride(0)
+    rs_str_c = random_samples.stride(1)
+    rs_str_s = random_samples.stride(2)
+
+    rs_base = rs_ptr + n * rs_str_n + c * rs_str_c
+    w_sample = ntl.load(rs_base, mask=valid, other=ntl.cast(0, ntl.float32))
+    h_sample = ntl.load(rs_base + rs_str_s, mask=valid, other=ntl.cast(0, ntl.float32))
+
+    # --- Compute pool start positions ---
+    # PyTorch 2.12 CUDA formula.
+    h_start = ntl.where(
+        oh == Ho - 1,
+        Hi - kH,
+        ntl.cast((ntl.cast(oh, ntl.float32) + h_sample) * alpha_h, ntl.int32)
+        - ntl.cast(h_sample * alpha_h, ntl.int32),
+    )
+    w_start = ntl.where(
+        ow == Wo - 1,
+        Wi - kW,
+        ntl.cast((ntl.cast(ow, ntl.float32) + w_sample) * alpha_w, ntl.int32)
+        - ntl.cast(w_sample * alpha_w, ntl.int32),
+    )
+
+    # --- Max over kH × kW window ---
+    in_ptr = input.data_ptr()
+    str_n = input.stride(0)
+    str_c = input.stride(1)
+    str_h = input.stride(2)
+    str_w = input.stride(3)
+
+    window_base = in_ptr + n * str_n + c * str_c + h_start * str_h + w_start * str_w
+
+    max_val = ntl.load(window_base, mask=valid, other=ntl.cast(float("-inf"), ntl.float32))
+    max_idx = h_start * Wi + w_start
+
+    for kh in range(kH):
+        for kw in range(kW):
+            ptr = window_base + kh * str_h + kw * str_w
+            val = ntl.load(ptr, mask=valid, other=ntl.cast(float("-inf"), ntl.float32))
+            # PyTorch 2.12 semantics: val > maxVal || isnan(val)
+            better = (val > max_val) | (val != val)
+            max_val = ntl.where(better, val, max_val)
+            max_idx = ntl.where(
+                better,
+                (h_start + kh) * Wi + (w_start + kw),
+                max_idx,
+            )
+
+    output = max_val  # noqa: F841
+    indices = max_idx  # noqa: F841
+
+
+def premake(kH, kW, H_in, W_in, H_out, W_out, C, alpha_h, alpha_w,
+            N, dtype=None, block_size=None):
+    """Create kernel factory for fractional max pooling 2D."""
+    arrangement_ = functools.partial(arrangement, block_size=block_size)
+
+    tensors = (
+        Tensor(4, shape=(N, C, H_in, W_in), dtype=dtype, other=float("-inf")),
+        Tensor(3, shape=(N, C, 2), dtype=ninetoothed.float32),
+        Tensor(4, shape=(N, C, H_out, W_out), dtype=dtype),
+        Tensor(4, shape=(N, C, H_out, W_out), dtype=ninetoothed.int64),
+        Tensor(0, constexpr=True, value=kH),
+        Tensor(0, constexpr=True, value=kW),
+        Tensor(0, constexpr=True, value=H_in),
+        Tensor(0, constexpr=True, value=W_in),
+        Tensor(0, constexpr=True, value=H_out),
+        Tensor(0, constexpr=True, value=W_out),
+        Tensor(0, constexpr=True, value=C),
+        Tensor(0, constexpr=True, value=alpha_h),
+        Tensor(0, constexpr=True, value=alpha_w),
+        Tensor(0, constexpr=True, value=N),
+    )
+
+    return arrangement_, application, tensors
+
+
+# ---- 3D fractional max pooling ------------------------------------------------
+
+
+def arrangement_3d(input, random_samples, output, indices,
+                   _kD, _kH, _kW, _Di, _Hi, _Wi, _Do, _Ho, _Wo, _C,
+                   _alpha_d, _alpha_h, _alpha_w, _N,
+                   block_size=None):
+    if block_size is None:
+        block_size = ninetoothed.block_size()
+    out_arr = output.flatten().tile((block_size,))
+    idx_arr = indices.flatten().tile((block_size,))
+    return (input, random_samples, out_arr, idx_arr,
+            _kD, _kH, _kW, _Di, _Hi, _Wi, _Do, _Ho, _Wo, _C,
+            _alpha_d, _alpha_h, _alpha_w, _N)
+
+
+def application_3d(input, random_samples, output, indices,
+                   kD, kH, kW, Di, Hi, Wi, Do, Ho, Wo, C,
+                   alpha_d, alpha_h, alpha_w, N):
+    """Vectorized fractional max pool 3D."""
+    n = output.offsets(0)
+    c = output.offsets(1)
+    od = output.offsets(2)
+    oh = output.offsets(3)
+    ow = output.offsets(4)
+
+    # Padding lane mask.
+    valid = (n < N) & (c < C) & (od < Do) & (oh < Ho) & (ow < Wo)
+
+    # Load random samples via pointer arithmetic.
+    rs_ptr = random_samples.data_ptr()
+    rs_str_n = random_samples.stride(0)
+    rs_str_c = random_samples.stride(1)
+    rs_str_s = random_samples.stride(2)
+
+    rs_base = rs_ptr + n * rs_str_n + c * rs_str_c
+    d_sample = ntl.load(rs_base, mask=valid, other=ntl.cast(0, ntl.float32))
+    h_sample = ntl.load(rs_base + rs_str_s, mask=valid, other=ntl.cast(0, ntl.float32))
+    w_sample = ntl.load(rs_base + 2 * rs_str_s, mask=valid, other=ntl.cast(0, ntl.float32))
+
+    # Compute pool start positions.
+    d_start = ntl.where(
+        od == Do - 1, Di - kD,
+        ntl.cast((ntl.cast(od, ntl.float32) + d_sample) * alpha_d, ntl.int32)
+        - ntl.cast(d_sample * alpha_d, ntl.int32),
+    )
+    h_start = ntl.where(
+        oh == Ho - 1, Hi - kH,
+        ntl.cast((ntl.cast(oh, ntl.float32) + h_sample) * alpha_h, ntl.int32)
+        - ntl.cast(h_sample * alpha_h, ntl.int32),
+    )
+    w_start = ntl.where(
+        ow == Wo - 1, Wi - kW,
+        ntl.cast((ntl.cast(ow, ntl.float32) + w_sample) * alpha_w, ntl.int32)
+        - ntl.cast(w_sample * alpha_w, ntl.int32),
+    )
+
+    # Max over kD × kH × kW window.
+    in_ptr = input.data_ptr()
+    str_n = input.stride(0)
+    str_c = input.stride(1)
+    str_d = input.stride(2)
+    str_h = input.stride(3)
+    str_w = input.stride(4)
+
+    plane_size = Hi * Wi
+    window_base = (in_ptr + n * str_n + c * str_c
+                   + d_start * str_d + h_start * str_h + w_start * str_w)
+
+    max_val = ntl.load(window_base, mask=valid, other=ntl.cast(float("-inf"), ntl.float32))
+    max_idx = d_start * plane_size + h_start * Wi + w_start
+
+    for kd in range(kD):
+        for kh in range(kH):
+            for kw in range(kW):
+                ptr = window_base + kd * str_d + kh * str_h + kw * str_w
+                val = ntl.load(ptr, mask=valid, other=ntl.cast(float("-inf"), ntl.float32))
+                better = (val > max_val) | (val != val)
+                max_val = ntl.where(better, val, max_val)
+                max_idx = ntl.where(
+                    better,
+                    (d_start + kd) * plane_size + (h_start + kh) * Wi + (w_start + kw),
+                    max_idx,
+                )
+
+    output = max_val  # noqa: F841
+    indices = max_idx  # noqa: F841
+
+
+def premake_3d(kD, kH, kW, D_in, H_in, W_in, D_out, H_out, W_out, C,
+               alpha_d, alpha_h, alpha_w, N, dtype=None, block_size=None):
+    """Create kernel factory for fractional max pooling 3D."""
+    arrangement_ = functools.partial(arrangement_3d, block_size=block_size)
+
+    tensors = (
+        Tensor(5, shape=(N, C, D_in, H_in, W_in), dtype=dtype, other=float("-inf")),
+        Tensor(3, shape=(N, C, 3), dtype=ninetoothed.float32),
+        Tensor(5, shape=(N, C, D_out, H_out, W_out), dtype=dtype),
+        Tensor(5, shape=(N, C, D_out, H_out, W_out), dtype=ninetoothed.int64),
+        Tensor(0, constexpr=True, value=kD),
+        Tensor(0, constexpr=True, value=kH),
+        Tensor(0, constexpr=True, value=kW),
+        Tensor(0, constexpr=True, value=D_in),
+        Tensor(0, constexpr=True, value=H_in),
+        Tensor(0, constexpr=True, value=W_in),
+        Tensor(0, constexpr=True, value=D_out),
+        Tensor(0, constexpr=True, value=H_out),
+        Tensor(0, constexpr=True, value=W_out),
+        Tensor(0, constexpr=True, value=C),
+        Tensor(0, constexpr=True, value=alpha_d),
+        Tensor(0, constexpr=True, value=alpha_h),
+        Tensor(0, constexpr=True, value=alpha_w),
+        Tensor(0, constexpr=True, value=N),
+    )
+
+    return arrangement_, application_3d, tensors
diff --git a/src/ntops/kernels/multilabel_margin_loss.py b/src/ntops/kernels/multilabel_margin_loss.py
new file mode 100644
index 0000000..df56055
--- /dev/null
+++ b/src/ntops/kernels/multilabel_margin_loss.py
@@ -0,0 +1,126 @@
+import functools
+
+import ninetoothed
+import ninetoothed.language as ntl
+from ninetoothed import Tensor
+
+
+def arrangement(input, target, output, _C, BLOCK_C=None):
+    """One program per batch row.  Class dim tiled by BLOCK_C.
+
+    input:  (N, C)  → tiled to (N, ceil(C/BLOCK_C)) outer, (1, BLOCK_C) inner
+    target: (N, C)  → same
+    output: (N,)    → one scalar per batch row
+    _C:     constexpr, ignored in arrangement
+    """
+    if BLOCK_C is None:
+        BLOCK_C = ninetoothed.block_size()
+
+    input_arranged = input.tile((1, BLOCK_C))
+    target_arranged = target.tile((1, BLOCK_C))
+    output_arranged = output.tile((1,))
+    output_arranged = output_arranged.ravel()
+
+    return input_arranged, target_arranged, output_arranged, _C
+
+
+def application(input, target, output, C):
+    """Compute per-sample multilabel margin loss.
+
+    Uses input.offsets(1) to obtain a class_id vector of length BLOCK_C
+    without needing ntl.arange or ntl.full.  The O(C^2) design:
+      1. First pass: walk the target prefix, build positive_mask.
+      2. Second pass: for each valid target entry, accumulate margin
+         against all negative classes.
+    """
+    class_id = input.offsets(1)                     # vector: [0, 1, ..., BLOCK_C-1]
+    class_valid = ntl.cast(class_id < C, ntl.int1)  # True for 0..C-1
+
+    zero_vec = input * ntl.cast(0, ntl.float32)
+    zero = ntl.sum(zero_vec)
+
+    # --- Pass 1: build positive_mask from target prefix ---
+    positive_mask = class_id < ntl.cast(0, ntl.int32)  # all-False vector
+    still_valid = ntl.cast(1, ntl.int1)
+
+    for j in range(C):
+        # Extract target[b, j] without dynamic indexing:
+        # t_j = sum over class_id dimension: if class_id == j, pick target, else 0
+        t_j = ntl.sum(
+            ntl.where(
+                ntl.cast(class_id == j, ntl.int1),
+                ntl.cast(target, ntl.int32),
+                ntl.cast(target * ntl.cast(0, ntl.int32), ntl.int32),
+            )
+        )
+        valid_j = still_valid & (t_j >= 0)
+        positive_mask = positive_mask | (
+            valid_j & ntl.cast(class_id == t_j, ntl.int1)
+        )
+        still_valid = still_valid & (t_j >= 0)
+
+    # --- Pass 2: compute loss per valid target entry ---
+    loss = zero
+
+    still_valid_2 = ntl.cast(1, ntl.int1)
+
+    for j in range(C):
+        t_j = ntl.sum(
+            ntl.where(
+                ntl.cast(class_id == j, ntl.int1),
+                ntl.cast(target, ntl.int32),
+                ntl.cast(target * ntl.cast(0, ntl.int32), ntl.int32),
+            )
+        )
+        valid_j = still_valid_2 & (t_j >= 0)
+
+        # pos_value: gather input at class t_j
+        pos_value = ntl.sum(
+            ntl.where(
+                class_valid & ntl.cast(class_id == t_j, ntl.int1),
+                input,
+                zero_vec,
+            )
+        )
+
+        # margin = max(0, 1 - pos_value + input[k]) for each negative class k
+        margin = ntl.cast(1, ntl.float32) - pos_value + input
+        negative = class_valid & ntl.cast(ntl.cast(positive_mask, ntl.int32) == ntl.cast(0, ntl.int32), ntl.int1)
+
+        contribution = ntl.where(
+            valid_j & negative & ntl.cast(margin > zero, ntl.int1),
+            margin,
+            zero_vec,
+        )
+        loss = loss + ntl.sum(contribution)
+
+        still_valid_2 = still_valid_2 & (t_j >= 0)
+
+    output = loss / ntl.cast(C, ntl.float32)  # noqa: F841
+
+
+def premake(C, BLOCK_C=None, dtype=None):
+    """Create kernel factory.
+
+    Args:
+        C: Python int — number of classes (concrete, avoids Symbol arithmetic).
+        BLOCK_C: next power of 2 >= C.  If None, uses ninetoothed.block_size().
+        dtype: element type (None = default float32).
+    """
+    if BLOCK_C is None:
+        BLOCK_C = ninetoothed.block_size()
+
+    arrangement_ = functools.partial(arrangement, BLOCK_C=BLOCK_C)
+
+    # N (batch) dimension needs upper_bound <= max_num_elements for autotuning.
+    # C dimension is concrete (Python int in shape) to avoid Symbol arithmetic.
+    shape_opts = ({"upper_bound": 2**14}, {})
+
+    tensors = (
+        Tensor(2, shape=(None, C), dtype=dtype, shape_options=shape_opts, other=0),
+        Tensor(2, shape=(None, C), dtype=ninetoothed.int64, shape_options=shape_opts, other=-1),
+        Tensor(1, dtype=dtype, shape_options=({"upper_bound": 2**14},)),
+        Tensor(0, constexpr=True, value=C),
+    )
+
+    return arrangement_, application, tensors
diff --git a/src/ntops/kernels/reduce_sum.py b/src/ntops/kernels/reduce_sum.py
new file mode 100644
index 0000000..dae7120
--- /dev/null
+++ b/src/ntops/kernels/reduce_sum.py
@@ -0,0 +1,51 @@
+import functools
+
+import ninetoothed
+import ninetoothed.language as ntl
+from ninetoothed import Tensor
+
+
+def arrangement(input, result, _scale, block_size=None):
+    """Tile input for parallel reduction into a scalar result.
+
+    input:  1D tensor of values to sum.
+    result: scalar tensor (0-d), passed through for .data_ptr().
+    _scale: constexpr scale factor passed through.
+    """
+    if block_size is None:
+        block_size = ninetoothed.block_size()
+
+    input_arranged = input.flatten().tile((block_size,))
+    return input_arranged, result, _scale
+
+
+def application(input, result, scale):
+    """Atomic-add input * scale into result scalar.
+
+    Each block processes a segment of input and atomically adds
+    its sum to the shared scalar result.
+    """
+    out_ptr = result.data_ptr()
+    valid = ntl.cast(input >= input * ntl.cast(0, ntl.float32), ntl.int1)  # always true mask
+    # Sum the block and scale, then atomic-add to result
+    block_sum = ntl.sum(input) * scale
+    ntl.atomic_add(out_ptr, block_sum, mask=ntl.cast(1, ntl.int1))
+
+
+def premake(dtype=None, block_size=None):
+    """Create kernel factory for scalar reduction via atomic_add.
+
+    Fixed block_size=128 and max_num_configs=1 required to prevent
+    autotuning warmup from corrupting the atomic output.
+    """
+    arrangement_ = functools.partial(arrangement, block_size=block_size)
+
+    shape_options = ({"upper_bound": 2**15},)
+
+    tensors = (
+        Tensor(1, dtype=dtype, shape_options=shape_options, other=0),
+        Tensor(1, dtype=dtype, shape_options=({'upper_bound': 1},)),
+        Tensor(0, dtype=ninetoothed.float64, constexpr=True, value=1.0),
+    )
+
+    return arrangement_, application, tensors
diff --git a/src/ntops/kernels/scatter_add.py b/src/ntops/kernels/scatter_add.py
new file mode 100644
index 0000000..59ac167
--- /dev/null
+++ b/src/ntops/kernels/scatter_add.py
@@ -0,0 +1,63 @@
+import functools
+
+import ninetoothed
+import ninetoothed.language as ntl
+from ninetoothed import Tensor
+
+
+def arrangement(flat_index, src, output, block_size=None):
+    """Arrange for scatter-add kernel.
+
+    flat_index: 1D tensor of flat destination indices in output.
+    src: 1D tensor of source values (same shape as flat_index).
+    output: source tensor (passed through so .data_ptr() works).
+
+    Each program handles a block of (flat_index, src) elements and
+    atomically adds src values to the output at the given flat indices.
+    """
+    if block_size is None:
+        block_size = ninetoothed.block_size()
+
+    # Tile index and src for parallel processing.
+    index_arranged = flat_index.flatten().tile((block_size,))
+    src_arranged = src.flatten().tile((block_size,))
+
+    # Pass output through unchanged so it remains a source tensor.
+    # This is required for .data_ptr() to work in the application.
+    return index_arranged, src_arranged, output
+
+
+def application(flat_index, src, output):
+    """Vectorized scatter-add: each lane does one atomic_add in parallel."""
+    out_ptr = output.data_ptr()
+
+    # src and output share the same dtype (ensured by the torch wrapper).
+    # Filter out padding lanes where flat_index is negative (other=-1).
+    valid = ntl.cast(flat_index >= 0, ntl.int1)
+
+    ntl.atomic_add(out_ptr + flat_index, src, mask=valid)
+
+
+def premake(dtype=None, block_size=None):
+    """Create kernel factory for scatter_add.
+
+    All tensors are 1D after the torch wrapper computes flat indices.
+
+    Note: block_size must be a fixed integer (not a meta symbol) to
+    prevent autotuning warmup from corrupting the atomic output.
+    The torch wrapper passes block_size=128 and max_num_configs=1.
+    """
+    arrangement_ = functools.partial(arrangement, block_size=block_size)
+
+    # Use upper_bound=2**15 on the 1D tensors so autotuning can
+    # reason about symbol bounds for the un-tiled output tensor.
+    # upper_bound must be <= max_num_elements (typically 32768 on CUDA).
+    shape_options = ({"upper_bound": 2**15},)
+
+    tensors = (
+        Tensor(1, dtype=ninetoothed.int64, shape_options=shape_options, other=-1),
+        Tensor(1, dtype=dtype, shape_options=shape_options, other=0),
+        Tensor(1, dtype=dtype, shape_options=shape_options),
+    )
+
+    return arrangement_, application, tensors
diff --git a/src/ntops/torch/__init__.py b/src/ntops/torch/__init__.py
index 82fc596..d3f27e1 100644
--- a/src/ntops/torch/__init__.py
+++ b/src/ntops/torch/__init__.py
@@ -13,6 +13,8 @@
 from ntops.torch.dropout import dropout
 from ntops.torch.eq import eq
 from ntops.torch.exp import exp
+from ntops.torch.frac import frac
+from ntops.torch.fractional_max_pool import fractional_max_pool2d, fractional_max_pool3d
 from ntops.torch.ge import ge
 from ntops.torch.gelu import gelu
 from ntops.torch.gt import gt
@@ -25,6 +27,7 @@
 from ntops.torch.max_pool2d import max_pool2d
 from ntops.torch.mm import mm
 from ntops.torch.mul import mul
+from ntops.torch.multilabel_margin_loss import multilabel_margin_loss
 from ntops.torch.ne import ne
 from ntops.torch.neg import neg
 from ntops.torch.pow import pow
@@ -33,6 +36,7 @@
 from ntops.torch.rotary_position_embedding import rotary_position_embedding
 from ntops.torch.rsqrt import rsqrt
 from ntops.torch.scaled_dot_product_attention import scaled_dot_product_attention
+from ntops.torch.scatter_add import scatter_add
 from ntops.torch.sigmoid import sigmoid
 from ntops.torch.silu import silu
 from ntops.torch.sin import sin
@@ -56,6 +60,9 @@
     "dropout",
     "eq",
     "exp",
+    "frac",
+    "fractional_max_pool2d",
+    "fractional_max_pool3d",
     "ge",
     "gelu",
     "gt",
@@ -68,6 +75,7 @@
     "max_pool2d",
     "mm",
     "mul",
+    "multilabel_margin_loss",
     "ne",
     "neg",
     "pow",
@@ -76,6 +84,7 @@
     "rotary_position_embedding",
     "rsqrt",
     "scaled_dot_product_attention",
+    "scatter_add",
     "sigmoid",
     "silu",
     "sin",
diff --git a/src/ntops/torch/frac.py b/src/ntops/torch/frac.py
new file mode 100644
index 0000000..4dc0986
--- /dev/null
+++ b/src/ntops/torch/frac.py
@@ -0,0 +1,15 @@
+import torch
+
+import ntops
+from ntops.torch.utils import _cached_make
+
+
+def frac(input, *, out=None):
+    if out is None:
+        out = torch.empty_like(input)
+
+    kernel = _cached_make(ntops.kernels.frac.premake, input.ndim)
+
+    kernel(input, out)
+
+    return out
diff --git a/src/ntops/torch/fractional_max_pool.py b/src/ntops/torch/fractional_max_pool.py
new file mode 100644
index 0000000..b6083d5
--- /dev/null
+++ b/src/ntops/torch/fractional_max_pool.py
@@ -0,0 +1,143 @@
+import torch
+
+import ntops
+from ntops.torch.utils import _cached_make
+
+
+def fractional_max_pool2d(
+    input, kernel_size, output_size=None, output_ratio=None,
+    return_indices=False, _random_samples=None,
+):
+    """Fractional max pooling 2D — NineToothed kernel.
+
+    The kernel computes pool start positions from random_samples
+    internally.  Everything runs on-device; no CPU pre-computation
+    of window positions.
+
+    Equivalent to torch.nn.functional.fractional_max_pool2d.
+    """
+    if isinstance(kernel_size, int):
+        kernel_size = (kernel_size, kernel_size)
+    kH, kW = kernel_size
+
+    # Validate output_size and output_ratio are mutually exclusive.
+    if output_size is None and output_ratio is None:
+        raise ValueError(
+            "fractional_max_pool2d requires either output_size or output_ratio"
+        )
+    if output_size is not None and output_ratio is not None:
+        raise ValueError(
+            "fractional_max_pool2d accepts only one of output_size, output_ratio"
+        )
+
+    # Handle unbatched (C, H, W) → (1, C, H, W).
+    unbatched = input.dim() == 3
+    if unbatched:
+        input = input.unsqueeze(0)
+
+    N, C, H_in, W_in = input.shape
+
+    if output_size is None:
+        assert output_ratio is not None
+        if isinstance(output_ratio, (int, float)):
+            output_ratio = (output_ratio, output_ratio)
+        H_out = int(H_in * output_ratio[0])
+        W_out = int(W_in * output_ratio[1])
+    else:
+        if isinstance(output_size, int):
+            output_size = (output_size, output_size)
+        H_out, W_out = output_size
+
+    # Pre-compute alpha values (float division, done once on CPU).
+    alpha_h = (H_in - kH) / (H_out - 1) if H_out > 1 else 0.0
+    alpha_w = (W_in - kW) / (W_out - 1) if W_out > 1 else 0.0
+
+    if _random_samples is None:
+        _random_samples = torch.rand(N, C, 2, device=input.device)
+    elif unbatched and _random_samples.dim() == 2:
+        _random_samples = _random_samples.unsqueeze(0)
+
+    output = torch.empty(N, C, H_out, W_out, dtype=input.dtype, device=input.device)
+    indices = torch.empty(N, C, H_out, W_out, dtype=torch.int64, device=input.device)
+
+    kernel = _cached_make(
+        ntops.kernels.fractional_max_pool.premake,
+        kH=kH, kW=kW,
+        H_in=H_in, W_in=W_in, H_out=H_out, W_out=W_out,
+        C=C, N=N, alpha_h=alpha_h, alpha_w=alpha_w,
+        block_size=64, max_num_configs=1,
+    )
+
+    kernel(input, _random_samples, output, indices,
+           kH, kW, H_in, W_in, H_out, W_out, C, alpha_h, alpha_w, N)
+
+    if unbatched:
+        output = output.squeeze(0)
+        indices = indices.squeeze(0)
+
+    if return_indices:
+        return output, indices
+    return output
+
+
+def fractional_max_pool3d(
+    input, kernel_size, output_size=None, output_ratio=None,
+    return_indices=False, _random_samples=None,
+):
+    """Fractional max pooling 3D — NineToothed kernel.
+
+    Equivalent to torch.nn.functional.fractional_max_pool3d.
+    """
+    if isinstance(kernel_size, int):
+        kernel_size = (kernel_size, kernel_size, kernel_size)
+    kD, kH, kW = kernel_size
+
+    if output_size is None and output_ratio is None:
+        raise ValueError(
+            "fractional_max_pool3d requires either output_size or output_ratio"
+        )
+    if output_size is not None and output_ratio is not None:
+        raise ValueError(
+            "fractional_max_pool3d accepts only one of output_size, output_ratio"
+        )
+
+    N, C, D_in, H_in, W_in = input.shape
+
+    if output_size is None:
+        assert output_ratio is not None
+        if isinstance(output_ratio, (int, float)):
+            output_ratio = (output_ratio, output_ratio, output_ratio)
+        D_out = int(D_in * output_ratio[0])
+        H_out = int(H_in * output_ratio[1])
+        W_out = int(W_in * output_ratio[2])
+    else:
+        if isinstance(output_size, int):
+            output_size = (output_size, output_size, output_size)
+        D_out, H_out, W_out = output_size
+
+    alpha_d = (D_in - kD) / (D_out - 1) if D_out > 1 else 0.0
+    alpha_h = (H_in - kH) / (H_out - 1) if H_out > 1 else 0.0
+    alpha_w = (W_in - kW) / (W_out - 1) if W_out > 1 else 0.0
+
+    if _random_samples is None:
+        _random_samples = torch.rand(N, C, 3, device=input.device)
+
+    output = torch.empty(N, C, D_out, H_out, W_out, dtype=input.dtype, device=input.device)
+    indices = torch.empty(N, C, D_out, H_out, W_out, dtype=torch.int64, device=input.device)
+
+    kernel = _cached_make(
+        ntops.kernels.fractional_max_pool.premake_3d,
+        kD=kD, kH=kH, kW=kW,
+        D_in=D_in, H_in=H_in, W_in=W_in,
+        D_out=D_out, H_out=H_out, W_out=W_out,
+        C=C, N=N, alpha_d=alpha_d, alpha_h=alpha_h, alpha_w=alpha_w,
+        block_size=64, max_num_configs=1,
+    )
+
+    kernel(input, _random_samples, output, indices,
+           kD, kH, kW, D_in, H_in, W_in, D_out, H_out, W_out, C,
+           alpha_d, alpha_h, alpha_w, N)
+
+    if return_indices:
+        return output, indices
+    return output
diff --git a/src/ntops/torch/multilabel_margin_loss.py b/src/ntops/torch/multilabel_margin_loss.py
new file mode 100644
index 0000000..921e00f
--- /dev/null
+++ b/src/ntops/torch/multilabel_margin_loss.py
@@ -0,0 +1,97 @@
+import math
+
+import torch
+
+import ntops
+from ntops.torch.utils import _cached_make
+
+
+def multilabel_margin_loss(input, target, reduction="mean"):
+    """Multi-label margin loss — full NineToothed kernel implementation.
+
+    Equivalent to torch.nn.functional.multilabel_margin_loss.
+
+    Args:
+        input: (C,) or (N, C) tensor of predicted values.
+        target: (C,) or (N, C) tensor of target class indices, padded with -1.
+        reduction: 'none' | 'mean' | 'sum'. Default: 'mean'.
+    """
+    _validate_args(input, target, reduction)
+
+    # Ensure input is at least 2D.
+    single_sample = input.dim() == 1
+    if single_sample:
+        input = input.unsqueeze(0)
+        target = target.unsqueeze(0)
+
+    N, C = input.shape
+    BLOCK_C = 2 ** math.ceil(math.log2(C))
+
+    # Kernel 1: per-batch losses, shape (N,).
+    output = torch.empty(N, dtype=input.dtype, device=input.device)
+
+    kernel = _cached_make(
+        ntops.kernels.multilabel_margin_loss.premake,
+        C=C,
+        BLOCK_C=BLOCK_C,
+        max_num_configs=1,
+    )
+
+    kernel(input, target, output, C)
+
+    if reduction == "none":
+        result = output
+        if single_sample:
+            result = result.squeeze(0)
+    elif N == 0:
+        # Empty batch: sum=0, mean=NaN (matching PyTorch).
+        result = torch.zeros(1, dtype=input.dtype, device=input.device)
+        if reduction == "mean":
+            result[:] = float("nan")
+    else:
+        # Kernel 2: atomic reduction into a scalar.
+        scale = 1.0 / N if reduction == "mean" else 1.0
+        result = torch.zeros(1, dtype=input.dtype, device=input.device)
+
+        reduce_kernel = _cached_make(
+            ntops.kernels.reduce_sum.premake,
+            block_size=128,
+            max_num_configs=1,
+        )
+        reduce_kernel(output, result, scale)
+
+        if single_sample:
+            result = result.squeeze()
+
+    return result
+
+
+def _validate_args(input, target, reduction):
+    """Validate input arguments match PyTorch semantics."""
+    if not isinstance(input, torch.Tensor):
+        raise TypeError(f"input must be a Tensor, got {type(input).__name__}")
+    if not isinstance(target, torch.Tensor):
+        raise TypeError(f"target must be a Tensor, got {type(target).__name__}")
+    if reduction not in ("none", "sum", "mean"):
+        raise ValueError(
+            f"reduction must be 'none', 'sum', or 'mean', got '{reduction}'"
+        )
+    if input.dim() not in (1, 2):
+        raise ValueError(
+            f"input must be 1D or 2D, got {input.dim()}D"
+        )
+    if target.dim() != input.dim():
+        raise ValueError(
+            f"target dim ({target.dim()}) must match input dim ({input.dim()})"
+        )
+    if target.dtype != torch.long:
+        raise TypeError(
+            f"target dtype must be torch.long, got {target.dtype}"
+        )
+    if target.shape != input.shape:
+        raise ValueError(
+            f"target shape {tuple(target.shape)} must match input shape {tuple(input.shape)}"
+        )
+    C = input.shape[-1]
+    if C <= 0:
+        raise ValueError(f"number of classes must be > 0, got {C}")
diff --git a/src/ntops/torch/scatter_add.py b/src/ntops/torch/scatter_add.py
new file mode 100644
index 0000000..b785d5e
--- /dev/null
+++ b/src/ntops/torch/scatter_add.py
@@ -0,0 +1,123 @@
+import torch
+
+import ntops
+from ntops.torch.utils import _cached_make
+
+
+def scatter_add(input, dim, index, src):
+    """Scatter add along a dimension.
+
+    Equivalent to input.scatter_add_(dim, index, src) but non-inplace.
+
+    Uses a NineToothed kernel with ntl.atomic_add for scatter.
+    Fixed block_size=128 and max_num_configs=1 are required because
+    autotuning warmup benchmarks would repeatedly add to the same
+    output, corrupting the atomic accumulation.
+    """
+    _validate_scatter_args(input, dim, index, src)
+
+    # Normalize dim.
+    if dim < 0:
+        dim = input.ndim + dim
+
+    if index.numel() == 0:
+        return input.clone()
+
+    # Initialize output as a clone of input — this is fresh every call,
+    # which is essential because atomic kernels are not idempotent.
+    # clone() is always contiguous, so strides match shape semantics.
+    output = input.clone()
+
+    # Slice src to match index shape if src is larger.
+    region = tuple(slice(0, s) for s in index.shape)
+    src_sliced = src[region].contiguous()
+
+    # Use output strides (correct for any layout, even if clone were non-contiguous).
+    strides = output.stride()
+
+    flat_src = src_sliced.reshape(-1)
+    flat_index = torch.zeros(index.numel(), dtype=torch.int64, device=input.device)
+
+    for d in range(input.ndim):
+        if d == dim:
+            flat_index += index.reshape(-1).to(torch.int64) * strides[d]
+        else:
+            coord_d = (
+                torch.arange(index.shape[d], device=input.device)
+                .reshape([-1 if i == d else 1 for i in range(index.ndim)])
+                .expand_as(index)
+                .reshape(-1)
+            )
+            flat_index += coord_d * strides[d]
+
+    # Use fixed block_size=128 (not auto-tuned) and max_num_configs=1
+    # to prevent autotuning warmup from corrupting atomic output.
+    kernel = _cached_make(
+        ntops.kernels.scatter_add.premake, block_size=128, max_num_configs=1
+    )
+
+    kernel(flat_index, flat_src, output)
+
+    return output
+
+
+def _validate_scatter_args(input, dim, index, src):
+    """Validate arguments match PyTorch scatter_add semantics."""
+    if not isinstance(input, torch.Tensor):
+        raise TypeError(f"input must be a Tensor, got {type(input).__name__}")
+    if not isinstance(index, torch.Tensor):
+        raise TypeError(f"index must be a Tensor, got {type(index).__name__}")
+    if not isinstance(src, torch.Tensor):
+        raise TypeError(f"src must be a Tensor, got {type(src).__name__}")
+
+    ndim = input.ndim
+    if dim < -ndim or dim >= ndim:
+        raise IndexError(
+            f"dim {dim} out of range for input of {ndim} dimensions"
+        )
+
+    if index.dtype != torch.int64:
+        raise TypeError(
+            f"index dtype must be torch.int64 (long), got {index.dtype}"
+        )
+
+    if index.ndim != ndim:
+        raise ValueError(
+            f"index ndim ({index.ndim}) must match input ndim ({ndim})"
+        )
+    if src.ndim != ndim:
+        raise ValueError(
+            f"src ndim ({src.ndim}) must match input ndim ({ndim})"
+        )
+
+    if src.device != input.device or index.device != input.device:
+        raise ValueError(
+            "input, index, and src must be on the same device"
+        )
+
+    if src.dtype != input.dtype:
+        raise TypeError(
+            f"src dtype ({src.dtype}) must match input dtype ({input.dtype})"
+        )
+
+    for d in range(ndim):
+        if d != dim and index.shape[d] > input.shape[d]:
+            raise ValueError(
+                f"index.shape[{d}] ({index.shape[d]}) must be <= "
+                f"input.shape[{d}] ({input.shape[d]})"
+            )
+        if index.shape[d] > src.shape[d]:
+            raise ValueError(
+                f"index.shape[{d}] ({index.shape[d]}) must be <= "
+                f"src.shape[{d}] ({src.shape[d]})"
+            )
+
+    if index.numel() > 0:
+        min_val = index.min().item()
+        max_val = index.max().item()
+        normalized_dim = dim if dim >= 0 else dim + ndim
+        if min_val < 0 or max_val >= input.shape[normalized_dim]:
+            raise IndexError(
+                f"index values must be in [0, {input.shape[normalized_dim]}), "
+                f"got [{min_val}, {max_val}]"
+            )
diff --git a/tests/test_frac.py b/tests/test_frac.py
new file mode 100644
index 0000000..eb4d139
--- /dev/null
+++ b/tests/test_frac.py
@@ -0,0 +1,22 @@
+import pytest
+import torch
+
+import ntops
+from tests.skippers import skip_if_cuda_not_available
+from tests.utils import generate_arguments
+
+
+# Filter out float16 since Triton floor/cast does not support it
+_argument_names, _argument_values = generate_arguments()
+_filtered_values = [v for v in _argument_values if v[1] != torch.float16]
+
+
+@skip_if_cuda_not_available
+@pytest.mark.parametrize(_argument_names, _filtered_values)
+def test_frac(shape, dtype, device, rtol, atol):
+    input = torch.randn(shape, dtype=dtype, device=device)
+
+    ninetoothed_output = ntops.torch.frac(input)
+    reference_output = torch.frac(input)
+
+    assert torch.allclose(ninetoothed_output, reference_output, rtol=rtol, atol=atol)
diff --git a/tests/test_fractional_max_pool.py b/tests/test_fractional_max_pool.py
new file mode 100644
index 0000000..4539df1
--- /dev/null
+++ b/tests/test_fractional_max_pool.py
@@ -0,0 +1,136 @@
+import pytest
+import torch
+
+import ntops
+from tests.skippers import skip_if_cuda_not_available
+
+
+@skip_if_cuda_not_available
+@pytest.mark.parametrize("dtype", [torch.float32])
+@pytest.mark.parametrize("kernel_size", [(2, 2), (3, 3)])
+@pytest.mark.parametrize("n,c,h,w,output_size", [(1, 2, 8, 8, (4, 4)), (1, 2, 10, 10, (5, 5))])
+def test_fractional_max_pool2d(n, c, h, w, kernel_size, output_size, dtype):
+    torch.manual_seed(42)
+    input = torch.randn(n, c, h, w, dtype=dtype, device="cuda")
+
+    # Use fixed random samples so both calls produce identical results
+    _random_samples = torch.rand(n, c, 2, device="cuda")
+
+    nt_out = ntops.torch.fractional_max_pool2d(
+        input, kernel_size, output_size=output_size, _random_samples=_random_samples
+    )
+    ref_out = torch.nn.functional.fractional_max_pool2d(
+        input, kernel_size, output_size=output_size, _random_samples=_random_samples
+    )
+    assert torch.allclose(nt_out, ref_out)
+
+
+@skip_if_cuda_not_available
+@pytest.mark.parametrize("dtype", [torch.float32])
+@pytest.mark.parametrize("kernel_size", [(2, 2, 2)])
+@pytest.mark.parametrize("n,c,d,h,w,output_size", [(1, 2, 8, 8, 8, (4, 4, 4))])
+def test_fractional_max_pool3d(n, c, d, h, w, kernel_size, output_size, dtype):
+    torch.manual_seed(42)
+    input = torch.randn(n, c, d, h, w, dtype=dtype, device="cuda")
+    _random_samples = torch.rand(n, c, 3, device="cuda")
+
+    nt_out = ntops.torch.fractional_max_pool3d(
+        input, kernel_size, output_size=output_size, _random_samples=_random_samples
+    )
+    ref_out = torch.nn.functional.fractional_max_pool3d(
+        input, kernel_size, output_size=output_size, _random_samples=_random_samples
+    )
+    assert torch.allclose(nt_out, ref_out)
+
+
+# ---- Edge case tests ----
+
+
+@skip_if_cuda_not_available
+def test_fractional_max_pool2d_output_size_1():
+    """output_size=1: last-window branch, alpha division skipped."""
+    torch.manual_seed(42)
+    inp = torch.randn(1, 2, 6, 6, device="cuda")
+    rs = torch.rand(1, 2, 2, device="cuda")
+    nt = ntops.torch.fractional_max_pool2d(inp, (2, 2), output_size=(1, 1), _random_samples=rs)
+    ref = torch.nn.functional.fractional_max_pool2d(inp, (2, 2), output_size=(1, 1), _random_samples=rs)
+    assert torch.allclose(nt, ref)
+
+
+@skip_if_cuda_not_available
+def test_fractional_max_pool2d_nan():
+    """NaN in input should propagate via isnan(val) check."""
+    inp = torch.zeros(1, 1, 4, 4, device="cuda")
+    inp[0, 0, 1, 1] = float("nan")
+    rs = torch.zeros(1, 1, 2, device="cuda")  # sample 0 → all windows start at 0
+    nt = ntops.torch.fractional_max_pool2d(inp, (2, 2), output_size=(2, 2), _random_samples=rs)
+    ref = torch.nn.functional.fractional_max_pool2d(inp, (2, 2), output_size=(2, 2), _random_samples=rs)
+    # NaN equality needs special handling
+    assert torch.allclose(nt[~nt.isnan()], ref[~ref.isnan()], equal_nan=True)
+    assert nt.isnan().equal(ref.isnan())
+
+
+@skip_if_cuda_not_available
+@pytest.mark.parametrize("return_indices", [False, True])
+def test_fractional_max_pool2d_return_indices(return_indices):
+    """return_indices: verify indices are flat spatial offsets."""
+    torch.manual_seed(42)
+    inp = torch.arange(1, 17, dtype=torch.float32).reshape(1, 1, 4, 4).cuda()
+    rs = torch.zeros(1, 1, 2, device="cuda")
+    result = ntops.torch.fractional_max_pool2d(
+        inp, (2, 2), output_size=(2, 2), _random_samples=rs, return_indices=return_indices
+    )
+    ref = torch.nn.functional.fractional_max_pool2d(
+        inp, (2, 2), output_size=(2, 2), _random_samples=rs, return_indices=return_indices
+    )
+    if return_indices:
+        nt_vals, nt_idx = result
+        ref_vals, ref_idx = ref
+        assert torch.equal(nt_idx, ref_idx)
+        assert torch.equal(nt_vals, ref_vals)
+    else:
+        assert torch.equal(result, ref)
+
+
+@skip_if_cuda_not_available
+def test_fractional_max_pool2d_unbatched():
+    """Unbatched input (C, H, W).  Random samples auto-unsqueezed by wrapper."""
+    torch.manual_seed(42)
+    inp = torch.randn(2, 8, 8, device="cuda")
+    rs = torch.rand(1, 2, 2, device="cuda")  # PyTorch expects (N, C, 2) even for unbatched
+    nt = ntops.torch.fractional_max_pool2d(inp, (2, 2), output_size=(4, 4), _random_samples=rs)
+    ref = torch.nn.functional.fractional_max_pool2d(inp, (2, 2), output_size=(4, 4), _random_samples=rs)
+    assert torch.allclose(nt, ref)
+
+
+@skip_if_cuda_not_available
+def test_fractional_max_pool2d_larger():
+    """Larger batch/channel to exercise tile tail."""
+    torch.manual_seed(42)
+    inp = torch.randn(3, 5, 6, 6, device="cuda")
+    rs = torch.rand(3, 5, 2, device="cuda")
+    nt = ntops.torch.fractional_max_pool2d(inp, (2, 2), output_size=(3, 5), _random_samples=rs)
+    ref = torch.nn.functional.fractional_max_pool2d(inp, (2, 2), output_size=(3, 5), _random_samples=rs)
+    assert torch.allclose(nt, ref, rtol=1e-3)
+
+
+@skip_if_cuda_not_available
+def test_fractional_max_pool2d_output_ratio():
+    """output_ratio instead of output_size."""
+    torch.manual_seed(42)
+    inp = torch.randn(1, 2, 8, 8, device="cuda")
+    rs = torch.rand(1, 2, 2, device="cuda")
+    nt = ntops.torch.fractional_max_pool2d(inp, (2, 2), output_ratio=(0.5, 0.5), _random_samples=rs)
+    ref = torch.nn.functional.fractional_max_pool2d(inp, (2, 2), output_ratio=(0.5, 0.5), _random_samples=rs)
+    assert torch.allclose(nt, ref)
+
+
+@skip_if_cuda_not_available
+def test_fractional_max_pool3d_larger():
+    """3D with different output sizes to exercise tile tail."""
+    torch.manual_seed(42)
+    inp = torch.randn(2, 3, 6, 6, 6, device="cuda")
+    rs = torch.rand(2, 3, 3, device="cuda")
+    nt = ntops.torch.fractional_max_pool3d(inp, (2, 2, 2), output_size=(3, 3, 3), _random_samples=rs)
+    ref = torch.nn.functional.fractional_max_pool3d(inp, (2, 2, 2), output_size=(3, 3, 3), _random_samples=rs)
+    assert torch.allclose(nt, ref, rtol=1e-3)
diff --git a/tests/test_multilabel_margin_loss.py b/tests/test_multilabel_margin_loss.py
new file mode 100644
index 0000000..9d6387c
--- /dev/null
+++ b/tests/test_multilabel_margin_loss.py
@@ -0,0 +1,21 @@
+import pytest
+import torch
+
+import ntops
+from tests.skippers import skip_if_cuda_not_available
+
+
+@skip_if_cuda_not_available
+@pytest.mark.parametrize("dtype", [torch.float32])
+@pytest.mark.parametrize("reduction", ["mean", "sum", "none"])
+@pytest.mark.parametrize("n_labels", [3, 5])
+def test_multilabel_margin_loss(n_labels, reduction, dtype):
+    torch.manual_seed(42)
+    batch_size = 2
+    input = torch.randn(batch_size, n_labels, dtype=dtype, device="cuda")
+    target = torch.zeros(batch_size, n_labels, dtype=torch.long, device="cuda")
+    target[:, 0] = 1  # first label is positive
+
+    nt_out = ntops.torch.multilabel_margin_loss(input, target, reduction=reduction)
+    ref_out = torch.nn.functional.multilabel_margin_loss(input, target, reduction=reduction)
+    assert torch.allclose(nt_out, ref_out)
diff --git a/tests/test_scatter_add.py b/tests/test_scatter_add.py
new file mode 100644
index 0000000..f51ac21
--- /dev/null
+++ b/tests/test_scatter_add.py
@@ -0,0 +1,75 @@
+import pytest
+import torch
+
+import ntops
+from tests.skippers import skip_if_cuda_not_available
+
+
+@skip_if_cuda_not_available
+@pytest.mark.parametrize("dtype", [torch.float32])
+@pytest.mark.parametrize("dim", [0, 1])
+@pytest.mark.parametrize(
+    "input_shape, index_shape, src_shape",
+    [
+        ((3, 5), (3, 2), (3, 2)),
+        ((4, 4), (4, 3), (4, 3)),
+    ],
+)
+def test_scatter_add(input_shape, index_shape, src_shape, dim, dtype):
+    input = torch.randn(input_shape, dtype=dtype, device="cuda")
+    src = torch.randn(src_shape, dtype=dtype, device="cuda")
+    index = torch.randint(0, input_shape[dim], index_shape, device="cuda")
+
+    nt_out = ntops.torch.scatter_add(input, dim, index, src)
+    ref_out = input.clone().scatter_add_(dim, index, src)
+    assert torch.allclose(nt_out, ref_out)
+
+
+@skip_if_cuda_not_available
+def test_scatter_add_large():
+    """>128 elements — exercises multiple kernel blocks."""
+    torch.manual_seed(42)
+    input = torch.randn(50, 10, device="cuda")  # 500 elements > 128
+    src = torch.randn(50, 10, device="cuda")
+    index = torch.randint(0, 50, (50, 10), device="cuda")
+    nt_out = ntops.torch.scatter_add(input, 0, index, src)
+    ref_out = input.clone().scatter_add_(0, index, src)
+    assert torch.allclose(nt_out, ref_out)
+
+
+@skip_if_cuda_not_available
+def test_scatter_add_repeated_dst():
+    """Multiple src elements mapping to the same output position."""
+    input = torch.zeros(5, device="cuda")
+    src = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0], device="cuda")
+    index = torch.tensor([2, 2, 2, 2, 2], device="cuda").unsqueeze(0)
+    src_2d = src.unsqueeze(0)
+    nt_out = ntops.torch.scatter_add(input.unsqueeze(0), 1, index, src_2d)
+    ref_out = input.unsqueeze(0).clone().scatter_add_(1, index, src_2d)
+    assert torch.allclose(nt_out, ref_out)
+    assert nt_out[0, 2].item() == 15.0  # 1+2+3+4+5
+
+
+@skip_if_cuda_not_available
+def test_scatter_add_repeated_calls():
+    """Multiple calls with same kernel — no autotune pollution."""
+    input = torch.randn(3, 5, device="cuda")
+    index = torch.randint(0, 3, (3, 2), device="cuda")
+    src = torch.randn(3, 2, device="cuda")
+    # Run 3 times, verify consistency
+    results = []
+    for _ in range(3):
+        results.append(ntops.torch.scatter_add(input, 0, index, src))
+    for r in results[1:]:
+        assert torch.equal(results[0], r)
+
+
+@skip_if_cuda_not_available
+def test_scatter_add_negative_dim():
+    """Negative dim normalization."""
+    input = torch.randn(3, 5, device="cuda")
+    index = torch.randint(0, 5, (3, 2), device="cuda")
+    src = torch.randn(3, 2, device="cuda")
+    nt_out = ntops.torch.scatter_add(input, -1, index, src)
+    ref_out = input.clone().scatter_add_(-1, index, src)
+    assert torch.allclose(nt_out, ref_out)