fdat_rd (Fast Dual-Attention Transformer, Rectangular + Dictionary) is an
evolution of FDAT aimed primarily, though not exclusively, at animation
super-resolution (DVD/LD → BluRay), targeting TensorRT-static deployment via
traiNNer-redux. It keeps every FDAT deployment invariant (static ONNX export,
tensor-core alignment, reshape-over-view, no data-dependent control flow) while
considerably improving recovery on the hardest frames (thin and blurred
lineart, sub-pixel chroma in tight structures, degraded eye/hair curves),
where spatial information is difficult to parse.
FDAT - https://github.com/stinkybread/FDAT
└─ rectangular alternating windows (anisotropic lineart prior, resolution-aligned)
└─ token-dictionary cross-attention (global learned structure prior)
└─ FDAT-RD
FDAT (https://github.com/stinkybread/FDAT) is the body this builds on: residual groups alternating a windowed
spatial attention and a transposed channel (MDTA-style) attention, each
fused with a depthwise-conv branch through the AIM gate and followed by a
depthwise-mix FFN, with an unshuffle frontend and a UniUpsampleV3 tail. It is
fast and exports cleanly, but two limitations surface on the hardest ~10% of
frames:
- Square windows are isotropic. The extended thin structures that define cel art are poorly served by a square box, and the box rarely tiles the production feature grid, so inference wastes compute on window padding.
- All attention is local or per-channel. No path lets a pixel query semantically related structure elsewhere in the frame — and that selectivity, not receptive field, is what separates strong restorers on hard cases.
FDAT-RD addresses each with one change.
Spatial blocks use a rectangular window (default 10×30) whose orientation
alternates 10×30 / 30×10 across consecutive spatial blocks, giving each
position long-axis coverage along both axes over a pair of blocks — an
anisotropic prior toward extended line structure. The window also tiles the
720×540 resolution pyramid exactly (everything is a multiple of 30), so
production inference is window-aligned. A single upfront reflect-pad to a
multiple of unshuffle × lcm(split_size) keeps the body aligned at any input
size, so the per-block attention pad is always a no-op and there is no
boundary seam.
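The alignment arithmetic can be sketched in a few lines. This is a minimal illustration, not the repository's code: the helper name `aligned_size` is hypothetical, it only computes how much padding is needed (not where the reflect-pad is applied), and the 480×640 input is an assumed example.

```python
from math import lcm


def aligned_size(h, w, unshuffle=2, split_size=(10, 30)):
    """Return the padded (H, W) and the (pad_h, pad_w) needed to reach it."""
    # After unshuffle the feature map is (h / unshuffle, w / unshuffle); it must
    # be divisible by both window sides, i.e. by lcm(split_size).
    multiple = unshuffle * lcm(*split_size)  # 2 * 30 = 60 with the defaults
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    return (h + pad_h, w + pad_w), (pad_h, pad_w)


print(aligned_size(540, 720))  # ((540, 720), (0, 0)): production size is already aligned
print(aligned_size(480, 640))  # ((480, 660), (0, 20)): hypothetical input that needs padding
```

Because the padded size is a multiple of 60, the unshuffled feature map is a multiple of 30 on both axes, so both the 10×30 and the 30×10 orientation tile it without any per-block padding.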
A third block type carries a fixed-size learned dictionary of typical cel
structures and lets every pixel query it: the dictionary first refines against
the current image, then image features are enhanced against the refined
dictionary. This is the deployable subset of ATD — fixed-M cross-attention
only, no dynamic category grouping — so it stays static-shape and O(N·M)
linear, hence TRT-clean. It is the architectural generalization of a hand-built
edge/lineart prior: instead of coding the prior, the dictionary learns the
structure bank from the data. Dictionary blocks are lean (no conv/AIM branch);
the global prior is the point and is not diluted with redundant local mixing.
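The two-step flow (dictionary refines against the image, then the image queries the refined dictionary) can be sketched with NumPy. This is a simplified illustration under stated assumptions: single head, no learned projections or normalization, plain residual connections, and a random matrix standing in for the learned dictionary. The function name `dictionary_block` is hypothetical and the real block may differ in all of these details.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dictionary_block(x, d):
    """x: (N, C) image tokens, d: (M, C) dictionary. Returns (enhanced x, refined d)."""
    scale = x.shape[-1] ** -0.5
    # Step 1: dictionary tokens query the image -> (M, N) attention map.
    d_ref = d + softmax(d @ x.T * scale) @ x
    # Step 2: image tokens query the refined dictionary -> (N, M) attention map.
    x_out = x + softmax(x @ d_ref.T * scale) @ d_ref
    return x_out, d_ref


rng = np.random.default_rng(0)
n, m, c = 60 * 60, 128, 32  # tokens of a 60x60 feature map, M = 128, one 32-wide head
x = rng.standard_normal((n, c)).astype(np.float32)
d = rng.standard_normal((m, c)).astype(np.float32)  # stand-in for the learned dictionary
x_out, d_ref = dictionary_block(x, d)
print(x_out.shape, d_ref.shape)  # (3600, 32) (128, 32)
```

Both attention maps have N·M entries and M is fixed at build time, which is why the cost grows linearly with the number of pixels and every tensor shape is static.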
A Swin-style relative-position-bias table replacing the dense (nh, N, N)
spatial bias (~98% fewer params, translation-invariant) was tested and
reverted: the dense bias's extra capacity was doing real work on the hard
cases, and the spatial blocks feed the dictionary, so starving them measurably
hurt detail recovery at matched iterations. Parameter efficiency is not the
objective; tail performance is. The dense bias stays.
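For scale, the parameter difference can be worked out directly. The sketch below assumes one bias set per spatial block, the medium variant's 4 heads, and the usual Swin table size of (2h−1)·(2w−1) entries per head; the helper names are illustrative.

```python
def dense_bias_params(heads, window):
    h, w = window
    n = h * w                 # tokens per window
    return heads * n * n      # one bias per (head, query, key) triple


def relative_table_params(heads, window):
    h, w = window
    return heads * (2 * h - 1) * (2 * w - 1)  # one bias per (head, dy, dx) offset


heads, window = 4, (10, 30)   # medium variant, default split_size
dense = dense_bias_params(heads, window)
table = relative_table_params(heads, window)
print(dense, table, f"{1 - table / dense:.1%} fewer")  # 360000 4484 98.8% fewer
```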
- SDPA (FlashAttention / mem-efficient) in the spatial and dictionary attention — identical math to manual softmax, lower training VRAM and faster, with the gain scaling as crops grow. The standalone converter and the spandrel arch use manual softmax for the cleanest export graph; weights are identical.
- Reflect → replicate pad fallback so tiny inputs don't crash.
- Aligned dimensions: `head_dim` 32, dictionary size `M` a multiple of 32.
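The "identical math" claim for SDPA can be checked with a small PyTorch comparison. This is a sketch, not the model's attention module: the tensor sizes are illustrative, the bias is random, and it relies on `torch.nn.functional.scaled_dot_product_attention` (PyTorch 2.0+) adding a float `attn_mask` to the scaled scores before the softmax.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, heads, n, head_dim = 2, 4, 300, 32           # 300 = 10x30 window tokens
q, k, v = (torch.randn(b, heads, n, head_dim) for _ in range(3))
bias = torch.randn(heads, n, n)                 # stand-in for the dense (nh, N, N) bias

# Fused path: the float attn_mask is added to the scaled scores before softmax.
fused = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)

# Manual path: explicit matmul, bias add, softmax, weighted sum.
scores = q @ k.transpose(-2, -1) * head_dim ** -0.5 + bias
manual = scores.softmax(dim=-1) @ v

print(torch.allclose(fused, manual, atol=1e-4))
```

The two paths differ only by floating-point rounding, which is why weights trained with SDPA load unchanged into a manual-softmax export graph.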
| variant | embed_dim | heads | groups | M | blocks/group |
|---|---|---|---|---|---|
| `fdat_rd_small` | 96 | 3 | 4 | 64 | 6 |
| `fdat_rd_medium` | 128 | 4 | 4 | 128 | 6 |
| `fdat_rd_large` | 192 | 6 | 6 | 256 | 6 |
fdat_rd_aligned is an alias of medium. All use split_size (10,30),
group_block_pattern [spatial, channel, dictionary], and optional
use_checkpoint (train-only, no-op at eval/export).
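The variant settings can be checked against the alignment rules (head_dim 32, M a multiple of 32) with a short script. The dictionary layout and key names below are illustrative, not the package's config schema, and the way the three-entry pattern fills six blocks (simple cycling) is an assumption.

```python
VARIANTS = {
    "fdat_rd_small":  dict(embed_dim=96,  heads=3, groups=4, M=64,  blocks=6),
    "fdat_rd_medium": dict(embed_dim=128, heads=4, groups=4, M=128, blocks=6),
    "fdat_rd_large":  dict(embed_dim=192, heads=6, groups=6, M=256, blocks=6),
}
VARIANTS["fdat_rd_aligned"] = VARIANTS["fdat_rd_medium"]  # alias of medium
PATTERN = ["spatial", "channel", "dictionary"]

summary = {}
for name, cfg in VARIANTS.items():
    assert cfg["embed_dim"] % cfg["heads"] == 0
    head_dim = cfg["embed_dim"] // cfg["heads"]
    assert head_dim == 32 and cfg["M"] % 32 == 0  # tensor-core friendly sizes
    # Assumption: the pattern simply repeats until the group is full.
    layout = [PATTERN[i % len(PATTERN)] for i in range(cfg["blocks"])]
    summary[name] = (head_dim, layout.count("dictionary") * cfg["groups"])
    print(name, head_dim, layout)
```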
- `split_size (10,30)` → lcm 30. With 2× + unshuffle (feature = lq/2), train at `lq_size 120` (feature 60×60, divisible by both 10 and 30 → zero window padding).
- The window token count `10×30 = 300` is not a multiple of 8, so TRT tile-pads the sequence dim; this is a known, accepted trade for the anisotropic prior.
- `use_checkpoint` is the VRAM lever when scaling to larger crops or `large`.
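The crop-size arithmetic in these notes can be reproduced with a small helper. The function name `window_stats` and the 128 counter-example are hypothetical; the sketch assumes the 2× + unshuffle setup, where the feature map is half the LQ crop.

```python
def window_stats(lq_size, unshuffle=2, split_size=(10, 30)):
    """Return (feature side, zero-padding?, windows per crop, tokens per window, tokens % 8)."""
    feat = lq_size // unshuffle
    wh, ww = split_size
    zero_pad = lq_size % unshuffle == 0 and feat % wh == 0 and feat % ww == 0
    tokens = wh * ww
    # The window count is the same for the 10x30 and 30x10 orientation on a square crop.
    windows = (feat // wh) * (feat // ww) if zero_pad else None
    return feat, zero_pad, windows, tokens, tokens % 8


print(window_stats(120))  # (60, True, 12, 300, 4)
print(window_stats(128))  # (64, False, None, 300, 4)
```

The last value is the remainder of the token count modulo 8; it is non-zero for any crop size because it depends only on the window shape.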
```yaml
network_g:
  type: fdat_rd_medium
  scale: 2
  unshuffle_mod: true
  use_checkpoint: true  # optional
# lq_size: 120
```

- ONNX/TRT: `fdat_rd_converter.py` (standalone, autodetects all dims; pass `--split-size 10 30`). Verify tiny first, then export at production size on GPU:

  ```
  python fdat_rd_converter.py model.safetensors -f onnx-static --input-size 60 60
  python fdat_rd_converter.py model.safetensors -f onnx-static --input-size 540 720 --device cuda --no-verify
  ```
- chaiNNer / spandrel: `fdat_rd.py` (arch) + `__init__.py` (`FDATRDArch`), drop into `spandrel_extra_arches`. Detection keys on the dictionary parameter, so it is distinct from `fdat2`/`dat2rt2`. Two notes:
  - Register `FDATRDArch` before `FDATArch`. A `fdat_rd` checkpoint also satisfies FDAT's current detect (its first block is spatial), so first-match order must put `fdat_rd` ahead, or add a `not has dictionary` guard to FDAT's detect.
  - The `split_size` factorization is not stored in weights; the spandrel `load` assumes the standard `(10, 30)`. Add a registered buffer if you ever vary it.
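Why registration order matters can be shown with a toy first-match registry. Everything here is hypothetical: the state-dict key names, the detect functions, and the registry are stand-ins written for illustration and do not use spandrel's actual API or the real checkpoint layout.

```python
def detect_fdat(state_dict):
    # Hypothetical stand-in for FDAT's detect: only checks that the first block is spatial.
    return any(k.startswith("groups.0.blocks.0.attn.") for k in state_dict)


def detect_fdat_rd(state_dict):
    # Hypothetical: FDAT-RD is recognized by the presence of a dictionary parameter.
    return any(k.endswith(".dictionary") for k in state_dict)


def first_match(state_dict, registry):
    """Return the name of the first architecture whose detect accepts the checkpoint."""
    for name, detect in registry:
        if detect(state_dict):
            return name
    return None


# Made-up key names, used only to exercise the two detectors.
rd_ckpt = {"groups.0.blocks.0.attn.bias": 0, "groups.0.blocks.2.dictionary": 0}
fdat_ckpt = {"groups.0.blocks.0.attn.bias": 0}

good = [("FDATRDArch", detect_fdat_rd), ("FDATArch", detect_fdat)]
bad = [("FDATArch", detect_fdat), ("FDATRDArch", detect_fdat_rd)]

print(first_match(rd_ckpt, good))    # FDATRDArch
print(first_match(rd_ckpt, bad))     # FDATArch (misdetected)
print(first_match(fdat_ckpt, good))  # FDATArch
```

Adding a "no dictionary key" condition to the FDAT detector would make the result independent of order, which is the alternative guard mentioned in the note.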
Faster and lighter to train, clearly better on hard-case crops, at a modest inference cost:
- training throughput up
- training VRAM down
- 9.8 FPS at 720×540 inference with the medium variant, unshuffle mod on, on an RTX 4080
- clearly improved detail recovery across crops (thin lineart, eye/hair)
- ~20% slower at inference: the price of the 300-token rectangular windows, whose token count is not a multiple of 8; best evaluated against models at the same latency tier rather than against FDAT directly.