Slot-keyed rule dictionary consumed on demand by ops-design.md and the scaffold-op skill. Each ### Slot S{N} entry states the authoritative Rule, its manifest Derivation, a concrete Example modelled on tileops/ops/reduction/cumsum.py, and Common mistakes. Non-slot content lives in the appendices. Slot IDs S8–S11 are intentionally absent (reserved during iteration for T1 thin-wrapper slots later declared out of scope).
- Rule. File begins with a triple-quoted docstring. The first paragraph is a short module-level summary (e.g., "Cumulative sum operator (L2 Op layer)."), optionally followed by a `Provides:` bullet block listing the concrete op classes with one-line semantics per class (`<ClassName>: <one-line semantics>`).
- Derivation. Class name from S6; semantics templated from manifest `ref_api` and `signature`.
- Example.

      """Cumulative sum operator (L2 Op layer).

      Provides:
      - CumsumFwdOp: y = cumsum(x, dim=-1)
      """

- Common mistakes. Referencing tile sizes or kernel internals in the module docstring; omitting the one-line purpose.
- Rule. Import `Kernel` whenever `kernel_map` typing is annotated.
- Derivation. Fixed import path.
- Example.

      from tileops.kernels.kernel_base import Kernel

- Common mistakes. Aliasing the import; re-exporting `Kernel`.
- Rule. One absolute import from `tileops.kernels.*` per Kernel class listed in the manifest `kernel_map`.
- Derivation. Manifest `kernel_map` values.
- Example.

      from tileops.kernels.reduction.cumulative import CumulativeKernel

- Common mistakes. Relative cross-package import; importing a kernel not in `kernel_map`.
- Rule. Relative import of the L1 `Op` base class.
- Derivation. Fixed: `from ..op_base import Op` (or `from .op_base import Op` for ops directly under `tileops/ops/`).
- Example.

      from ..op_base import Op

- Common mistakes. Absolute `tileops.ops.op_base` import, which violates the relative-import rule in `.claude/rules/code-style.md`.
- Rule. `__all__` contains exactly the concrete op class name (S6).
- Derivation. `[<ClassName>]`.
- Example.

      __all__ = ["CumsumFwdOp"]

- Common mistakes. Re-exporting the `Kernel` class; omitting `__all__`.
- Rule. `{PascalCaseName}{Direction}Op`, `Direction` ∈ {`Fwd`, `Bwd`}, no exceptions. Manifest entry key must equal `cls.__name__` verbatim.
- Derivation. Manifest entry key.
- Example.

      class CumsumFwdOp(Op):

- Common mistakes. Direction suffix missing; abbreviation mis-casing (see Naming Conventions (Appendix)).
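
The class-name rule is mechanical enough to check with a regular expression. The following is a minimal sketch rather than the project's validator; the helper name `is_valid_op_class_name` and the exact pattern are assumptions made for illustration.

```python
import re

# Hypothetical pattern: PascalCase stem, a mandatory Fwd/Bwd direction, then "Op".
_OP_NAME_RE = re.compile(r"^[A-Z][A-Za-z0-9]*(Fwd|Bwd)Op$")


def is_valid_op_class_name(name: str) -> bool:
    """Return True if `name` follows {PascalCaseName}{Direction}Op."""
    return _OP_NAME_RE.fullmatch(name) is not None


print(is_valid_op_class_name("CumsumFwdOp"))  # direction suffix present
print(is_valid_op_class_name("CumsumOp"))     # direction suffix missing
```

A check of this kind only covers the suffix and casing of the first letter; it cannot tell whether an abbreviation such as `RMS` is cased correctly.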
- Rule. One-sentence summary; `Args:` block enumerating every `__init__` kwarg (S12) with type and short description; optional `Example:` block.
- Derivation. `Args` block from manifest `signature.params` + `static_dims` + `dtype`.
- Example.

      class CumsumFwdOp(Op):
          """Cumulative sum operator: y = cumsum(x, dim=-1).

          Output has the same shape and dtype as input.

          Args:
              M: Number of rows (product of all dims except the reduction axis).
              N: Hidden dimension (size along the reduction axis).
              dtype: Data type (float32, float16, or bfloat16).
              dim: Reduction dimension (default -1).
              kernel_map: Optional override for kernel dispatch.
              tune: Whether to autotune (default False).
          """

- Common mistakes. Args out of sync with `__init__`; listing tensor inputs in `Args` (they belong to `forward`).
- Rule. Keyword-only via `*`. Kwarg block order: (1) `static_dims` entries in manifest key order, no defaults; (2) `dtype`; (3) `signature.params` entries in manifest key order; (4) `kernel_map` and `tune` last.
- Derivation. Manifest `static_dims` + `dtype` + `signature.params`.
- Example.

      def __init__(
          self,
          *,
          M: int,
          N: int,
          dtype: torch.dtype,
          dim: int = -1,
          kernel_map: Optional[Dict[str, Kernel]] = None,
          tune: bool = False,
      ):

- Common mistakes. Missing `*` (positional accepted); `static_dims` kwargs with defaults; params/static_dims block order inverted; kwargs not backed by a manifest source.
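
The kwarg block order can be derived mechanically from a manifest entry. The sketch below assumes the entry is available as a plain dict with `static_dims` and `signature.params` mappings; that in-memory layout and the helper name `init_kwarg_order` are hypothetical, and the real manifest schema may differ.

```python
from typing import Any, Dict, List


def init_kwarg_order(entry: Dict[str, Any]) -> List[str]:
    """Return __init__ keyword names in the block order the rule prescribes.

    `entry` is a hypothetical in-memory manifest entry, not the real schema.
    """
    static_dims = list(entry.get("static_dims", {}))             # (1) manifest key order
    params = list(entry.get("signature", {}).get("params", {}))  # (3) manifest key order
    # (2) dtype sits between the two blocks; (4) kernel_map and tune come last.
    return [*static_dims, "dtype", *params, "kernel_map", "tune"]


# Illustrative entry, not copied from the real manifest.
cumsum_entry = {
    "static_dims": {"N": "x.shape[dim]"},
    "signature": {"params": {"dim": {"type": "int", "default": -1}}},
}
print(init_kwarg_order(cumsum_entry))
```

Python dicts preserve insertion order, so "manifest key order" is retained as long as the loader does not re-sort keys.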
- Rule. Body sequence: (a) `self.<name> = <name>` per kwarg; (b) `self.dispatch_kernel(kernel_map)`; then branch by op shape:
  - Fully-static op (all non-static axes committed at ctor): (c-static) `self.kernel = self.kernel_map[<key>](...)`, so the kernel is built once at init; (d-static) optionally precompute `self._infer_output_shapes(<input>_shape=(...))` eagerly if a caller needs the output shapes before `forward()`. The `Op` base class does not currently consume an `_output_shapes` attribute; do not introduce one unless a concrete consumer requires it.
  - Arbitrary-rank op (at least one axis unknown until forward): (c-dyn) initialise `self._kernel_cache: Dict[Hashable, Kernel] = {}` (the cache key follows `Op._cache_key`'s `Hashable` return type: often a tuple, but overrides may return `int` or other hashables) and defer kernel construction to `forward()` keyed by `self._cache_key(*input_shapes)`; (d-dyn) defer `_infer_output_shapes` to `forward()` per unique input shape.
- Derivation. Each `self.*` assignment mirrors one S12 kwarg. Kernel-build positional args follow the kernel class's ctor (kernel author's API). "Fully-static" iff every `signature.inputs` shape axis is either a manifest `shape` dim name or a `static_dims` key resolvable at ctor; otherwise arbitrary-rank and the deferred branch applies.
- Example (arbitrary-rank; `CumsumFwdOp`).

      self.N = N
      self.dtype = dtype
      self.dim = dim
      self.tune = tune
      self.N_padded = align_up(N, DEFAULT_ALIGNMENT)
      self.dispatch_kernel(kernel_map)
      # M unknown at init (only N committed via static_dims); kernel
      # is built lazily in forward() once M is derived.
      self._kernel_cache: Dict[Hashable, Kernel] = {}

- Common mistakes. `_infer_output_shapes` called before `dispatch_kernel`; hard-coding the kernel class instead of routing through `self.kernel_map`; building the kernel in `__init__` for an arbitrary-rank op (fails when a non-static axis value is required by the kernel ctor); omitting `self._kernel_cache` initialisation for the deferred branch (first forward-time cache lookup raises `AttributeError`).
- Rule. `@property` returning the manifest `kernel_map` dict literal with `snake_case` keys and Kernel-class values.
- Derivation. Manifest `kernel_map`, verbatim.
- Example.

      @property
      def default_kernel_map(self) -> Dict[str, Kernel]:
          return {"cumulative_fwd": CumulativeKernel}

- Common mistakes. Class-level dict (not a property); keys that duplicate the class name instead of being dispatch strings.
- Rule. Positional tensor parameters in manifest `signature.inputs` order; return annotation `torch.Tensor` or `Tuple[torch.Tensor, ...]` matching `signature.outputs`.
- Derivation. Manifest `signature.inputs` for names; `signature.outputs` for return annotation.
- Example.

      def forward(self, x: torch.Tensor) -> torch.Tensor:

- Common mistakes. Keyword-only tensor parameters; non-tensor kwargs in `forward` (they belong to `__init__`).
- Rule. Body sequence: (a) `self._validate_dtypes(...)`; (b) validate `shape_rules` (e.g. `-x.ndim <= dim < x.ndim`) and normalise parameter-dependent axes via modulo (e.g. `dim = self.dim % x.ndim`); (c) validate each `static_dims` commitment (`x.shape[<resolved_axis>] == self.<kwarg>`); (d) for arbitrary-rank ops, bind `self._static_axes = frozenset({(input_index, resolved_axis)})` and look up / lazily build the kernel in `self._kernel_cache` keyed by `self._cache_key(*input_shapes)`; (e) `.contiguous()` + reshape to the kernel's expected 2D layout; (f) call the kernel; (g) trim alignment padding (if any) and restore the original shape. Fully-static ops skip the cache-lookup part of (d) since `self.kernel` was built at init.
- Derivation. Validation expressions come from each `static_dims` entry's `<tensor>.shape[<axis>]` RHS; axis normalisation mirrors the param evaluation in `static_dims` + `shape_rules`; kernel cache key is whatever `_cache_key` projects (default: tuple of non-static-axis sizes). Padding trim applies when the kernel operates on `align_up(N, DEFAULT_ALIGNMENT)` (`self.N_padded != self.N`).
- Example (arbitrary-rank; `CumsumFwdOp`).

      self._validate_dtypes(x)
      if not x.is_cuda:
          raise ValueError("x must be a CUDA tensor")
      if not -x.ndim <= self.dim < x.ndim:
          raise ValueError(f"dim {self.dim} out of range for x.ndim={x.ndim}")
      dim = self.dim % x.ndim
      if x.shape[dim] != self.N:
          raise ValueError(
              f"static_dim mismatch: expected x.shape[{dim}] == {self.N}, "
              f"got {x.shape[dim]}"
          )
      self._static_axes = frozenset({(0, dim)})
      M = math.prod(s for i, s in enumerate(x.shape) if i != dim)
      self.M = M
      # default _cache_key projects non-static axes; override for coarser
      # keying when kernel math permits (see Optional Hooks appendix).
      key = self._cache_key(x.shape)
      if key not in self._kernel_cache:
          self._kernel_cache[key] = self.kernel_map["cumulative_fwd"](
              M, self.N, "sum", self.dtype, tune=self.tune
          )
      kernel = self._kernel_cache[key]
      orig_shape = x.shape
      x2 = x.movedim(dim, -1).contiguous().reshape(M, self.N)
      y2 = kernel(x2)
      if self.N_padded != self.N:
          y2 = y2[:, : self.N]
      y = y2.reshape(*orig_shape[:dim], *orig_shape[dim + 1 :], self.N)
      return y.movedim(-1, dim)

- Common mistakes. Skipping `_validate_dtypes`; reshape before `.contiguous()`; hard-coding `x.shape[-1]` instead of the normalised `x.shape[self.dim % x.ndim]`; binding `self._static_axes` before the axis is non-negative (violates the `Op._static_axes` contract); forgetting the kernel cache lookup so every forward rebuilds the kernel; forgetting the padding trim when `self.N_padded != self.N` (causes `reshape(orig_shape)` to raise on size mismatch); not restoring the original shape.
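
Steps (b) through (d) of the forward body can be exercised without a GPU by working on shape tuples alone. The sketch below is a simplified stand-in, not the library code: `LazyKernelCache` and its `build` factory are hypothetical, and it keys the cache on the product of the non-static axes rather than calling the real `_cache_key`.

```python
import math
from typing import Callable, Dict, Hashable, Tuple


class LazyKernelCache:
    """Stand-in for the deferred branch: validate, normalise, then build once per key."""

    def __init__(self, N: int, dim: int, build: Callable[[int, int], object]):
        self.N = N
        self.dim = dim
        self._build = build  # hypothetical kernel factory taking (M, N)
        self._kernel_cache: Dict[Hashable, object] = {}

    def lookup(self, x_shape: Tuple[int, ...]) -> object:
        ndim = len(x_shape)
        # (b) range-check before the modulo, otherwise out-of-range dims wrap silently.
        if not -ndim <= self.dim < ndim:
            raise ValueError(f"dim {self.dim} out of range for ndim={ndim}")
        dim = self.dim % ndim
        # (c) validate the static_dims commitment on the normalised axis.
        if x_shape[dim] != self.N:
            raise ValueError(f"expected x_shape[{dim}] == {self.N}, got {x_shape[dim]}")
        # (d) key on the product of the non-static axes; build on first encounter only.
        M = math.prod(s for i, s in enumerate(x_shape) if i != dim)
        key = (M,)
        if key not in self._kernel_cache:
            self._kernel_cache[key] = self._build(M, self.N)
        return self._kernel_cache[key]


builds = []
cache = LazyKernelCache(N=8, dim=-1, build=lambda M, N: builds.append((M, N)) or (M, N))
cache.lookup((2, 3, 8))
cache.lookup((6, 8))  # same M, so the cached entry is reused
cache.lookup((4, 8))  # new M, so a second build happens
print(builds)
```

The `builds` list records one entry per actual construction, which makes the "forgetting the kernel cache lookup" mistake easy to spot in a test.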
- Rule. Signature takes `<input>_shape: tuple` per manifest `signature.inputs`, returns `Dict[str, tuple]` keyed by output name. The L1 base raises `NotImplementedError` as a `FIXME(staged-rollout)` stub; each concrete op supplies a complete body. PR #1005's validator exercises the method with mock inputs in CI and reports disagreement with `shape_rules` as a hard L2 error.
- Derivation. Manifest `shape_rules` (see manifest.md § Rules).
- Example.

      def _infer_output_shapes(self, x_shape: tuple) -> Dict[str, tuple]:
          return {"y": x_shape}

- Common mistakes. Shape tuple disagreeing with `shape_rules` (hard L2 error); accepting/returning `torch.Tensor` instead of shape tuples; demoting an op to `status: spec-only` to silence a genuine disagreement (only legitimate when the impl truly does not conform).
- Rule. Positional parameters match `signature.inputs`; raises `ValueError` on invalid dtype combinations. L1 stub raises `NotImplementedError` (`FIXME` staged-rollout). PR #1005's validator exhaustively probes `dtype_combos` / declared unions plus out-of-union negatives and reports divergence as a hard L3 error.
- Derivation. Manifest `dtype` (union) and `dtype_combos`.
- Example.

      def _validate_dtypes(self, x: torch.Tensor) -> None:
          if x.dtype not in {torch.float32, torch.float16, torch.bfloat16}:
              raise ValueError(f"x.dtype must be float32/float16/bfloat16, got {x.dtype}")

- Common mistakes. Accepting a dtype outside the declared union; rejecting a dtype listed in `dtype_combos`; ignoring `same_as(ref)` linkage between inputs.
- Rule. Codegen emits a complete plain-Python body reading `self.*` attributes. Per `roofline.md` §4.4.6 (Evaluator Surface Boundary) there is NO shared AST evaluator on L1 and NO class-level roofline expression strings (e.g. `_flops_str`, `_bytes_str`, `_roofline_vars`) that would be parsed at runtime. L1 stub raises `NotImplementedError` (`FIXME` staged-rollout).
- Derivation. Manifest `roofline.vars`, `roofline.flops`, `roofline.bytes`; see `roofline.md` §4.4.
- Example.

      def eval_roofline(self) -> tuple[int, int]:
          flops = 4 * self.M * self.N
          bytes_ = (2 * self.M * self.N + self.N) * self.dtype.itemsize
          return flops, bytes_

- Common mistakes. Class-level roofline expression strings parsed at runtime (prohibited by §4.4.6); any `ast.parse` or shared `_safe_eval` path; returning `float` or `numpy` types (contract is `tuple[int, int]`).
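
To make the `tuple[int, int]` contract concrete, here is a small worked instance of the same formulas on stand-in objects. `FakeDtype` and `RooflineSketch` are hypothetical names used only so the arithmetic runs without `torch`; the formulas are the ones from the slot's example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FakeDtype:
    """Stand-in for torch.dtype; only `itemsize` (bytes per element) is needed here."""
    itemsize: int


@dataclass
class RooflineSketch:
    M: int
    N: int
    dtype: FakeDtype

    def eval_roofline(self) -> Tuple[int, int]:
        # Plain integer arithmetic on attributes; no expression strings are parsed.
        flops = 4 * self.M * self.N
        bytes_ = (2 * self.M * self.N + self.N) * self.dtype.itemsize
        return flops, bytes_


# M=4 rows, N=8 columns, 2-byte elements (e.g. a half-precision type).
flops, bytes_ = RooflineSketch(M=4, N=8, dtype=FakeDtype(itemsize=2)).eval_roofline()
print(flops, bytes_)  # 128 144
```

Both values stay Python `int` because every operand is an `int`; introducing a division or a `numpy` scalar anywhere in the body would break the contract.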
- Rule. `tileops/ops/{family}/__init__.py` gains one `from .<module> import <ClassName>` line plus a matching `<ClassName>` entry in `__all__`, placed under the family's grouping comment block.
- Derivation. Class name (S6) and the op's source filename.
- Example.

      # --- CumulativeKernel ops ---
      from .cumsum import CumsumFwdOp

- Common mistakes. Import outside its family grouping comment; missing `__all__` entry (silently breaks `import *`).
- Rule. Each concrete op declares `_static_axes: frozenset[tuple[int, int]]` of `(input_index, axis)` pairs, where `input_index` is the positional index in `signature.inputs` and `axis` is a non-negative integer within that input's shape. The commitment happens at one of two points:
  - Ctor time, as a class-level literal, when every axis can be resolved to a non-negative integer without knowing runtime rank (e.g., manifest declares `static_dims: M: "x.shape[0]"`).
  - `forward()` time, with an empty class-level default, when at least one axis depends on runtime rank, most commonly a ctor param that may be negative (e.g., `static_dims: N: "x.shape[dim]"` with `dim` defaulting to `-1`). At forward, the concrete op normalises the axis (`dim % x.ndim`), then assigns `self._static_axes = frozenset({(i, <resolved_axis>)})`. Equivalently the op may override `_cache_key` and project the shape inline without ever populating `_static_axes`.

  Empty frozenset is legal as the class-level default (means "no axes committed yet"). Negative axes MUST NOT be stored in `_static_axes` without prior normalisation; the `Op` base class relies on non-negative indexing into `*input_shapes`.
- Derivation. Manifest `static_dims`; for each entry `<kwarg>: <tensor>.shape[<axis>]`:
  - If `<axis>` is resolvable to a non-negative integer literal at class-definition time → emit class-level `_static_axes = frozenset({(input_index_of_<tensor>, <axis>)})`.
  - If `<axis>` is a ctor param name, or is written as a negative literal whose normalised value depends on runtime rank → emit `_static_axes = frozenset()` at class level and assign `self._static_axes = frozenset({(i, <param> % x.ndim)})` inside `forward()` after the `static_dims` commitment check, or override `_cache_key` to project inline.

  PyTorch-aligned reductions with `dim=None` → empty frozenset (see manifest.md § Empty static_dims).
- Example.

      class CumsumFwdOp(Op):
          # static_dims: N: "x.shape[dim]"; the axis is parameter-dependent
          # (and dim may be negative), so the concrete (input_index, axis)
          # pair is resolved at forward() time after dim % x.ndim
          # normalization. Class-level default is empty.
          _static_axes: frozenset[tuple[int, int]] = frozenset()

- Common mistakes. Omitting `_static_axes` entirely when `static_dims` is non-empty (relies on `Op`'s empty default, silently disables static-axis projection in `_cache_key`); emitting a literal `(input_index, axis)` pair when `axis` is actually a ctor param (produces a wrong axis under arbitrary rank); binding `self._static_axes` inside `__init__` when the axis comes from a param (`x.ndim` is not known yet, so a negative `dim` cannot be normalized; bind at `forward()` instead); storing a negative axis (must be non-negative per `op_base.py`); empty `_static_axes` without overriding `_cache_key` (emits a once-per-type `UserWarning`; see Optional Hooks (Appendix)).
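
The reason negative axes must be normalised before they are stored becomes visible once the pairs are used to project a shape. The sketch below is an assumption about how such a projection could look, not a copy of `Op._cache_key`; `bind_static_axis` and `project_non_static` are hypothetical helper names, and the real key layout may differ.

```python
from typing import FrozenSet, Hashable, Tuple

StaticAxes = FrozenSet[Tuple[int, int]]


def bind_static_axis(input_index: int, dim: int, ndim: int) -> StaticAxes:
    """Normalise a possibly negative axis, then return the (input_index, axis) pair."""
    if not -ndim <= dim < ndim:
        raise ValueError(f"dim {dim} out of range for ndim={ndim}")
    return frozenset({(input_index, dim % ndim)})


def project_non_static(static_axes: StaticAxes, *input_shapes: Tuple[int, ...]) -> Hashable:
    """Hypothetical projection: keep only sizes whose (input_index, axis) is not static."""
    return tuple(
        tuple(s for axis, s in enumerate(shape) if (i, axis) not in static_axes)
        for i, shape in enumerate(input_shapes)
    )


axes = bind_static_axis(0, -1, ndim=3)
print(axes)                                                   # the pair (0, 2)
print(project_non_static(axes, (2, 3, 8)))                    # static axis dropped
print(project_non_static(frozenset({(0, -1)}), (2, 3, 8)))    # un-normalised: nothing dropped
```

With the un-normalised pair `(0, -1)` no enumerated axis index ever matches, so the static axis silently stays in the key.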
Per-family protocol variables, declared by L2 bases and overridden by L3 ops.
| Variable | Family | Purpose |
|---|---|---|
| `_kernel_key` | norm, reduction | Kernel-map lookup key |
| `_kernel_cls` | norm, reduction | Kernel class reference |
| `_op_kind` | reduction, scan | Kernel-dispatch op-kind string (`"sum"` / `"prod"` for `CumulativeOp`; `"sum"`, `"mean"`, … for `_ReduceOpBase`) |
| `_kernel_handles_padding` | reduction | `True` → kernel uses masked loads, skip host-side padding |
| `_op_name` | elementwise | `torch.library.custom_op` registration key |
| `kernel_cls` | elementwise | Kernel class reference |
The scaffold-op skill does NOT emit these variables: their values depend on each family's kernel-dispatch convention (e.g., `VectorNormKernel` uses `{"l1", "l2", "inf"}`, `ReduceKernel` uses `{"sum", "mean", ...}`), so they are filled in during family-specific refactoring (future skill). Adding a new protocol variable requires updating the L2 base, all concrete ops, and the manifest schema if applicable.
Op base class attributes (tileops/ops/op_base.py)
| Attribute | Type | Purpose |
|---|---|---|
| `kernel` | `Kernel` | Kernel instance used by `forward()` |
| `kernel_map` | `Optional[Dict[str, Kernel]]` | Dispatched kernels keyed by name |
| `dtype` | `Optional[torch.dtype]` | Computation dtype |
| `device` | `Optional[Union[torch.device, str]]` | Device (default `'cuda'`) |
| `input_shapes` | `Optional[list[tuple]]` | Expected input tensor shapes (for introspection and non-runtime consumers) |
| `_static_axes` | `frozenset[tuple[int, int]]` | Static axes as `(input_index, axis)` pairs (default `frozenset()`); consumed by `_cache_key` |
Abstract interface: default_kernel_map (property), forward(). Manifest-driven methods (codegen-emitted by concrete ops): _infer_output_shapes, _validate_dtypes, eval_roofline.
Kernel base class attributes (tileops/kernels/kernel_base.py)
| Attribute | Type | Purpose |
|---|---|---|
| `dtype` | `Optional[torch.dtype]` | Data type |
| `config` | `Dict[str, Any]` | Tile configuration (block sizes, stages, etc.) |
| `autotune_configs` | `Optional[list[dict]]` | Search space for autotuning |
| `supported_archs` | `Optional[list[int]]` | GPU SM versions (e.g., `[80, 86, 89, 90]`) |
| `kernel` | `Callable` | Compiled TileLang kernel function |
Abstract interface: forward(). Key methods: init_config(config, tune), autotune(warmup, rep).
Hooks family bases expose for op-specific semantics. The scaffold-op skill does NOT emit these.
| Hook | Family | Default | Override example |
|---|---|---|---|
| `_pad_value()` | reduction | `0.0` (neutral for sum) | `ArgmaxFwdOp._pad_value` → `-inf` (`tileops/ops/reduction/argmax.py:61`) |
| `_validate_dim()` | reduction | accept `int` or `list[int]` | `ArgmaxFwdOp._validate_dim` restricts to scalar `int` |
| `_pre_kernel()` | reduction | identity | `AllFwdOp._pre_kernel` converts unsupported storage dtypes to fp32 |
| `_post_kernel()` | reduction | identity | Convert kernel output dtype to the manifest-declared output dtype |
`Op._cache_key(self, *input_shapes) -> Hashable` defaults to projecting non-static axes via `self._static_axes`. Override when the kernel's math permits coarser keying; e.g., RMSNorm only depends on the non-static axis product M:

    import math

    class RMSNormFwdOp(Op):
        def _cache_key(self, x_shape):
            # Key on M, the product of every axis except the normalised dim.
            dim = self.dim % len(x_shape)
            return (math.prod(s for i, s in enumerate(x_shape) if i != dim),)

When `_static_axes` is empty, override is mandatory: the default keys by the full input shape (one kernel compile per distinct shape). The base emits a once-per-type `UserWarning` when invoked with empty `_static_axes` and no subclass override.
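
The effect of coarser keying can be quantified by counting distinct keys over a batch of input shapes. This is a sketch with free functions instead of an `Op` subclass; `full_shape_key` and `product_key` are hypothetical names standing in for the default-with-empty-`_static_axes` behaviour and an M-only override respectively.

```python
import math
from typing import Tuple


def full_shape_key(x_shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Keying by the whole input shape: one entry per distinct shape."""
    return tuple(x_shape)


def product_key(x_shape: Tuple[int, ...], dim: int) -> Tuple[int]:
    """Coarser keying: only the product M of the non-reduction axes matters."""
    d = dim % len(x_shape)
    return (math.prod(s for i, s in enumerate(x_shape) if i != d),)


shapes = [(2, 3, 8), (3, 2, 8), (6, 8), (4, 8)]
print(len({full_shape_key(s) for s in shapes}))   # 4 distinct keys
print(len({product_key(s, -1) for s in shapes}))  # 2 distinct keys (M = 6 and M = 4)
```

Each distinct key corresponds to one kernel build, so in this toy batch the coarser key halves the number of compiles.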
- Op class: `{PascalCaseName}{Direction}Op`. `Direction` ∈ {`Fwd`, `Bwd`}, mandatory. Manifest key must equal `cls.__name__`. Abbreviation casing: `RMSNormFwdOp`, `SSDDecodeOp` (abbreviations fully uppercase per `.claude/rules/code-style.md`). Slot S6.
- Kernel class: `{PascalCaseName}{Direction}Kernel`. Same direction-suffix rule.
- `kernel_map` keys: `snake_case`, decoupled from Kernel class names. Values must match the Kernel `cls.__name__`. The table does not describe dispatch strategy. Slot S14.
- Builder functions: `snake_case`, e.g. `def rms_norm_fwd(M, N, dtype, ...): ...`.
- Filenames: all-lowercase with underscores. Multi-word abbreviations stay fully lowercase (`rms_norm.py`, `ssd_decode.py`; never `RMSNorm.py` or `Ssd_decode.py`). Norm-related names never contract (`rms_norm`, not `rmsnorm`).
The manifest (tileops/manifest/) is the sole source of truth. Dtype validation and shape inference derive from manifest; roofline codegen is defined in roofline.md.
Three time points: (1) manifest — constraint structure; (2) __init__ — user commits static_dims values; (3) forward — shapes concrete, commitments validated. See manifest.md § static_dims.
| | Fixed-rank op | Arbitrary-rank op |
|---|---|---|
| Manifest has `shape` | yes | no |
| `__init__` shape source | `shape` dimension names | `static_dims` |
| Undeclared dimensions | none | derived from tensor at forward time |
| Kernel construction time | init (all dims known) | init (`static_dims` known) or forward (first encounter, cached) |
| Forward cache keying | N/A (single kernel) | `_cache_key(*input_shapes)`: default non-static axes, overridable |
- Fully static op: `_infer_output_shapes` called once in `__init__`, result stored as an instance attribute.
- Op with dynamic dims: `_infer_output_shapes` called in `forward()` once dynamic dims resolve; kernel construction cached by `_cache_key(*input_shapes)`.
- `_validate_dtypes`: runs on every `forward()` call.
- Non-runtime consumers (validator, graph compiler): call `_infer_output_shapes` with concrete shape tuples without constructing tensors. Roofline consumers use interfaces in `roofline.md`.
| Scenario | Codegen method defined at | Concrete op action |
|---|---|---|
| Family shares logic | L2 family base | Inherits, no override |
| Family member has variant logic (e.g., multi-output) | L3 concrete op | Overrides |
| Op inherits L1 directly (T2) | L3 concrete op | Scaffold emits body |
| Check | Mechanism |
|---|---|
| Manifest schema and declared fields are well-formed | Validator (CI), L0 checks |
| `__init__` params match manifest params | Validator signature check (L1) |
| `static_dims` keys are `__init__` parameters | Validator signature check (L1) |
| `shape_rules` syntax is valid | Validator `shape_rules` parsing (L2) |
| `_infer_output_shapes` output satisfies `shape_rules` | Validator infer-shape parity (L2; PR #1005) |
| `dtype`/`dtype_combos` strings are valid | Validator dtype conformance (L3) |
| `_validate_dtypes` matches `dtype_combos` / dtype unions | Validator dtype parity (L3; PR #1005) |
| Empty `static_dims` without `_cache_key` override | `Op` base class runtime warning |
Checks beyond this table are tracked as separate issues, not as spec status.
Parity check coverage. The L2 / L3 parity checks compare the manifest spec against the concrete method the op class defines. When the class has not migrated to the codegen protocol, the validator emits a warning naming the missing method — the gap is surfaced, never silently passed. When the method exists, the parity check runs and any disagreement is a hard L2 / L3 error. Ops whose method genuinely cannot be invoked in a CPU-only validator context must declare status: spec-only; there is no parity opt-out, and demotion is only legitimate when the implementation truly does not conform.
Pragmatic sequence:
- New op inherits L1 directly (T2). When a family has 1-2 ops, the op owns its full `forward()`. Transitional state.
- Family accumulates ops. When 2-3 ops share identical `forward()` flow, extract an L2 family base.
- L1-direct and L1→L2→L3 coexist. L1-direct ops are candidates for future L2 extraction, not an alternative design.
Create an L2 family base when multiple ops share the same forward() control flow, the shared boilerplate is substantial, and per-op differences fit into class variables or hooks. Do NOT create one when only 1 op uses the pattern, ops share math but differ in flow, or a common base would need excessive if/else.
- Implement 2-3 concrete T2 ops to understand the pattern before abstracting.
- Identify shared `forward()` steps.
- Extract shared steps into the base; lift per-op differences into class variables or overridable hooks (see Family-Base Protocol (Appendix) and Optional Hooks (Appendix)).
- Migrate existing ops; verify tests pass unchanged.
- Register any new protocol variables in the Family-Base Protocol table.
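
The extraction steps above can be pictured with a toy family base. This is a pure-Python sketch under stated assumptions: `ToyScanBase`, `ToyCumsumFwdOp` and `ToyCumprodFwdOp` are hypothetical classes operating on lists, and `_op_kind` and `_pad_value` are borrowed from the protocol and hook tables only to show where per-op differences end up.

```python
from typing import List


class ToyScanBase:
    """Hypothetical L2 base: owns the shared forward() flow."""

    _op_kind: str = "sum"  # per-op difference expressed as a class variable

    def _pad_value(self) -> float:  # per-op difference expressed as a hook
        return 0.0

    def forward(self, row: List[float], padded_len: int) -> List[float]:
        # Shared steps: pad to the aligned length, run the scan, trim the padding.
        padded = row + [self._pad_value()] * (padded_len - len(row))
        return self._scan(padded)[: len(row)]

    def _scan(self, values: List[float]) -> List[float]:
        acc = 0.0 if self._op_kind == "sum" else 1.0
        out = []
        for v in values:
            acc = acc + v if self._op_kind == "sum" else acc * v
            out.append(acc)
        return out


class ToyCumsumFwdOp(ToyScanBase):
    _op_kind = "sum"


class ToyCumprodFwdOp(ToyScanBase):
    _op_kind = "prod"

    def _pad_value(self) -> float:
        return 1.0  # neutral element for a product


print(ToyCumsumFwdOp().forward([1.0, 2.0, 3.0], padded_len=4))   # [1.0, 3.0, 6.0]
print(ToyCumprodFwdOp().forward([1.0, 2.0, 3.0], padded_len=4))  # [1.0, 2.0, 6.0]
```

Neither subclass overrides `forward()`, which is the signal that the shared flow was extracted at the right level; if a subclass needed its own `forward()`, the ops would differ in flow and a common base would not fit.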