DRM Language Emitter is a causal language model built around a latent state z_t. It emits tokens sequentially, but the mechanism is not sequence attention. The central computation is a learned dynamical system over a Directional Relational Manifold.
The recurrent latent state is written as:
For the MVP, z_t is represented by a vector in R^{d_state}. This is a coordinate representation of the latent manifold, not a claim that the true geometric object is globally Euclidean.
DRMStateInitializer uses a learned initial state expanded to the batch. Prompt tokens then move the state through the DRM dynamics.
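As a minimal illustration of expanding one learned initial state to the batch, the NumPy sketch below uses a hypothetical helper `init_state`. The name, the initialization scale, and the use of NumPy instead of the project's own tensor library are assumptions made for illustration only.

```python
import numpy as np

def init_state(z0: np.ndarray, batch_size: int) -> np.ndarray:
    """Expand a single initial state of shape [d_state] to [B, d_state]."""
    # Copy so that each batch row can evolve independently afterwards.
    return np.broadcast_to(z0, (batch_size, z0.shape[0])).copy()

rng = np.random.default_rng(0)
z0 = rng.normal(size=8) * 0.02   # stands in for a learned parameter (assumed scale)
z = init_state(z0, batch_size=4)
print(z.shape)
```

Every row starts from the same vector; the rows diverge only once different prompt tokens drive the dynamics.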
DirectionField(z) returns:
- directions V(z): [B, n_directions, d_state]
- gates a(z): [B, n_directions]
- dimD(z) = sum_i a_i(z)
The directions are not orthogonalized. Optional normalization keeps their scale controlled but does not impose an orthonormal frame. The gates define an effective local active dimension.
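The shapes and the gate sum can be sketched as follows. This is a hedged NumPy illustration, not the project's DirectionField: the linear parameterization (`W_dir`, `W_gate`, `b_gate`), the sigmoid gates, and the unit-norm scaling are all assumptions.

```python
import numpy as np

def direction_field(z, W_dir, W_gate, b_gate, normalize=True, eps=1e-8):
    """Return directions V [B, n, d], gates a [B, n], and dimD [B]."""
    B, d = z.shape
    n = W_gate.shape[1]
    V = (z @ W_dir).reshape(B, n, d)              # one direction vector per slot
    if normalize:
        # Rescale each direction to unit norm; no orthogonalization is applied.
        V = V / (np.linalg.norm(V, axis=-1, keepdims=True) + eps)
    a = 1.0 / (1.0 + np.exp(-(z @ W_gate + b_gate)))  # sigmoid gates in (0, 1)
    dim_d = a.sum(axis=-1)                        # effective local active dimension
    return V, a, dim_d

rng = np.random.default_rng(0)
B, d, n = 2, 6, 3
z = rng.normal(size=(B, d))
W_dir = rng.normal(size=(d, n * d))               # hypothetical weights
W_gate = rng.normal(size=(d, n))
b_gate = np.zeros(n)
V, a, dim_d = direction_field(z, W_dir, W_gate, b_gate)
print(V.shape, a.shape, dim_d.shape)
```

With sigmoid gates, dimD(z) lies strictly between 0 and n_directions, so it reads as a soft count of active directions.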
The metric is:
It is positive definite up to the eps floor and measures energy/coupling of velocities and directions:
pairwise_coupling(z, V) computes relational coupling between learned directions under G(z).
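To make the energy and coupling quantities concrete, here is a NumPy sketch under a loudly stated assumption: the metric has the form G = diag(g) + U U^T, with g = softplus(raw) + eps. The helper names (`metric_apply`, `metric_energy`, `pairwise_coupling_sketch`) and their signatures are hypothetical and do not mirror the project's `pairwise_coupling(z, V)`.

```python
import numpy as np

def metric_apply(diag, U, x):
    """Compute G x for G = diag(diag) + U U^T without forming G. x: [B, d]."""
    Ut_x = np.einsum("bdr,bd->br", U, x)
    return diag * x + np.einsum("bdr,br->bd", U, Ut_x)

def metric_energy(diag, U, v):
    """Quadratic energy v^T G v per batch element."""
    return np.einsum("bd,bd->b", v, metric_apply(diag, U, v))

def pairwise_coupling_sketch(diag, U, V):
    """Coupling C[b, i, j] = V_i^T G V_j for directions V: [B, n, d]."""
    Ut_V = np.einsum("bdr,bnd->bnr", U, V)
    GV = diag[:, None, :] * V + np.einsum("bdr,bnr->bnd", U, Ut_V)
    return np.einsum("bid,bjd->bij", V, GV)

rng = np.random.default_rng(0)
B, d, r, n = 2, 5, 2, 3
eps = 1e-4
diag = np.log1p(np.exp(rng.normal(size=(B, d)))) + eps  # softplus plus eps floor (assumed)
U = rng.normal(size=(B, d, r))                          # low-rank factor (assumed)
V = rng.normal(size=(B, n, d))
v = rng.normal(size=(B, d))
E = metric_energy(diag, U, v)
C = pairwise_coupling_sketch(diag, U, V)
print(E.shape, C.shape)
```

Under this form the diagonal is bounded below by eps and U U^T is positive semidefinite, so the energy is strictly positive for any nonzero velocity.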
DRMFlow receives z_t, the current token embedding e_t, active directions, and gates. It emits coefficients:
The raw velocity is a gated directional combination:
Therefore the velocity belongs explicitly to the span of active directions.
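A gated directional combination can be sketched in a few lines of NumPy. The form v_raw = sum_i a_i c_i V_i is an assumption consistent with the description above; `raw_velocity` and the coefficient array `c` are illustrative names, not the project's API.

```python
import numpy as np

def raw_velocity(V, a, c):
    """v_raw[b] = sum_i a[b, i] * c[b, i] * V[b, i, :]."""
    return np.einsum("bn,bnd->bd", a * c, V)

rng = np.random.default_rng(0)
B, n, d = 2, 3, 6
V = rng.normal(size=(B, n, d))
a = np.array([[1.0, 0.5, 0.0],      # a gate of 0 switches a direction off
              [0.2, 0.0, 0.9]])
c = rng.normal(size=(B, n))         # stands in for the coefficients emitted by the flow
v_raw = raw_velocity(V, a, c)

# Span check: v_raw[0] is reproduced exactly by a least-squares fit on V[0].
w, *_ = np.linalg.lstsq(V[0].T, v_raw[0], rcond=None)
print(np.allclose(V[0].T @ w, v_raw[0]))
```

Because the velocity is built only from gated directions, a direction whose gate is zero contributes nothing, whatever coefficient the flow assigns to it.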
In the default config, the raw directional velocity is naturalized by the learned metric:
The implementation uses a damped Woodbury solve for the diagonal plus low-rank metric:
The naturalization strength is scheduled during training. This makes the metric part of the movement law while avoiding immediate over-conditioning.
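The following NumPy sketch shows one way a damped Woodbury solve works for a diagonal plus low-rank matrix. It assumes G = diag(g) + U U^T with damping added to the diagonal, a linear warm-up for the strength, and a convex blend between the raw and naturalized velocity. All three are assumptions, as are the names `woodbury_solve`, `naturalization_strength`, and `naturalize`.

```python
import numpy as np

def woodbury_solve(diag, U, v, damping=1e-3):
    """Solve (diag(diag) + damping*I + U U^T) x = v per batch element."""
    a_inv = 1.0 / (diag + damping)                        # inverse of damped diagonal, [B, d]
    Ainv_v = a_inv * v                                    # [B, d]
    Ainv_U = a_inv[:, :, None] * U                        # [B, d, r]
    r = U.shape[-1]
    S = np.eye(r) + np.einsum("bdr,bds->brs", U, Ainv_U)  # small r x r capacitance matrix
    rhs = np.einsum("bdr,bd->br", U, Ainv_v)              # [B, r]
    y = np.linalg.solve(S, rhs[..., None])[..., 0]        # [B, r]
    return Ainv_v - np.einsum("bdr,br->bd", Ainv_U, y)

def naturalization_strength(step, warmup_steps, max_strength=1.0):
    """Linear warm-up from 0 to max_strength (assumed schedule)."""
    return max_strength * min(1.0, step / warmup_steps)

def naturalize(v_raw, diag, U, strength, damping=1e-3):
    """Blend raw and metric-naturalized velocities (assumed blend form)."""
    v_nat = woodbury_solve(diag, U, v_raw, damping)
    return (1.0 - strength) * v_raw + strength * v_nat

rng = np.random.default_rng(0)
B, d, r = 2, 6, 2
diag = rng.uniform(0.5, 2.0, size=(B, d))
U = rng.normal(size=(B, d, r))
v_raw = rng.normal(size=(B, d))
x = woodbury_solve(diag, U, v_raw)
v = naturalize(v_raw, diag, U, naturalization_strength(50, 100))
print(x.shape, v.shape)
```

The solve only inverts an r x r matrix per batch element, which is the usual reason to prefer the Woodbury identity over a dense d x d solve when the rank is small.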
The state update is:
The action term is the mean metric energy of the rollout:
This does not make the model an exact geodesic solver. It biases learned trajectories toward lower action under the current learned metric.
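As a rough sketch of how a rollout and its mean metric energy fit together, the NumPy code below assumes an explicit Euler update z_{t+1} = z_t + dt * v_t and the diagonal plus low-rank metric form. The function `rollout` and its callback arguments are hypothetical; the project's actual update rule may differ.

```python
import numpy as np

def rollout(z0, velocity_fn, metric_fn, n_steps, dt=1.0):
    """Roll the state forward and return (final state, mean metric energy)."""
    z = z0
    energies = []
    for t in range(n_steps):
        v = velocity_fn(z, t)                       # [B, d]
        diag, U = metric_fn(z)                      # [B, d], [B, d, r]
        Gv = diag * v + np.einsum("bdr,br->bd", U, np.einsum("bdr,bd->br", U, v))
        energies.append(np.einsum("bd,bd->b", v, Gv))  # v^T G v per batch element
        z = z + dt * v                              # explicit Euler step (assumed)
    action = float(np.mean(np.stack(energies)))     # mean over steps and batch
    return z, action

B, d, r = 2, 4, 1
z0 = np.zeros((B, d))
v_const = np.full((B, d), 0.5)
z_T, action = rollout(
    z0,
    velocity_fn=lambda z, t: v_const,                           # toy constant velocity
    metric_fn=lambda z: (np.ones((B, d)), np.zeros((B, d, r))), # identity metric
    n_steps=3,
    dt=0.1,
)
print(z_T[0], action)
```

With an identity metric the action reduces to the mean squared speed, so penalizing it favors shorter, smoother trajectories without solving for a geodesic.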
LanguageEmitter(z) is a small MLP with RMSNorm and GELU. It maps the current latent state to vocabulary logits.
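A small MLP of this kind can be sketched in NumPy as follows. The layer order (RMSNorm, linear, GELU, linear), the tanh approximation of GELU, and the parameter names are assumptions made for illustration; the actual LanguageEmitter may be arranged differently.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Scale x so that its root mean square over the last axis is about 1."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def language_emitter(z, params):
    """Map latent state z [B, d_state] to vocabulary logits [B, vocab]."""
    h = rms_norm(z, params["norm_weight"])
    h = gelu(h @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

rng = np.random.default_rng(0)
d_state, d_hidden, vocab = 8, 16, 32
params = {                                  # hypothetical parameter layout
    "norm_weight": np.ones(d_state),
    "W1": rng.normal(size=(d_state, d_hidden)) * 0.1,
    "b1": np.zeros(d_hidden),
    "W2": rng.normal(size=(d_hidden, vocab)) * 0.1,
    "b2": np.zeros(vocab),
}
z = rng.normal(size=(4, d_state))
logits = language_emitter(z, params)
print(logits.shape)
```

The emitter reads only the current state, so everything the model knows about earlier tokens must already be encoded in z.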
For supervised language modeling, the primary loss is token cross entropy:
The training objective combines token prediction with geometric regularization:
The regularizers R_k include the active-fraction target, dimension variance, metric conditioning/diversity terms, recurrence/stability proxies, and optional risk/metric-floor penalties when enabled by config.
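One plausible shape for the combined objective is a weighted sum: cross entropy plus a weighted action term plus weighted regularizers. The sketch below assumes exactly that form; the weight names, the regularizer keys, and the numeric values are illustrative and not taken from the project's config.

```python
import numpy as np

def token_cross_entropy(logits, targets):
    """Mean negative log-likelihood. logits: [B, T, V]; targets: [B, T] of ints."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return float(-picked.mean())

def total_loss(ce, action, regularizers, weights, lambda_action):
    """ce + lambda_action * action + sum_k weights[k] * regularizers[k] (assumed form)."""
    reg = sum(weights[name] * value for name, value in regularizers.items())
    return ce + lambda_action * action + reg

logits = np.zeros((2, 3, 5))                 # uniform predictions over 5 tokens
targets = np.array([[0, 1, 2], [3, 4, 0]])
ce = token_cross_entropy(logits, targets)

loss = total_loss(
    ce=2.0,
    action=0.5,
    regularizers={"active_fraction": 0.2, "dim_variance": 0.1},  # example values
    weights={"active_fraction": 0.5, "dim_variance": 1.0},       # hypothetical weights
    lambda_action=0.1,
)
print(ce, loss)
```

Keeping each regularizer as a named entry makes it straightforward to disable a term by setting its weight to zero, which matches the idea of penalties that are optional in the config.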
Generation warms z with prompt tokens. Then it repeatedly:
- emits logits from z,
- samples the next token,
- embeds that token,
- updates z through DirectionField, RelationalMetric, and DRMFlow.
There is no attention cache.
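The generation loop can be sketched with a fixed-size state and no per-token cache. In this NumPy illustration, `step_fn` is a toy stand-in for the combined DirectionField, RelationalMetric, and DRMFlow update, and `emit_fn` stands in for the emitter. The function `generate`, the toy weights, and temperature sampling are assumptions.

```python
import numpy as np

def generate(prompt_ids, n_new, embed, step_fn, emit_fn, z0, rng, temperature=1.0):
    """Warm the state on the prompt, then sample n_new tokens one at a time."""
    z = z0.copy()
    for tok in prompt_ids:                  # warm-up: prompt tokens move the state
        z = step_fn(z, embed[tok])
    out = list(prompt_ids)
    for _ in range(n_new):
        logits = emit_fn(z) / temperature   # emit logits from z
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tok = int(rng.choice(len(probs), p=probs))  # sample the next token
        out.append(tok)
        z = step_fn(z, embed[tok])          # embed the token and update z
    return out, z

rng = np.random.default_rng(0)
vocab, d_state, d_embed = 10, 6, 4
embed = rng.normal(size=(vocab, d_embed))
W_step = rng.normal(size=(d_state + d_embed, d_state)) * 0.3  # toy dynamics weights
W_out = rng.normal(size=(d_state, vocab))                     # toy emitter weights

def step_fn(z, e):
    return z + 0.1 * np.tanh(np.concatenate([z, e]) @ W_step)

def emit_fn(z):
    return z @ W_out

tokens, z_final = generate([1, 2, 3], 5, embed, step_fn, emit_fn, np.zeros(d_state), rng)
print(len(tokens), z_final.shape)
```

The only thing carried between steps is z, whose size does not grow with the number of generated tokens.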
The project does not instantiate nn.MultiheadAttention, does not construct query/key/value projections, and does not run pairwise token attention. Sequence history is compressed into the trajectory state z_t.
A geodesic in the full DRM sense would minimize an action functional over admissible curves whose velocities remain in span(D(z)). The MVP provides a training pressure and diagnostics for low-action trajectories. It does not solve the boundary-value geodesic problem exactly.
The optional toroidal utility represents circular coordinates as (cos theta, sin theta). The code does not claim spontaneous toroidal convergence. Such a claim would require boundedness, recurrence, structural stability, and empirical diagnostics.
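The (cos theta, sin theta) representation can be illustrated with two small NumPy helpers. The names `to_torus` and `from_torus` are hypothetical and are not claimed to match the project's utility.

```python
import numpy as np

def to_torus(theta):
    """Map angles [..., k] to (cos, sin) pairs [..., k, 2]."""
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

def from_torus(cs):
    """Recover angles in (-pi, pi] from (cos, sin) pairs."""
    return np.arctan2(cs[..., 1], cs[..., 0])

theta = np.array([0.0, 1.0, -2.0])
pairs = to_torus(theta)
print(pairs.shape)
print(np.allclose(pairs, to_torus(theta + 2 * np.pi)))
```

The embedding is continuous across the wrap-around point, so angles that differ by a full turn map to the same coordinates, which a raw scalar angle cannot guarantee.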