Mozart rolls a dice to Bach and Ramanujan
2026-03-16 · Author: toooold.com


The Elegance of Mozart’s Attention Mechanism

In 1792, Mozart’s Musikalisches Würfelspiel (Musical Dice Game), K.516f, was published. The system is deceptively simple: 176 pre-composed musical measures arranged in a grid of 11 rows (one per dice total, 2 through 12) by 16 columns (one per bar). The user rolls two six-sided dice ($2d6$) 16 times. Each roll selects the measure for that column in the grid, so the 16 rolls trace one of $11^{16}$ possible paths through the grid and produce a 16-bar minuet.

From an LLM mechanistic interpretability standpoint, the beauty of Mozart’s game is that it is a strictly autoregressive, discrete-token generator with a context window of zero.

In a standard Large Language Model (LLM), predicting the next token $x_t$ relies on the conditional probability of the entire past sequence:

\[P(x_t | x_1, x_2, \dots, x_{t-1})\]

Mozart bypassed the need for this computational overhead. In K.516f, the choice of Measure 3 has zero statistical dependence on Measure 2. The generation is completely memoryless. Instead, the model’s “attention” is 100% focused on its absolute positional encoding (the step $t$):

\[P(x_t | \text{position } t, \text{dice roll})\]
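The memoryless sampling described here can be sketched in a few lines of Python. This is only an illustration: `GRID` is a hypothetical placeholder that numbers the 176 measures column by column, not the arrangement of the published K.516f table, and the names `roll_2d6` and `generate_minuet` are invented for this example.

```python
import random

NUM_COLUMNS = 16       # one column per bar of the minuet
ROLLS = range(2, 13)   # possible 2d6 totals: 2..12 (11 rows)

# Hypothetical placeholder grid: GRID[roll][column] -> measure number (1..176).
# The real table uses a specific published arrangement not reproduced here.
GRID = {
    roll: [col * len(ROLLS) + i + 1 for col in range(NUM_COLUMNS)]
    for i, roll in enumerate(ROLLS)
}

def roll_2d6(rng):
    """Sum of two fair six-sided dice."""
    return rng.randint(1, 6) + rng.randint(1, 6)

def generate_minuet(rng):
    """Pick one measure per column.

    Each choice depends only on the position t and the current roll,
    never on the measures chosen earlier.
    """
    return [GRID[roll_2d6(rng)][t] for t in range(NUM_COLUMNS)]

minuet = generate_minuet(random.Random(0))
print(minuet)
print("paths through the grid:", len(ROLLS) ** NUM_COLUMNS)
```

Note that `generate_minuet` never reads its own partial output, which is exactly the "context window of zero" property.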

How does it remain harmonically coherent without context? Mozart engineered the matrix as an aggressive, hardcoded attention mask. He ensured that every possible measure at $t$ smoothly resolves into every possible measure at $t+1$. Any dissonant, harmonically invalid transition was manually assigned a $-\infty$ pre-softmax penalty by the composer, effectively masking it out of the latent space.
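To make the masking analogy concrete, here is a minimal sketch of how a $-\infty$ logit removes an option under softmax. The game itself contains no softmax; the logits and the `allowed` flags below are made-up values used only to illustrate the mechanism the analogy borrows from Transformers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for four candidate transitions; the third is "forbidden".
logits = [1.0, 1.0, 1.0, 1.0]
allowed = [True, True, False, True]

# Masking: forbidden options get a logit of -inf before the softmax.
masked = [x if ok else -math.inf for x, ok in zip(logits, allowed)]
probs = softmax(masked)
print(probs)  # the masked option receives probability exactly 0.0
```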

Furthermore, the $2d6$ sampling acts as a physical temperature parameter. By using a triangular probability distribution ($P(7) = 6/36 \approx 16.7\%$, $P(2) = 1/36 \approx 2.8\%$) rather than a uniform one, Mozart lowered the entropy of the system. He statistically biased the model to generate the most “standard” harmonic progressions, reserving high-surprise edge cases for the extreme tails of the distribution.
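The entropy claim is easy to check numerically. The sketch below enumerates all 36 outcomes of two fair dice and compares the Shannon entropy of the resulting totals with that of a uniform choice among the same 11 totals; the helper name `entropy_bits` is just for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
import math

# Count how many of the 36 equally likely (die1, die2) pairs give each total.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
probs = {total: Fraction(c, 36) for total, c in sorted(counts.items())}

def entropy_bits(ps):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(float(p) * math.log2(float(p)) for p in ps)

h_dice = entropy_bits(probs.values())
h_uniform = math.log2(len(probs))  # uniform over the 11 totals 2..12

print(f"P(7) = {probs[7]}, P(2) = {probs[2]}")
print(f"2d6 entropy:     {h_dice:.3f} bits")
print(f"uniform entropy: {h_uniform:.3f} bits")
```

The dice distribution comes out at roughly 3.27 bits per roll against roughly 3.46 bits for a uniform pick, so the bias toward the middle rows is real but modest.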

Unifying the Grid: The Ramanujan Sum

If we were to code Mozart’s game today, we would use a simple for loop to force the piece to stop at $t=16$. But why does a 16-measure grid feel psychologically and harmonically complete? To understand this, we must look beyond the grid itself and apply the number-theoretic mathematics of Srinivasa Ramanujan.

Ramanujan would not view Mozart’s matrix as a set of rules, but rather as the natural resonant frequency of a periodic equation. We can model the macro-structure of the minuet using a Ramanujan Sum ($c_q(n)$), which extracts periodic signals from noise:

\[c_q(n) = \sum_{\substack{1 \le a \le q \\ \gcd(a,q)=1}} e^{2\pi i \frac{a}{q} n}\]

By setting the fundamental period $q = 16$, the equation acts as a harmonic pendulum. Here is how Mozart’s attention mechanism unifies with Ramanujan’s math:

The Journey ($n = 1$ to $15$): As the measures progress, the complex exponentials point in various directions in the complex plane, causing destructive interference. For $q = 16$ the cancellation is total: the sum is exactly $0$ at every such $n$ except $n = 8$. Musically, this represents harmonic tension: the algorithmic wave is wandering through the latent space, seeking resolution.

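The Half-Cadence ($n = 8$): When we reach the halfway point, the fraction simplifies to $\frac{a}{2}$. Every $a$ coprime to 16 is odd, so each term equals $e^{i\pi a} = -1$: the vectors snap to the negative real axis and $c_{16}(8) = -8$. This momentary alignment, equal in magnitude but opposite in sign to the final resolution, mirrors the structural “half-cadence” in classical phrasing.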

The Resolution ($n = 16$): At the final measure, the fraction simplifies to an integer. Every term in the sum points in the exact same direction ($e^{2\pi i a} = 1$), so $c_{16}(16) = \varphi(16) = 8$, the number of terms in the sum. The destructive interference vanishes into a massive spike of constructive interference.

The structure doesn’t resolve because of an arbitrary grid boundary; it resolves because $q=16$ ($2^4$, the fractal symmetry of classical phrasing) is the fundamental node where the equation naturally reaches maximum constructive harmony. Mozart’s positional attention mechanism is simply the geometric projection of this periodic equation.
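The behaviour of $c_{16}(n)$ across the 16 bars can be verified directly from the definition. The following sketch evaluates the sum with complex exponentials and rounds away floating-point noise (the exact value is always an integer); the function name `ramanujan_sum` is chosen for this example.

```python
import cmath
from math import gcd

def ramanujan_sum(q, n):
    """c_q(n): sum of exp(2*pi*i*a*n/q) over 1 <= a <= q with gcd(a, q) == 1."""
    total = sum(
        cmath.exp(2j * cmath.pi * a * n / q)
        for a in range(1, q + 1)
        if gcd(a, q) == 1
    )
    # The exact sum is a real integer; rounding removes floating-point residue.
    return round(total.real)

values = {n: ramanujan_sum(16, n) for n in range(1, 17)}
for n, c in values.items():
    print(f"n = {n:2d}  c_16(n) = {c:+d}")
# c_16(8) = -8, c_16(16) = +8, and every other n in 1..15 gives 0
```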

Expanding Dimensions: Bach’s Deep Self-Attention

If Mozart’s dice game is a rigid, 1D loop locked to $q=16$, Johann Sebastian Bach’s fugues (such as those in The Well-Tempered Clavier; “fugue” has a beautiful Chinese name, 赋格) represent the expansion of this mathematical framework into high-dimensional, deep-memory architectures. A fugue cannot be generated by a zero-context sampler like Mozart’s dice game. It begins with a single “prompt” token sequence: the Subject. When the second voice enters, it must continuously look back at the Subject to generate valid counterpoint.

In LLM terminology, Bach implemented Multi-Head Self-Attention.

Each voice (Soprano, Alto, Tenor, Bass) acts as an independent attention head. They process the exact same context window but project it into different dimensional spaces. While Mozart relied on stochastic dice (sampling), Bach relied on deterministic linear algebra. The initial Subject vector is subjected to complex matrix transformations in the latent space:

  • Transposition (Translation in pitch: $f(t) + c$)
  • Inversion (Reflection in pitch: $-f(t)$)
  • Augmentation/Diminution (Time Scaling: $f(t/2)$ or $f(2t)$)
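The three transformations above can be sketched as plain functions on a note list. The subject below is a made-up four-note example, not a quotation from any Bach fugue, and the representation (MIDI pitch, duration in beats) is an assumption for illustration. Inversion here reflects pitches about an axis pitch, which is the practical form of $-f(t)$ once a reference pitch is chosen.

```python
# A subject as (MIDI pitch, duration in beats). Made-up example notes.
SUBJECT = [(60, 1.0), (62, 0.5), (64, 0.5), (67, 2.0)]

def transpose(notes, semitones):
    """Translation in pitch: shift every pitch by a constant."""
    return [(pitch + semitones, dur) for pitch, dur in notes]

def invert(notes, axis=None):
    """Reflection in pitch about `axis` (defaults to the first pitch)."""
    if axis is None:
        axis = notes[0][0]
    return [(2 * axis - pitch, dur) for pitch, dur in notes]

def scale_time(notes, factor):
    """Time scaling: factor 2 is augmentation, factor 0.5 is diminution."""
    return [(pitch, dur * factor) for pitch, dur in notes]

print(transpose(SUBJECT, 7))    # subject moved up a perfect fifth
print(invert(SUBJECT))          # rising intervals become falling ones
print(scale_time(SUBJECT, 2))   # augmentation: every note twice as long
```

All three are deterministic maps of the same input, which is the contrast with the dice game: nothing is sampled, and each derived voice is fully determined by the subject.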

Bach also utilized what we mechanistic interpretability researchers call Induction Heads. When the Alto voice enters with the “Answer,” it acts as an attention circuit specifically trained to recognize the sequence in the Soprano’s past and perfectly reconstruct it at the current time step. Meanwhile, the other heads calculate orthogonal vectors (the Countersubject) to ensure the dot product of the combined voices perfectly satisfies the vertical rules of harmony.

If we return to Ramanujan, Bach’s polyphony represents the full, unconstrained analytic continuation of the harmonic equations. While Mozart collapsed the variables into a degenerate case (a rigid loop in C Major), Bach allowed the variables to become complex numbers, unlocking all 24 keys and forcing the equation to expand dynamically across the complex plane.

The Convergence

Whether we are engineering modern Transformers, calculating Ramanujan sums, or analyzing 18th-century manuscripts, the computational goal remains identical. Language generation and music generation are both, ultimately, a search for mathematical symmetry across time. Mozart mapped it via hardcoded masking and stochastic geometry; Bach calculated it via deep contrapuntal attention matrices; and Ramanujan provided the equations that prove they are all navigating the exact same latent space.


Source: https://toooold.com/2026/03/16/m_b_r_game.html