Summary

Human languages are organised by pressures that often shorten syntactic dependencies and concentrate much predictive information in short spans — a robust quantitative pattern documented across languages that predates any machine learning system.11 Large language models (LLMs) are trained on human language and inherit this structure. We propose that this inherited locality, rather than any architectural constraint, may help shape creativity in LLMs — producing a sequential analogue of the patch-mosaic creativity mechanism recently identified in convolutional diffusion models. Kamb & Ganguli show that two architectural inductive biases in CNNs — locality of the receptive field and translational equivariance — mathematically preclude learning the ideal score function, producing instead a locally consistent patch mosaic with quantitatively predictable outputs (median r² = 0.94–0.96).1 LLMs lack these architectural constraints, but we argue they face a functionally equivalent constraint sourced elsewhere: in the statistical structure of the language they model. This claim is supported at three levels — cross-linguistic dependency minimisation,11 the concentration of next-token predictive information in recent context formalised as m-local entropy,13 and the empirically documented failure of LLMs to robustly use distant context even when available.10 We conjecture that this inherited locality may help shape a sequential patch-mosaic mechanism in LLMs, and we propose a falsifiable research programme to test it — including a cross-linguistic prediction: LLMs trained on languages with higher m-local entropy should exhibit different effective locality radii and different creativity signatures. The novelty lies in the theoretical bridge from diffusion-model creativity theory to language generation via the structure of human language itself.

Introduction

Large language models produce outputs that are simultaneously novel and locally coherent. The question of where this creativity comes from has attracted sustained theoretical interest, yet a rigorous mechanistic account has remained elusive. Most explanations invoke model scale, emergent representation, or distributional interpolation — properties of the network. We propose that the more productive question is prior to all of these: what is the structure of the object the network is trained on?

Human languages are not arbitrary symbol sequences. They are shaped by millennia of cognitive pressure toward communicative efficiency — pressure that systematically favours keeping syntactically related words close together, making nearby context unusually informative for predicting what comes next. An LLM trained on human language inherits this structure. We propose that this inherited locality — the statistical concentration of predictive information in short spans, traceable to properties of human language itself — is a candidate factor shaping creativity in LLMs, by a mechanism analogous to one recently identified in a very different class of models.

Kamb & Ganguli1 provide the theoretical lens. They show that two architectural inductive biases in convolutional neural networks — locality of the receptive field and translational equivariance via weight sharing — mathematically preclude learning the ideal score function, producing instead a locally consistent patch mosaic: generated images assembled from patches drawn from heterogeneous training examples, locally coherent but globally novel. Their analytic framework predicts individual trained-model outputs with median r² of 0.94–0.96 across four benchmark datasets. The paper explicitly frames extension to transformer-based LLMs as "entirely uncharted."1 The present perspective proposes the direction that extension should take — not by looking for architectural constraints in LLMs, which are largely absent, but by asking whether the training data itself supplies the locality that architecture supplies in CNNs.

The patch-mosaic mechanism

The theoretical machinery of Kamb & Ganguli rests on a central observation: a model trained to recover the optimal score function of its training distribution — the gradient of the log-probability — would, if successful, memorise the training data. Any deviation from this ideal score is therefore the locus of creativity. The two architectural constraints of CNNs each independently force a structured departure from the ideal.

Locality forces each pixel's dynamics to depend only on the local posterior over training patches, rather than the global posterior over whole images. Equivariance, arising from weight sharing, removes each pixel's knowledge of its absolute position, compelling it to match training patches from any location. Combining both constraints yields the equivariant local score (ELS) machine, in which each pixel independently selects the best-matching patch from any location in any training image. The result, during the reverse diffusion process, is an image assembled from a mosaic of locally consistent patches from heterogeneous sources — novel in global composition, coherent in local detail.1
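The mechanism can be caricatured in one dimension. The sketch below is our own toy hard-assignment version — not the closed-form analytic score of ref. 1 — in which each position adopts the centre of the best-matching local window taken from any position of any training sequence:

```python
def patch_mosaic(x, train_seqs, radius=1):
    """Toy 1-D caricature of the ELS machine: replace each interior
    position of `x` with the centre of the best-matching local window
    drawn from ANY position of ANY training sequence.

    Locality: only the width-(2*radius+1) window around a position is
    compared. Equivariance: windows from all positions compete equally.
    A single hard-assignment step, not the analytic score of ref. 1.
    """
    w = 2 * radius + 1
    windows = [tuple(seq[i:i + w]) for seq in train_seqs
               for i in range(len(seq) - w + 1)]
    out = list(x)  # boundary positions are left unchanged
    for i in range(radius, len(x) - radius):
        local = x[i - radius:i + radius + 1]
        best = min(windows, key=lambda win: sum((a - b) ** 2
                                                for a, b in zip(win, local)))
        out[i] = best[radius]
    return out

# Two training "images"; the input straddles both
train = [[0, 0, 0, 1, 1, 1], [5, 5, 5, 9, 9, 9]]
print(patch_mosaic([0, 0, 2, 6, 9, 9], train))  # -> [0, 0, 1, 5, 9, 9]
```

The output stitches its left half from the first training sequence and its right half from the second — locally coherent everywhere, yet appearing verbatim in neither source.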

A striking corollary is that the well-documented failure mode of diffusion models — the generation of anatomically impossible images with incorrect numbers of limbs or digits — is a direct prediction of the theory. Local patches are individually plausible, but they carry no global anatomical commitment; distant patches are free to disagree about the underlying object's structure. This is not a bug requiring correction but a logical consequence of the patch-mosaic mechanism.1

Human language structure as the source

In CNNs, the source of locality is unambiguous: it is the architecture. The receptive field is a hard structural boundary, enforced regardless of training. In transformer-based LLMs, full self-attention removes this constraint entirely — every token can attend to every other token within the context window.2 This means that if locality is operating in LLMs, its source must lie elsewhere. We argue it lies in the training data — specifically, in properties of human language that predate any machine learning system.

Natural language is not local in the strong sense of forbidding long-range dependency. It is local in the weaker sense that many syntactic relations and much next-token predictive information are concentrated in short spans, while discourse, reference, and reasoning can impose genuinely long-range commitments. An LLM trained on human language inherits this bias.

We distinguish three evidential levels. The first is the primary causal claim: human language is structured with a locality bias for reasons of cognitive efficiency that have nothing to do with LLMs. The second and third are consequences: LLMs absorb this structure and manifest it in their generative behaviour.

The primary claim — Human languages minimise dependency length. Using parsed corpora of 37 typologically diverse languages, Futrell, Mahowald & Gibson provide large-scale cross-linguistic evidence that dependency lengths — the distances between syntactically related words — are shorter than chance in every language examined.11 This dependency length minimisation is well-motivated by processing efficiency: shorter dependencies reduce memory load during parsing and production, and the hypothesis has been invoked to explain many of the most striking recurring properties of human languages. The result does not establish that language is local in a hard sense, but it establishes that the object being modelled is organised by pressures that concentrate syntactically relevant relations nearby — before any neural network encounters it. An LLM trained on such a corpus does not choose this locality structure; it inherits it.
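The measurement itself is simple to state. Below is a minimal sketch — a toy parse with a naive free-order random baseline, far simpler than the controlled baselines of ref. 11, which account for projectivity and tree structure:

```python
import random

def dependency_length(heads):
    """Total dependency length: sum of |i - head(i)| over words, where
    `heads[i]` is the index of word i's syntactic head (the root points
    to itself and contributes zero)."""
    return sum(abs(i - h) for i, h in enumerate(heads))

def random_baseline(heads, n_samples=1000, seed=0):
    """Mean dependency length over random reorderings of the words.
    (A free-order baseline; the published analyses also control for
    projectivity and tree structure.)"""
    rng = random.Random(seed)
    n = len(heads)
    total = 0.0
    for _ in range(n_samples):
        order = list(range(n))
        rng.shuffle(order)
        pos = {w: p for p, w in enumerate(order)}
        total += sum(abs(pos[i] - pos[h]) for i, h in enumerate(heads))
    return total / n_samples

# Toy parse of "the cat sat on the mat" (0-indexed heads, "sat" is root)
heads = [1, 2, 2, 2, 5, 3]
print(dependency_length(heads))          # -> 6
print(round(random_baseline(heads), 1))  # well above the observed value
```

The observed length falling consistently below the random baseline, sentence after sentence and language after language, is the dependency length minimisation signature.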

First consequence — Predictive information concentrates in recent context. Khandelwal et al. showed that neural language models are highly sensitive to nearby tokens and much less sensitive to ordered structure in the distant past: beyond roughly 200 tokens, distant context behaves more like diffuse topical signal than fine-grained sequential structure.12 Someya et al. formalise this with m-local entropy — an information-theoretic measure of how effectively the preceding m symbols disambiguate the next symbol — and show that languages with higher m-local entropy are measurably harder for both Transformer and LSTM language models to learn.13 This is the primary claim showing up in the model: the locality structure of human language leaves a measurable signature in how much information LLMs can extract from nearby versus distant context.
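The quantity can be illustrated with a plug-in estimator. The sketch below is the naive empirical version, for illustration only; the formal definition in ref. 13 differs in detail:

```python
import random
from collections import Counter
from math import log2

def m_local_entropy(seq, m):
    """Plug-in estimate (in bits) of H(next symbol | preceding m
    symbols): how much uncertainty remains after seeing an m-gram
    context. Lower values mean nearby context disambiguates more."""
    ctx_counts, pair_counts = Counter(), Counter()
    for t in range(m, len(seq)):
        ctx = tuple(seq[t - m:t])
        ctx_counts[ctx] += 1
        pair_counts[(ctx, seq[t])] += 1
    n = sum(ctx_counts.values())
    h = 0.0
    for (ctx, sym), c in pair_counts.items():
        h -= (c / n) * log2(c / ctx_counts[ctx])
    return h

# A perfectly 1-local sequence: the previous symbol determines the next
deterministic = list("abab" * 50)
# An i.i.d. sequence: the previous symbol carries no information
rng = random.Random(0)
noisy = [rng.choice("ab") for _ in range(200)]
print(m_local_entropy(deterministic, 1))  # -> 0.0 (fully disambiguated)
print(m_local_entropy(noisy, 1))          # close to 1 bit (context useless)
```

Human language sits between these extremes, and the cross-linguistic prediction below turns on exactly where.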

Second consequence — LLMs underuse distant context even when available. Liu et al. find that long-context language models do not robustly exploit relevant information across positions: performance is strongest when key information appears near the beginning or end of the context, and degrades substantially when it falls in the middle.10 The architecture does not force this — full attention is available. The training signal does: a model trained on dependency-minimised language develops a strong prior toward local resolution of uncertainty, and that prior persists even when global information is nominally accessible.
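The probe behind this finding is straightforward to construct. The sketch below builds only the prompts — scoring them against a model is the experiment — and all names and data are illustrative, not drawn from ref. 10:

```python
def needle_prompt(distractors, needle, position):
    """Build a 'lost in the middle'-style probe: splice the key fact
    (`needle`) into a list of distractor sentences at a fractional
    `position` (0.0 = start, 1.0 = end). Scoring a model's answers as
    `position` varies traces the positional accuracy curve."""
    k = round(position * len(distractors))
    return " ".join(distractors[:k] + [needle] + distractors[k:])

# Illustrative data; in a real probe the distractors would be retrieved
# documents and the needle a question-relevant passage.
distractors = [f"Filler sentence {i}." for i in range(10)]
needle = "The access code is 4712."
start = needle_prompt(distractors, needle, 0.0)
middle = needle_prompt(distractors, needle, 0.5)
end = needle_prompt(distractors, needle, 1.0)
```

The inherited-locality account predicts the observed U-shape: accuracy highest when the needle sits near the edges of the context, lowest in the middle.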

Together these three levels describe a causal chain: human language is structured with a locality bias (primary claim) → LLMs trained on it develop correspondingly local inductive biases (first consequence) → those biases persist in generation even when the architecture does not enforce them (second consequence). We call the result inherited locality — locality whose source is the training data rather than the network topology. Unlike CNN locality, inherited locality is dynamic (its scale varies with the type of prediction), heterogeneous across attention heads,3 and content-dependent. These are real differences from the architectural case, and they are why the conjecture remains soft.

One hard architectural feature does contribute: the causal masking of autoregressive LLMs makes the future permanently inaccessible — locality in the temporal direction, asymmetric but firm. Relative positional encodings such as RoPE4 introduce a position-generalising bias that is a partial analogue of CNN translational equivariance — sharing the shift-relational property but not the global weight-tying that makes the CNN case analytically clean. The analogy should not be overstated.

The sequential patch-mosaic conjecture

We propose that effective locality, if it operates as described above, could drive a sequential analogue of the spatial patch mosaic. During autoregressive generation, the model at each step faces a prefix that almost certainly did not appear verbatim in the training corpus. It must generalise — and one natural description of how it does so is to match locally consistent spans from the training distribution to the current context and extend from them.

A critical step that the spatial case does not require — but the linguistic case demands — is specifying what the patch unit is. In images, a patch is a fixed rectangular region of pixels; its boundaries are unambiguous. In text, no equivalent natural decomposition exists. Candidate units include: individual tokens or n-grams; syntactic phrases or clauses; discourse-level moves (argument, elaboration, contrast); latent template clusters in embedding space; or nearest-neighbour continuation regions identified by retrieval. The right unit may not be fixed but may vary by task, domain, and position in generation. This underspecification is a genuine weakness of the conjecture as currently stated, and resolving it is a prerequisite for any formal theory.

With this caveat registered, the conjecture runs as follows: the generated sequence is assembled from locally consistent textual spans drawn from heterogeneous regions of the training corpus. Adjacent spans must agree on syntactic and semantic continuity; distant spans are free to originate from entirely different sources and genres. The space of such combinations is combinatorially vast — and, if the analogy with the spatial mechanism identified by Kamb & Ganguli holds, exploring it is what constitutes the model's linguistic creativity.

| Dimension | CNN diffusion models | Large language models |
| --- | --- | --- |
| Source of locality | Network architecture (independent of training data) | Training data — inherited from the statistical structure of human language |
| Locality type | Architectural (fixed receptive field) | Functional (statistical signal concentration) |
| Locality scale | Uniform, layer-determined | Dynamic, content- and head-dependent |
| Hard constraint | Translational equivariance (convolutional weight sharing); finite receptive field | Causal masking (architectural, hard); RoPE distance decay (inductive bias, soft — partial analogue only) |
| Patch unit | Spatial image patch (fixed, receptive-field-defined) | Undefined; candidate units include token n-gram, syntactic phrase, clause, discourse move, latent cluster, retrieval span |
| Mosaic coherence | Overlapping pixel agreement | Syntactic & semantic continuity |
| Failure mode | Extra fingers (local coherence, global anatomy failure) | Long-range consistency failure (referential drift, logical contradiction, narrative incoherence) |
| Creativity mechanism | Novel spatial combinations of training patches | Novel sequential combinations of training spans |
Table 1 | Structural comparison of locality mechanisms. Comparison of locality as it operates in convolutional diffusion models (established analytically1) versus large language models (conjectural). The deepest asymmetry is the source row: CNN locality is architectural and would operate on any training data; LLM locality is proposed to be inherited from the statistical structure of human language itself. Five disanalogies are addressed in the main text.

Predictions and implications

If the sequential patch-mosaic conjecture is correct, it generates several testable predictions. We state them conditionally, since the underlying mechanism is not yet established.

Scaling of local versus global coherence. As model scale and training data increase, local fluency should improve monotonically — more patches, more finely calibrated local selection. Global coherence over long documents should improve more slowly and depend on architectural features that extend the effective locality scale (such as longer context windows or sparse attention mechanisms). This asymmetry is broadly consistent with observed LLM behaviour but constitutes a stronger, mechanistic prediction that can be tested by systematically varying context window size while holding model scale constant.

The training-dynamics window. Bonnaire et al.5 identify two timescales in diffusion model training: τ_gen, at which high-quality generation begins, and τ_mem, at which memorisation emerges, with τ_mem scaling linearly with dataset size. We conjecture — but do not establish — that an analogous phenomenon operates in LLM training. This transfer is not straightforward: diffusion training dynamics and autoregressive training dynamics differ in important respects, and the Bonnaire et al. result applies to the optimally trained continuous-score setting, not the discrete next-token-prediction objective. We include the prediction as a candidate conjecture requiring independent investigation.

Attention architecture and effective locality. Sliding-window attention architectures6 deliberately reintroduce hard locality constraints structurally similar to CNN receptive fields. If the hypothesis is correct, such models should exhibit greater analytic tractability and a sharper version of the patch-mosaic mechanism at the cost of reduced long-range dependency modelling. Comparing full-attention and sliding-window models on local fluency versus document-level consistency metrics — holding all else constant — would constitute a direct test.

A partial failure-mode parallel. Diffusion models generate anatomically inconsistent images because excessive locality at late generation steps allows distant patches to disagree about global structure.1 LLMs exhibit long-range inconsistency — plot drift, referential failure, logical contradiction across long arguments — that is consistent with a patch-mosaic account. We note, however, that this consistency is not strong evidence: long-range failure in LLMs has multiple plausible causes including position bias, context underuse, decoding artefacts, and alignment training effects.10 The failure-mode parallel motivates the hypothesis but does not confirm it; distinguishing a patch-mosaic explanation from these alternatives requires controlled experiments.

What makes drift relevant here is not simply that it occurs, but that it typically preserves local fluency while compromising global commitment — exactly the asymmetry a locality-based account would predict. LLM drift is not merely a generic weakness in long-form generation; it is a diagnostic signature consistent with effective locality, in which nearby constraints are satisfied while distant commitments degrade as they fall outside the model's effective coherence radius. Drift therefore belongs in the category of behavioural evidence for a bounded effective locality scale — not proof of the full patch-mosaic mechanism, but an especially intuitive behavioural signature that makes the conjecture worth testing.

Where the analogy breaks

Intellectual honesty requires stating the disanalogies as clearly as the analogies. There are five respects in which the CNN–LLM comparison is imperfect or incomplete.

Architectural versus inherited locality. This is the deepest asymmetry and the one the new framing introduces most sharply. In CNNs, locality is a structural property of the network: it exists regardless of what the network is trained on, and it would produce patch-mosaic behaviour even on a training set with no locality structure whatsoever. In LLMs, we are proposing that locality is a property of the training data, absorbed through learning. This means LLM locality is contingent — it could in principle vary with the training corpus, could be partially overridden by fine-tuning, and might differ across languages. The CNN case is clean precisely because locality is architectural; the LLM case is soft precisely because it is not.

Hard versus soft locality. In Kamb & Ganguli, locality is a hard structural constraint from which closed-form analytic predictions follow with r² > 0.94. In LLMs, inherited locality is a statistical tendency with important exceptions. Tasks involving copying, retrieval, formal logical derivation, and long-range narrative consistency may depend substantially on distant context.10 A theory built on soft locality cannot achieve the same analytic precision, and overstating its hardness would be a category error.

The patch unit is undefined. The spatial patch in the CNN setting is a well-specified object: a rectangular region of pixels with a size determined by the receptive field. No equivalent natural unit exists in language. Until an empirically grounded patch unit is identified — whether token n-gram, syntactic phrase, latent cluster, or retrieval-based span — "sequential patch mosaic" remains an evocative metaphor rather than a formal mechanism.

Equivariance is only partially analogous. CNN translational equivariance arises from convolutional weight sharing and is global: every weight in the network participates in enforcing it. RoPE introduces relative-position sensitivity and distance decay within the attention kernel, but does not share weights across positions in the same way. This is a family resemblance, not an isomorphism.

Attention is not a perturbation. In the CNN case, attention is an add-on to an otherwise local architecture, and its non-local contribution can be measured as a gap between the ELS machine prediction (r² ~ 0.94) and the attention-enabled UNet (r² ~ 0.77). In LLMs, attention is the primary mechanism. There is no attention-free baseline against which to measure the non-local contribution. Any locality-based theory of LLMs must account for, not subtract, attention.

An empirical research programme

The conjecture becomes scientifically productive only if it generates falsifiable predictions that distinguish it from alternative accounts. Because the new framing locates the source of locality in human language structure rather than network architecture, it generates one prediction that the purely architectural account cannot: a cross-linguistic signature. We outline five experimental directions.

Cross-linguistic creativity comparison. Human languages vary in dependency structure and m-local entropy. If inherited locality shapes creativity via the patch-mosaic mechanism, LLMs trained on languages with weaker short-range predictive concentration should exhibit measurably larger effective locality radii, and may show different creativity signatures — more globally committed generation, less local-patch recombination. This is the sharpest prediction the new framing affords, and it is directly testable using existing multilingual training infrastructure and Someya et al.'s m-local entropy framework.13

Effective context radius estimation. Measure loss or next-token probability as context is truncated at increasing distances from the current position. If inherited locality governs generation in the manner we propose, loss should degrade sharply within a characteristic radius and plateau for more distant truncations in most settings — with systematic exceptions for retrieval-dependent tasks. Comparison of these radii across model families, scales, and training languages would directly test whether the locality radius tracks the linguistic locality structure of the training corpus.
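The truncation protocol requires only a generic scoring interface. In the sketch below, `score_fn` and the toy bigram stand-in are our own illustrative assumptions — in a real experiment `score_fn` would wrap a forward pass of the model under study:

```python
from math import log

def truncation_curve(score_fn, tokens, radii):
    """Mean next-token log-probability when the context is truncated to
    its last r tokens, for each radius r. A sharp rise followed by a
    plateau locates the effective locality radius."""
    curve = {}
    for r in radii:
        scores = [score_fn(tokens[max(0, t - r):t], tokens[t])
                  for t in range(1, len(tokens))]
        curve[r] = sum(scores) / len(scores)
    return curve

def bigram_score(ctx, token):
    """Toy stand-in model in which only the immediately preceding token
    matters, so the curve plateaus at radius 1 -- an effective locality
    radius of one token."""
    return log(0.9) if ctx and ctx[-1] == token else log(0.1)

curve = truncation_curve(bigram_score, list("aabbaabb"), radii=[1, 2, 4])
# For this stand-in the curve is flat beyond r = 1, by construction.
```

For a trained LLM, the analogous plateau point — and its dependence on the training language — is precisely the quantity the cross-linguistic prediction concerns.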

Patch-source attribution. For a generated span, identify whether its nearest-neighbour support in the training corpus originates from a single contiguous source or from multiple distant sources. A patch-mosaic mechanism predicts the latter: different portions of the generated output should trace to heterogeneous training regions. This test requires a retrieval index over the training corpus and is sensitive to the choice of patch unit — making it simultaneously a test of the mechanism and a method for operationalising the patch. We note that the training corpora of most frontier LLMs are unavailable or only partially reconstructible; this experiment is therefore most immediately feasible in controlled-corpus, open-data, or synthetic-training settings, and should be scoped accordingly.
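The shape of such a test can be sketched with a deliberately crude similarity signature — character n-gram overlap standing in for a real retrieval index, over an invented two-document "corpus":

```python
def ngram_set(text, n=3):
    """Character n-grams of a string (a cheap similarity signature)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def nearest_source(span, corpus):
    """Id of the corpus document with the largest character-n-gram
    overlap with `span` -- a toy proxy for the retrieval index a real
    attribution study would need."""
    grams = ngram_set(span)
    return max(corpus, key=lambda doc_id: len(grams & ngram_set(corpus[doc_id])))

def attribution_profile(spans, corpus):
    """Nearest source for each generated span. A patch-mosaic mechanism
    predicts heterogeneous sources across adjacent spans."""
    return [nearest_source(s, corpus) for s in spans]

# Invented mini-corpus and a hypothetical generated output
corpus = {
    "recipes": "whisk the eggs and fold in the flour gently",
    "legal": "the party of the first part shall indemnify",
}
spans = ["fold in the flour", "shall indemnify"]
print(attribution_profile(spans, corpus))  # -> ['recipes', 'legal']
```

A heterogeneous profile like this one, obtained over a real training corpus with a real retrieval index and a principled span segmentation, is what the mechanism predicts.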

Full-attention versus sliding-window comparison. Compare architecturally matched full-attention and sliding-window models on local fluency (sentence-level perplexity, n-gram coherence) and document-level consistency (coreference accuracy, plot consistency scoring, logical entailment over long arguments). The patch-mosaic hypothesis predicts that sliding-window models show comparable local fluency but greater systematic failure on long-range consistency tasks — a pattern distinct from a simple capacity account.

Training-time memorisation onset. Test whether LLM training exhibits a τ_gen / τ_mem structure analogous to Bonnaire et al.5 by measuring memorisation rate and generation quality as a function of training step and dataset size. If τ_mem scales with dataset size while τ_gen does not, this would support a common training-dynamics picture across model classes — though the mechanism connecting that picture to the patch-mosaic conjecture would require separate argument.
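Measuring the onset needs only an operational memorisation rate. The sketch below uses verbatim prefix matching as a crude proxy, with invented toy data; real measurements use token-level spans over deduplicated corpora:

```python
def memorisation_rate(samples, train_text, k=30):
    """Fraction of generated samples whose first k characters occur
    verbatim in the training text -- a crude proxy for the memorisation
    order parameter to track against training step and dataset size."""
    return sum(1 for s in samples if s[:k] in train_text) / len(samples)

# Illustrative data only
train_text = "the quick brown fox jumps over the lazy dog"
samples = ["the quick brown fox jumps over", "a slow green turtle crawls"]
print(memorisation_rate(samples, train_text, k=20))  # -> 0.5
```

Tracking this rate across checkpoints, for runs of varying dataset size, is the τ_mem measurement in miniature.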

Conclusion

We have proposed that the creativity of large language models may be rooted not in their architecture but in the structure of what they are trained on. Human languages, shaped by cognitive pressures toward communicative efficiency, exhibit a systematic locality bias: syntactic dependencies are shorter than chance, predictive information concentrates in recent context, and models trained on such language absorb this structure as an inductive bias. We conjecture that this inherited locality may help shape a sequential patch-mosaic mechanism in LLMs — the linguistic analogue of the spatial creativity mechanism identified by Kamb & Ganguli in convolutional diffusion models.1 We present this as a candidate organising principle and a falsifiable research conjecture, not as a demonstrated mechanism.

The deepest disanalogy with the CNN case is also the most generative: because LLM locality is inherited rather than architectural, it should vary with the training corpus. LLMs trained on languages with different dependency structures should exhibit measurably different effective locality radii — and potentially different creativity signatures. This cross-linguistic prediction is the sharpest consequence of the new framing, and the one most clearly unavailable to any account that locates the source of creativity in model architecture alone.

The conjecture becomes most valuable if it generates experiments that would not otherwise have been run. The empirical programme outlined above — cross-linguistic comparison, effective context radius estimation, patch-source attribution, sliding-window comparison, and training-time memorisation onset — is designed with that goal. The payoff of a successful theory would be substantial: a mechanistically grounded account of how the evolutionary history of human language, and the cognitive pressures that shaped it, continue to operate inside the generative models trained on its products.

References

  1. Kamb, M. & Ganguli, S. An analytic theory of creativity in convolutional diffusion models. Preprint at arXiv https://arxiv.org/abs/2412.20292 (2024). ICML 2025 Oral.
  2. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
  3. Clark, K., Khandelwal, U., Levy, O. & Manning, C. D. What does BERT look at? An analysis of BERT's attention. Proc. 2019 ACL Workshop BlackboxNLP, 276–286 (2019). Note: this study examines an encoder (BERT), not an autoregressive decoder; cited as suggestive evidence for head-level locality heterogeneity pending direct study in decoder models.
  4. Su, J. et al. RoFormer: enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024).
  5. Bonnaire, T., Urfin, B., Biroli, G. & Mézard, M. Why diffusion models don't memorize: the role of implicit dynamical regularization in training. Preprint at arXiv https://arxiv.org/abs/2505.17638 (2025).
  6. Jiang, A. Q. et al. Mistral 7B. Preprint at arXiv https://arxiv.org/abs/2310.06825 (2023).
  7. Austin, J. et al. Structured denoising diffusion models in discrete state-spaces. Adv. Neural Inf. Process. Syst. 34, 17981–17993 (2021).
  8. Biroli, G., Bonnaire, T., de Bortoli, V. & Mézard, M. Dynamical regimes of diffusion models. Nat. Commun. 15, 9957 (2024).
  9. Kadkhodaie, Z., Guth, F., Simoncelli, E. & Mallat, S. Generalisation in diffusion models arises from geometry-adaptive harmonic representations. Preprint at arXiv https://arxiv.org/abs/2310.02557 (2023). ICLR 2024 Oral.
  10. Liu, N. F. et al. Lost in the middle: how language models use long contexts. Trans. Assoc. Comput. Linguist. 12, 157–173 (2024).
  11. Futrell, R., Mahowald, K. & Gibson, E. Large-scale evidence of dependency length minimisation in 37 languages. Proc. Natl. Acad. Sci. U.S.A. 112, 10336–10341 (2015).
  12. Khandelwal, U., He, H., Qi, P. & Jurafsky, D. Sharp nearby, fuzzy far away: how neural language models use context. In Proc. 56th Annual Meeting of the Association for Computational Linguistics, 284–294 (Association for Computational Linguistics, 2018).
  13. Someya, T. et al. Information locality as an inductive bias for neural language models. In Proc. 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 27995–28013 (Association for Computational Linguistics, Vienna, 2025).