\chapter{AI as the Infrastructure of Modulation}\label{cha:ai}
\glsresetall

%%%%%%%%%\epigraph{[...] descending into the hidden abode of production means something else in the digital age. It means that we must also descend into the somewhat immaterial technology of modern-day computing, and examine the formal qualities of the machines that constitute the factory loom and industrial Colossus of our age. The factory was modernity’s site of production. The \enquote{non-place} of Empire refuses such an easy localization. For Empire, we must descend instead into the distributed networks, the programming languages, the computer protocols, and other digital technologies that have transformed twenty-first-century production into a vital mass of immaterial flows and instantaneous transactions. Indeed, we must read the never ending stream of computer code as we read any text (the former having yet to achieve recognition as a \enquote{natural language}), decoding its structure of control as we would a film or novel.}{\citeauthorfull{galloway2001} \cite*[82]{galloway2001}}


\epigraph{I am less certain about treating machine learning as automation. Learning from data [...] often sidesteps and substitutes for existing ways of acting, and practices of control, and it thereby reconfigures human-machine differences. Yet the notion of automation does not capture well how this comes about. The programs that machine learners \enquote{write} are formulated as probabilistic models, as learned rules or association, and they generate predictive and classificatory statements (\enquote{this is a cat}). If this transformed calculability is automation, then we need to understand the specific contemporary reality of automation as it takes shape in machine learning. We cannot conduct critical enquiry into how calculation will automate future decisions without putting the notions of calculation and automation into question.}{\citeauthorfull{mackenzie2017} \cite*[7-8]{mackenzie2017}}

The previous chapter examined how critique and resistance might be articulated under the institutional dynamics of control and its processes of subjectification. The management of information is central to those infrastructures, and \gls{ai} systems play an increasingly prominent role in its governance. Contemporary infrastructures analyse digital traces to generate personalised recommendations, assess relevance between users and content across search engines and social media platforms, and, with the emergence of \gls{genai} and \glspl{llm}, produce text, code, images, and other media (see Figure~\ref{fig:ai_domains} for a guiding illustration that positions the \gls{ai} domains and products mentioned throughout the chapter). What were once predictive or classificatory instruments have become systems capable of synthesising and reorganising knowledge at scale. Their outputs now participate directly in communication, cultural production, and decision-making. Advances in \gls{nn} research and transformer architectures have been decisive in enabling these developments. \Gls{genai} models do not operate solely by identifying statistical regularities; they synthesise linguistic and visual material by drawing on extensive training corpora. Through such recombinations, they participate in the formation, circulation, and reinterpretation of human knowledge. To understand how such architectures participate in the production of subjectivity, we must first trace the evolution of \gls{ai}, from early symbolic reasoning to statistical modelling, \gls{dl}, and the emergence of self-attentive transformer architectures.


\begin{marginfigure}
	\includegraphics[width=\textwidth]{images/ai_domains.png}
	\caption{An illustration of overlap and interplay between \gls{ai} domains leading to
		the \glspl{llm} such as ChatGPT (cf. \cite[47]{alomari2024})}
	\label{fig:ai_domains}
\end{marginfigure}

The present chapter traces the technical history and development of \gls{ai}, proceeding both chronologically and conceptually, in order to understand how these models acquire their representational and generative capabilities. It begins by outlining the trajectory of \gls{ai} research, distinguishing between the symbolic paradigm (\gls{symai}) and the statistical approaches that underpin \gls{dl} and \gls{genai}. This includes a discussion of \glspl{nn}, \gls{ssl}, and transformers as the technical backbone of modern \gls{genai} models. Beyond description, this technical overview serves a strategic purpose: it shows how \gls{ai}, even in its architectures, encodes specific logics of inference, representation, and control. These are not neutral design choices but material conditions that enable \gls{ai} systems to act as infrastructures of knowledge production, decision-making, and governance. The chapter, therefore, provides both the technical foundations and the conceptual scaffolding for analysing the potentiality of \gls{genai} as a distributed, non-symbolic agent of control.

\section{From Symbolic Rules to Statistical Inference: A Brief History of \gls{ai} and \gls{nlp}}\label{sec:ai_history}

\Gls{nlp} lies at the intersection of linguistics, computer science, and \gls{ai}, and aims to create computational systems that can interpret and handle human language data. It has been the ground for most of the breakthroughs in \gls{ai} development, especially in recent years (see \cite[22ff]{bommasani2022a} for a detailed analysis of the history of \gls{ai} and language). Given that, in some respects, especially short-term memory, the cognitive performance of an individual human is barely superior to that of other primates \parencite[127]{manning2022a}, it is no surprise that the groundbreaking advances in the artificial pursuit of a mind occurred in the domain of language. The transformative power of language has enabled \textit{Homo sapiens} to link individual minds into collective networks of cognition. Language, rather than individual brainpower, constitutes the machinery through which human intelligence scales, distributes, and accumulates collectively \parencite[127]{manning2022a}.

\Gls{ai} emerged in the mid-20\textsuperscript{th} century, grounded in the formal logics of symbolic representation. The foundational paradigm, now referred to as \gls{symai} or \gls{gofai}, treated intelligence as a computational process operating over discrete symbols according to explicitly programmed rules. \Gls{ai} systems under this logic were built to emulate deductive reasoning and problem-solving. The assumption was clear: if the world could be faithfully translated into a logical schema, machines could infer, deduce, and act rationally (see \cite[183]{eloff2021}). \textcite{manning2022a} characterises the \textbf{first era}, between 1950 and 1969, as a period of development marked by a profound lack of
knowledge about the structure of human language, \gls{ml}, and \gls{ai}. The
1956 Dartmouth Conference institutionalised these ambitions by defining \gls{ai} as
\enquote{the science and engineering of making intelligent machines}
\parencite[195]{montanari2025}. Early research during this period focused primarily on narrow, rule-based systems, particularly word-level translation lookups and simple mechanisms for handling inflectional forms and word order \parencite[128]{manning2022a}.
In parallel, Alan Turing made substantial contributions by introducing the famous \enquote{Turing Test} (or \enquote{Imitation Game}), designed to evaluate a machine's ability to imitate human intelligence and rationality, along with the foundational concept of a universal machine (see \cite[196]{montanari2025}). As Professor of Cognitive Robotics Murray Shanahan \parencite*[see][]{googledeepmind2025} and Meta's Chief AI Scientist Yann LeCun \parencite*[see][]{lexfridman2024} emphasise, the \textit{Turing Test} is an inadequate benchmark for assessing modern \gls{ai} models, but Turing's ideas nonetheless contributed to the conceptual foundation of the \enquote{prompt-based conversational machine} \parencite[196]{montanari2025}. In line with Turing's perspective, the underlying notion in the early imaginary of a future \gls{ai} was simple: if a machine could convincingly imitate a human in conversation, it would be considered intelligent.

Relying on handcrafted rule sets meant defining by hand the features of the object of interest; for instance, to recognise the digit six in an image, one might encode the features \enquote{a closed loop at the bottom} and \enquote{a curve rising to the right}. Such symbolic heuristics were sufficient so long as the data was clean and the context unambiguous. In the \textbf{second era} of \gls{ai} development, spanning roughly 1970 to 1992, these approaches were extended to more complex domains, most notably natural language. By attempting to formalise aspects of linguistic structure and meaning, researchers pushed the boundaries of rule-based systems. While these models demonstrated greater sophistication in handling linguistic patterns, they still relied on explicitly encoded knowledge and remained limited by the inherent rigidity of symbolic architectures \parencite[129]{manning2022a}. They could not generalise beyond predefined rules: when confronted with noise or shifting contexts, their logic collapsed. The result was a period of stagnation and disillusionment now remembered as the \enquote{AI Winters} between 1970 and 1980 \parencite[183]{eloff2021}.

But real-world ambiguity proved hostile to symbolic systems. As \gls{symai} attempted to scale into more complex domains like vision or language, it revealed its brittleness \parencite[183--184]{eloff2021}. Phenomenologically oriented philosophers were early critics of this paradigm, following the earlier work of Hubert Dreyfus (\cite*{dreyfus2009}),\sidenote{Originally published in 1972.} who argued that human intelligence was not symbolic, but embodied, situated, and fundamentally non-representational. Despite such critiques, \gls{symai} dominated the earlier decades of research in \gls{ai}. This rationalist framework aligned with early cognitive science’s attempts to model the mind as a rule-based machine of symbolic representation (see \cite[194--197]{montanari2025}). \Gls{dg} were also among the critics; for them, hierarchically structured learning and the projection of a central pattern were clearly inadequate:

\begin{quote}
	This is evident in current problems in information science and computer science, which still cling to the oldest modes of thought in that they grant all power to a memory or central organ. Pierre Rosenstiehl and Jean Petitot, in a fine article denouncing \enquote{the imagery of command trees} (centered systems or hierarchical structures), note that \enquote{accepting the primacy of hierarchical structures amounts to giving arborescent structures privileged status.... The arborescent form admits of topological explanation.... In a hierarchical system, an individual has only one active neighbor, his or her hierarchical superior.... The channels of transmission are preestablished: the arborescent system preexists the individual, who is integrated into it at an allotted place} (signifiance and subjectification).

	\citereset
	— \cite[16]{deleuze1987}
\end{quote}

\Gls{dg}’s critique of early \gls{ai} approaches centred on their rejection of hierarchical and centralised models, a rejection that constitutes one of the main pillars of their project. Their affirmative alternative was grounded in a connectionist and non-hierarchical understanding of thought \parencite[3ff.]{deleuze1987}, in opposition to arborescent or \textit{tree} structures. While their critique targeted the symbolic, rule-based systems of their time, it is striking how closely their vision anticipated the architectural principles underpinning contemporary \gls{ai} on a general level, particularly in its distributed, associative, and layered formations.\sidenote{Arguably, this also holds for the non-hierarchical functioning of \glspl{nn};
	however, it remains a matter of discussion whether \gls{genai} model
	architectures deploy a continuous subordination between different patterns and
	distributions. See the following sections for further discussion.} Nonetheless, the technological trajectory towards such architectures would take decades to materialise, revealing the prescient force of their philosophical intervention.
The critique raised by \gls{dg} can arguably be read as a philosophical foundation for the later
breakaway from \enquote{IF...THEN...} logic. This logic is also one of the
recurring analogical themes in the debate around algorithmic governmentality (see
Section~\ref{sec:crit_res}), where the analogy between programmable procedures and social regulation remains a central concern. The non-symbolic and connectionist structure of contemporary \gls{ai} systems is one of the strongest motivations for this study to move beyond theories of algorithmic governmentality and reflections that still treat \gls{ai} as if it operated through \enquote{IF...THEN...} architectures. It is also the reason for arguing that current debates can benefit from engaging \gls{dg}’s broader work beyond the \emph{Postscript}.

Subsequent \gls{ai} development aimed to overcome the shortcomings of symbolic rule systems and to find paths towards architectures that do not require explicitly defined instructions. The \textbf{third era}, from roughly 1993 to 2012, was marked by the sudden
abundance of what earlier \gls{ai} innovation had lacked most: \emph{data}. As the internet boom introduced a massive digital corpus, researchers shifted toward statistical learning, leading to the rise of data-driven \gls{nlp}. This shift replaced hand-coded rules with empirical models trained on annotated examples \parencite{maas2023}; models could now generalise from data rather than deduce from explicitly defined axioms. Initially, the dominant approach centred on relatively simple statistical techniques applied to modest amounts of text, often in the low tens of millions of words. Researchers extracted linguistic facts from these corpora, identifying regularities such as common collocations or syntactic structures. Yet, early attempts to model language understanding through these means remained limited in their ability to capture deeper semantic or contextual knowledge (see \cite[129]{manning2022a}). For instance, early statistical models revealed that certain types of words tended to appear together: names of places often occurred alongside personal references, while more abstract terms exhibited distinctive distributional patterns. However, such surface-level regularities provided only limited insight into the deeper structures of language. As it became evident that simple frequency-based methods were insufficient for capturing the complexity of linguistic meaning, the focus shifted toward building annotated linguistic resources, such as syntactic treebanks, lexical databases, and labelled datasets for named entity recognition. These resources formed the foundation for more reliable, supervised learning approaches (see \cite[129]{manning2022a}). Across these eras, general-purpose \gls{ai} development proceeded with ups and downs in activity, punctuated by a few early successes in neural network-based approaches, such as the McCulloch-Pitts model.
Among the early milestones was ELIZA, a rule-based program that mimicked a psychotherapist by matching keywords to scripted responses. Despite its simplicity, ELIZA gave the illusion of understanding and demonstrated the potential of machine conversation, though its developer emphasised that it was merely parodic \parencite{toloka2023}. Still, it signalled the beginning of natural language interaction with machines, laying the groundwork that statistical and later neural methods would build upon. By around 1997, considerably more advanced systems such as Deep Blue had been developed, alongside early attempts at \glspl{dnn} \parencite[197]{montanari2025}, but the dominant paradigm of \gls{ai} development remained highly dependent on labelled data and \gls{sl}.


Although the real transformation began in the early 2000s, the first significant fruits of the new direction appeared around 2013, which marks the \textbf{fourth and current era} in \gls{ai} development \parencite[129]{manning2022a}. The growing ability to process ever larger amounts of data allowed a new paradigm to emerge: rooted in \glspl{nn} inspired by the architecture of the brain, \emph{connectionism} became the dominant paradigm for further advances. These systems, now more broadly applied and clearly defined as \glspl{dnn}, learned not by logic but by adjusting distributed weightings across layered networks, which became the foundation for contemporary \gls{ml} and \gls{dl} systems. Exponential advances in computation enabled these networks to scale \parencite[184]{eloff2021} and eventually pushed the field towards \gls{ul}
methodologies, in which models are geared towards recognising patterns in the data without being explicitly told which features of the data point to what. For instance, whereas earlier supervised models learned to distinguish between cat and dog photos from images labelled by humans or other processes as either \textit{dogs} or \textit{cats}, \gls{ul} models examine a collection of unlabelled photos and try to find the patterns that set the two groups apart through specific characteristics; in other words, they work towards the substance of \textit{dogness} and \textit{catness}. On the \gls{nlp} front, linguistic units such as words or sentences came to be represented as vectors in high-dimensional vector spaces. Semantic and syntactic relationships were modelled not through rule-based analysis and pre-defined categories, but through the spatial proximity of these vectors \parencite[129]{manning2022a}. This approach turned out to be far more effective than earlier attempts at formalising linguistic meaning. Instead of hand-coding grammatical rules or manually annotating small corpora, models could now process large textual datasets and infer structure statistically. \Gls{dl} enabled systems to capture long-range dependencies in context and to identify meaning-level relationships through learned representations optimised across massive datasets (see \cite[129]{manning2022a}). Crucially, this reduced the need for manual labelling, as \gls{ul} techniques became dominant.


One of the most significant turning points came around 2018 with the successful implementation of the \gls{ssl} approach. \Gls{ssl} constitutes a special case of \gls{ul}: it not only makes models identify underlying structures in the data but also enables them to create their own training exercises through the prediction challenges they are subjected to \parencite[129]{manning2022a}. These include masking specific words in a text and predicting the correct or most fitting \glspl{token}, or guessing the next word in an abruptly cut text; in both cases, \gls{ssl} models learn by predicting missing elements from within the input itself. This method allowed models to learn linguistic regularities from massive unlabelled corpora, and it gave rise to pre-trained \gls{genai} models \parencite{maas2023}. The novelty that specifically enabled this leap was the \textit{transformer architecture}. Its core mechanism, self-attention, computes weighted dependencies between all tokens in a sequence, allowing the model to capture long-range relations regardless of how far apart the tokens are. This innovation enabled massive parallelisation and scalability \parencite{maas2023}. The availability of vast data, combined with the transformer architecture and the computational capacity to train it through many repeated passes over the data, has been crucial for applying the \gls{ssl} methodology to parse and accumulate huge amounts of unlabelled human language data.
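
As a hedged sketch of what it means for self-attention to compute weighted dependencies between all tokens, the widely used scaled dot-product formulation can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the sources cited above: \(X\) is the matrix whose rows are the token vectors of a sequence, \(W_Q\), \(W_K\), and \(W_V\) are learned projection matrices, and \(d_k\) is the dimensionality of the key vectors.
\[
	Q = XW_Q, \quad K = XW_K, \quad V = XW_V, \qquad
	\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]
The softmax is applied row by row, so each token receives a set of non-negative weights over all tokens in the sequence that sum to one, and its new representation is the correspondingly weighted average of the rows of \(V\). Because every token is compared with every other token in a single matrix product, the computation itself does not privilege neighbouring tokens, and all comparisons can be carried out in parallel.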

\section{Mayan Codices and Telepathic Broadcasts: Algorithmic Governance of Information before \Gls{genai}}\label{sec:old_ai}\marginnote{The title alludes to \citeauthorfull{burroughs1979}'s \parencite*[81]{burroughs1979} \citetitle{burroughs1979}; the relevant passage is quoted below.}


The earlier \gls{ai} implementations on the web are mainly classified as
recommender systems, which establish relations between different pieces of content and
filter accordingly. Their still widely relevant application began with
participatory internet culture, in which users also became contributors,
for example, on social media platforms.
Krasmann notes that this transition rendered
humans rapid data generators for the training sets:

\begin{quote}
	Thus far, we have determined that whereas the individual and disciplinary power seem to be cast in the same mold – the former being the product of the latter – the digital subject of the control society 2.0 appears to be an active subject able to make decisions – which in turn feeds the algorithms.

	\citereset
	— \cite[19]{Krasmann2017}
\end{quote}

This insight offers a precise entry point into the history of \gls{ai}. Long before the emergence of \gls{genai}, \gls{nn}-based \gls{ai} systems were integrated into infrastructures designed to sort, rank, and anticipate behaviour. Search engines, recommender systems, and ranking algorithms constructed profiles, inferred preferences, and organised interactions through relevance estimation (see \cite[26–30]{demir2019}). These systems already relied on a feedback-driven logic: user behaviour shaped algorithmic output, and algorithmic output shaped subsequent behaviour. When read through Deleuze’s diagram, such early \gls{nn} applications exhibit the operational dynamics of control: continuous capture, iterative adjustment, and subtle steering of conduct. Their core mode of operation can be summarised in the following loop:

\begin{enumerate}
	\item massive data collection from user interactions,
	\item indexing and probabilistic categorisation of behaviours,
	\item ranking and recommending content based on \textbf{relevance association},
	\item generating personalised information flows, recommendations,
	      associations,
	\item feeding back the gathered information into the user’s profile to update the personalised process (see Figure~\ref{fig:algosel} for an illustration of the process).
\end{enumerate}

\begin{figure*}[htbp]
	\includegraphics[width=0.85\textwidth]{images/AlSel_BA.png}
	\caption{Algorithmic Selection and Relevance Assignment Process (cf. \cite[241]{just2017})}
	\label{fig:algosel}
\end{figure*}

The anchoring process produced by the feedback loops between users and recommender systems (see the characterisation of anchors and endless while-loops in \cite[34–35]{demir2019}) established a correction mechanism based on the dividual traces of users, that is, the fragments of data assembled by algorithmic systems to construct \textit{profiles}. Each interaction became an input to a probabilistic model that then shaped the horizon of the next interaction. Platforms such as Facebook or YouTube did not need to coerce users; they governed behaviour through \textit{environmental modulation}, subtly reinforcing predictable patterns of attention and engagement \parencite[29–32]{demir2019}. \citeauthorfull{hui2015} describes this as a process of \enquote{disindividuation}:

\begin{quote}
	Under the guise of being free and friendly to use, we can see in this example that the modulation of social relations can actually lead to what we have called ‘disindividuation’ [...] the attention of each social atom (or ‘person’) is sliced into ever smaller pieces and dispersed across networks via status updates, interactions, and advertisements. [...] The ‘collective’ on Facebook becomes a distraction, a cause of the dissolution of structures within individuals, but not a site of new modes of empowerment.

	--- \cite[90]{hui2015}
\end{quote}

The recommendation systems and the algorithmic governance of information reflect the process of dividualisation that characterises control societies (see also \cite{Cheney2011}); the coherence of personal or collective agency is fragmented into algorithmically analysed micro-traces of digital history. While this dynamic dissolves the unitary subject, the constant personalisation of digital reality enacts what Deleuze calls modulation: an ongoing, fine-grained adjustment of the individual’s field of experience.
From an infrastructural perspective, these algorithms already governed information flows and the subjectification process, creating a precondition for the transition to generative systems. Whereas these early models merely filtered, ranked, and nudged, contemporary \gls{genai} systems move beyond the governance of information toward its generation.

However, returning to the question of the nature of \textit{control}, it is worth asking whether the institutional mechanisms of control were ever meant to be generative in the first place. Were the computational methods associated with control societies ever intended to communicate with individuals rather than simply act upon their traces? And are adaptation, flexibility, and the articulation of statistical inference sufficient to classify \gls{genai} systems as \glspl{dispositif} of control?
If critique and resistance are to be reconsidered under these conditions, the novelty introduced by generativity becomes a central concern. What does it mean for systems of control to produce, generate, and respond, rather than only to filter and anticipate? As a satirical analogy for the limits of what constitutes control, \citeauthor{burroughs1979} offers a definition of biocontrol in \citetitle{burroughs1979}:

\begin{quote}
	The biocontrol apparatus is prototype of one-way telepathic control. The subject could be rendered susceptible to the transmitter by drugs or other processing without installing any apparatus. Ultimately the Senders will use telepathic transmitting exclusively\ldots Ever dig the Mayan codices? I figure it like this: the priests -- about one per cent of population -- made with one-way telepathic broadcasts instructing the workers what to feel and when\ldots A telepathic sender has to send all the time. He can never receive, because if he receives that means someone else has feelings of his own could louse up his continuity. The sender has to send all the time [...]

	\citereset
	— \cite[81]{burroughs1979}
\end{quote}


One can read \citeauthor{burroughs1979}’s description as a useful contrast for distinguishing generative systems from earlier applications of \gls{ai}. At first glance, the early \gls{nn}-driven platforms already align with the institutional description of control societies. They dissolved enclosures, operated through environmental cues, and extracted dividual traces from users; this constituted the paradigm of \emph{algorithmic governance of information}. The emergence of \gls{genai} creates a different constellation. These models do not simply modulate existing flows of information but \emph{generate} content, narratives, and knowledge formations that participate directly in the shaping of subjectivity; the machinery of governance, therefore, becomes a machinery of production. I argue that while these systems maintain a strong resemblance to the \glspl{dispositif} of control, their generative capacity introduces a degree of novelty for thinking about critique and resistance. Determining whether this development extends the logic of control or marks a qualitatively distinct mode of operation requires two tasks:

\begin{enumerate}
	\item to open the black box of \gls{genai} and its transformer-based architecture;
	\item to examine how these models mediate human agency and the production of meaning.
\end{enumerate}

\section{\acrfull{dl} and \acrfull{genai}}\label{sec:genai}

At their core, contemporary generative systems are \glspl{nn}. A \gls{nn} is a computational architecture loosely inspired by biological \glspl{neuron}: each \gls{neuron} receives inputs, applies weights and biases, passes the result through an activation function, and transmits the signal forward \parencite[for one of the fundamental papers, see][]{rosenblatt1958}. What distinguishes \glspl{nn} from earlier symbolic systems is that they operate not by rule-following but by function approximation. By adjusting millions or even billions of parameters during training, these architectures learn statistical mappings between inputs and outputs that cannot be written down as explicit rules \parencite[see][]{rumelhart1986, lecun2015}. \Gls{dl} extends this principle by stacking many such layers. Depth allows the network to build hierarchical representations: lower layers detect relatively simple features, while higher layers capture progressively more abstract patterns (see Figure~\ref{fig:neural_network} for a simple illustration). Instead of being stored
in explicit symbols, meaning emerges from distributed patterns of activation spread across the network. This is what enables the modelling of highly non-linear relationships in data, which is crucial for handling the complexity of natural language, vision, and multimodal inputs \parencite[see e.g.][]{goodfellow2016, schmidhuber2015}.
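
A minimal formal sketch may make this description more concrete. The notation is an assumption introduced here for illustration and is not taken from the cited sources: \(\mathbf{x}\) is the input vector, \(\mathbf{w}\) and \(b\) are the weights and bias of a single \gls{neuron}, and \(\varphi\) is an activation function. A single \gls{neuron} then computes
\[
	a = \varphi\bigl(\mathbf{w}^{\top}\mathbf{x} + b\bigr),
\]
and a layer \(\ell\) with weight matrix \(W^{(\ell)}\) and bias vector \(\mathbf{b}^{(\ell)}\) applies the same operation to the output of the layer below it:
\[
	\mathbf{h}^{(\ell)} = \varphi\bigl(W^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}\bigr), \qquad \mathbf{h}^{(0)} = \mathbf{x}.
\]
Stacking \(L\) such layers yields a composed function \(f(\mathbf{x}) = \mathbf{h}^{(L)}\). Under this sketch, what the network \enquote{knows} resides entirely in the numerical values of the weights and biases adjusted during training, not in any rule that could be read off symbol by symbol.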

\Gls{genai} models, particularly \glspl{llm}, are thus best understood as specialised deep \glspl{nn}. Instead of operating on predefined linguistic rules, they function by encoding massive distributions of textual patterns into weight configurations. Their generativity stems from this architecture; by sampling from learned distributions, they produce novel outputs aligned with the statistical structure of language. In this sense, the architecture itself is the key to their meaning-making capacities, their seemingly impressive way of binding distant concepts.
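
What sampling from a learned distribution amounts to can be sketched, under common assumptions, as follows. The symbols are illustrative and not drawn from the cited sources: \(z_i\) is the score the network assigns to vocabulary item \(i\) given the preceding tokens \(x_{<t}\), and \(T > 0\) is a temperature parameter of the kind many implementations expose.
\[
	p(x_t = i \mid x_{<t}) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}
\]
The next token is drawn at random according to these probabilities and appended to the sequence, and the procedure is then repeated. Lower values of \(T\) concentrate probability on the highest-scoring tokens, while higher values flatten the distribution, which is one reason the same prompt can yield different outputs.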


\begin{figure}[htbp]
	\includegraphics[width=\textwidth]{images/neural_network.png}
	\caption{A Simplified Illustration of a \gls{nn} (cf. \cite{subramaniam2019})}
	\label{fig:neural_network}
\end{figure}


But how does this generativity function? How, in fact, is meaning produced? Does the machinery itself offer clues about the nature of the content that \gls{genai} models generate? Beyond the corporations that develop the most sophisticated models, and beyond the specific configurations they deliberately add, there lies a common architecture animating these meaning-making systems. What remains, if not to look directly into the machine? Whether because of its complexity or because of an assumed lack of insight to be gained, this path remains largely unexplored in critical theories of contemporary sociotechnical developments. Following the discussion in the last chapter, and precisely because of this lack, I turn to an analysis of the specific features of \gls{genai} models, focusing on how \textbf{a transformer-based \gls{llm}} produces outputs.


\subsection{Vector Spaces and Collapsing
	Dimensions}\label{sec:dimensionality_reduction}


After the breakthroughs in \gls{nn} architectures over the past decade (as outlined in Section~\ref{sec:ai_history}), many influential designs for sequence modelling, particularly in machine translation and \gls{nlp}, were based on \glspl{rnn} and \glspl{cnn}. Despite their advances over earlier implementations, these models faced a fundamental limitation often described as the problem of \textit{locality}: their difficulty in capturing long-range dependencies across sequences \parencite[see e.g.][]{bengio1994}. In \gls{ml}, data must be reshaped into a form the model can process. For \gls{nlp}, this requires vectorising language into a high-dimensional space where \glspl{token} are assigned coordinates and scales.\sidenote{Since a word is the most common form of a \gls{token} in \gls{nlp}, vectorisation means representing it as a vector \( (x_1, \dots, x_n) \), where each component \(x_i\) corresponds to a dimension in the embedding space. The number of dimensions \(n\) is fixed by the model’s architecture and determines how tokens can be compared and transformed. For instance, common embeddings use \(n=300\) dimensions in \textit{word2vec} or \(n=768\) in \textit{BERT} (see \cite{mikolov2013}).} Once embedded, relations between elements can be computed algebraically, allowing the model to operate within what \citeauthor{mackenzie2017}\sidenote{Since two different authors with the last name MacKenzie are cited in this paper, note the distinction between Iain Mackenzie, cited as \citeauthor{mackenzie2018}, and Adrian Mackenzie, cited as \citeauthor{mackenzie2017}.} \parencite*[51]{mackenzie2017}  calls an \enquote{expanding epistemic space}, where results emerge from geometric proximity and transformation.
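
To make the notion of geometric proximity more tangible, one common measure is cosine similarity. It is offered here as an illustrative assumption rather than as the measure used by any particular model, with \(\mathbf{u}\) and \(\mathbf{v}\) standing for two token vectors of dimension \(n\):
\[
	\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|}
	= \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^{2}} \, \sqrt{\sum_{i=1}^{n} v_i^{2}}}
\]
As a toy example in two dimensions, \(\mathbf{u} = (1, 2)\) and \(\mathbf{v} = (2, 4)\) point in the same direction and give \(\cos(\mathbf{u}, \mathbf{v}) = 10 / (\sqrt{5} \cdot \sqrt{20}) = 1\), whereas \(\mathbf{u}\) and \(\mathbf{w} = (2, -1)\) are orthogonal and give \(\cos(\mathbf{u}, \mathbf{w}) = 0\). In an embedding space, values close to one are conventionally read as indicating that two tokens occur in similar contexts.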

Once vectorisation is performed, the next question is how to most effectively represent the resulting vector space. In its raw form, data often contains a very large number of features, which translate into dimensions that are too burdensome for models to handle directly. This motivates the development of \textit{dimensionality reduction} techniques. Far predating the rise of \gls{genai}, dimensionality reduction is a foundational method in \gls{ml} that projects high-dimensional data, such as raw image pixels or token embeddings, into a compressed latent space that is more tractable for statistical operations prior to training. These latent representations are not merely a technical convenience; they constitute the terrain upon which inference, generalisation, and generation take place. In this process, each data object, whether a sentence, an image, or a behavioural trace, is mapped onto a point or trajectory within a lower-dimensional space. The resulting representations emphasise the most \emph{distinctive} features relevant to the dataset as a whole. In contemporary \gls{dl} models, analogous forms of dimensionality reduction occur within the intermediate layers of the network, since training requires the data to be represented at different levels of abstraction. These transformations do not necessarily reduce dimensionality and can at times expand it, but they nonetheless perform a comparable compressive or restructuring function. To illustrate this more intuitively, we can turn to earlier, more explicit implementations of dimensionality reduction in classical \gls{ml}. Indeed, dimensionality reduction methods such as \gls{pca} are often used to \enquote{flatten the vector space down into lower-dimensional subspaces} \parencite[73]{mackenzie2017}. This approach reduces complexity, highlights dominant patterns, and improves the efficiency of subsequent learning tasks (see e.g. \cite[1–9]{jolliffe2002}), at the cost of losing some information from the initial raw data. However, dimensionality reduction necessarily involves choices about which aspects of the data are preserved and which are discarded, and this selective compression underlies concerns about the representations that \gls{genai} models construct, since they are grounded in a reduced and fundamentally \textit{latent} reality (see Chapter~\ref{sec:latency}).

\begin{figure}
	\begin{center}
		\includegraphics[width=0.7\textwidth]{images/dimensionality_reduction300.png}
	\end{center}
	\caption{Dimensionality Reduction via Principal Component Analysis, Image
		Reconstruction out of 20 Principal Components, and Feature Importance
		Visualisation using Olivetti Faces Dataset (dataset:
		\cite{attlaboratoriescambridge2005}, implementation: author's own
		work, see Annex~\ref{cha:dimensionality_reduction}.)
	}\label{fig:dimensionality_reduction}
	\forcerectofloat
\end{figure}

Dimensionality reduction might be hard to visualise in the case of text data,
but image recognition models often deliver better insight into the operation.
See the example in Figure~\ref{fig:dimensionality_reduction}:
it shows snapshots from the training of a simple image recognition model on the Olivetti Faces dataset (see \cite[]{attlaboratoriescambridge2005}; a collection of standardised, grayscale portraits of 40 individuals). On the first row, there are portraits of different subjects in the dataset. On the second row, we see five random principal components obtained from the dimensionality reduction operation via \gls{pca}\sidenote{On the mathematical level, these correspond to the \textit{eigenvectors} of the sample covariance matrix, which is proportional to $X^TX$ when the data matrix $X$ is mean-centred. The first eigenvector points in the direction of maximal variance, each subsequent one points in the direction of maximal remaining variance orthogonal to those before it, and the associated eigenvalue measures the amount of variance captured along that direction.}, which can be thought of as the building blocks the model uses to (re)construct faces in a more compact representation. In the third row of the figure, we see the same five faces from the first row, but reconstructed using 20 principal components extracted in the \gls{pca} process.\sidenote{This reconstruction through 20 principal components corresponds to an explained variance of roughly 70\% (see Annex~\ref{cha:dimensionality_reduction}), meaning that the majority of the dataset’s information content is retained even after compression.}
The reconstruction is shaped by the stronger features across the dataset: the new faces blend the traits that make faces distinctive, while highly individual features survive only insofar as they represent strong divergences in the dataset as a whole. See, for example, how the reconstructed images also contain features from other faces; one of the most distinctive examples of this is that all of the reconstructed images now feature some resemblance to glasses. The reconstruction is a reimagination of faces by using the most distinctive aspects of all of the faces.
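
As a minimal sketch of the operation behind Figure~\ref{fig:dimensionality_reduction}, with notation introduced here purely for illustration, let \( \mathbf{x} \) be one image flattened into a vector of \( d \) pixel values, \( \mathbf{m} \) the mean image of the dataset, and \( W_k \) the \( d \times k \) matrix whose columns are the \( k \) leading principal components. Compression and reconstruction can then be written as
\[
	\mathbf{z} = W_k^{\top}(\mathbf{x} - \mathbf{m}), \qquad \hat{\mathbf{x}} = \mathbf{m} + W_k \mathbf{z},
\]
with \( k = 20 \) in the example above. If \( \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d \) denote the eigenvalues of the sample covariance matrix, the share of variance retained by the first \( k \) components is
\[
	\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{d} \lambda_j}.
\]
Since every reconstructed face is the mean face plus a weighted sum of the same \( k \) components, traits that are strong across the dataset can reappear in faces that never carried them.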

Finally, the last visualisation displays the locations of the most important features for the model on each pixel, with lighter pixels indicating higher importance. As one can observe by looking at the lightest pixels, these are the most emphasised parts of the reconstructed images while still being the parts where some \textit{ghost} features (like the outlines of the glasses example) blend the most.
The model is therefore more likely to preserve and emphasise those features that play a distinctive role across samples. In other words, the registration of the data is transformed through a reconstruction guided by these selected features. Although this demonstration is based on images, the same logic underlies dimensionality reduction in \gls{genai} contexts. When applied to text, the \enquote{faces} become words and contexts, and the principal components become latent dimensions of meaning. Just as the Olivetti reconstructions compress facial features into a tractable subspace, \glspl{llm} compress linguistic variation into latent vectors, privileging what is most statistically distinctive while likely discarding subtle or marginal patterns. This illustrates how latent representations, whether of faces or words, are always a reduced lens on reality; efficient and powerful, but partial.


The logic of dimensionality reduction illustrates how high-dimensional data can be compressed into tractable latent spaces that retain the most distinctive features of a dataset. In the domain of sequence modelling, this logic was taken up in early neural architectures such as \glspl{rnn} and \glspl{cnn}, which typically followed an encoder–decoder design. These architectures operationalised the principle of latent representation: the encoder compressed an input sequence into a continuous vector space, and the decoder expanded this representation into an output sequence.\sidenote{In its simplest form, the encoder
	takes an input sequence of symbols \( (x_1, \dots, x_n) \) and transforms them
	into a sequence of continuous vector representations
	\( \mathbf{z} = (z_1, \dots, z_n) \). These vectors encode the relevant
	information from the input. The decoder then generates an output sequence
	\( (y_1, \dots, y_m) \) one step at a time. It is \emph{auto-regressive},
	meaning it uses previously generated outputs (e.g. \( y_1, y_2, \dots \)) as
	input when generating the next \gls{token}. This setup allows the model to generate
	coherent and context-sensitive output, building each element of the sequence
	in a structured, history-aware manner \parencite[2]{vaswani2017a}.}
\glspl{rnn} processed \glspl{token} sequentially, passing information
through hidden states that decayed over distance, which made capturing
long-range dependencies difficult. While they could build semantic connections
reasonably well, they failed to construct robust models of language regardless of the training scale. \glspl{cnn}, while more parallelisable, were constrained by \gls{kernel} sizes and fixed receptive fields. Both designs struggled with tasks requiring global relational awareness of a sequence. In order to both build the long-distance relationships layered over huge datasets, and internalise the play between dimensionality reduction and reconstruction on a multi-processed surface, something much more powerful was needed.
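
In symbols, and as a simplified sketch whose notation is assumed here for illustration, a recurrent encoder updates a hidden state \( h_t \) from the previous state and the current input, while the auto-regressive decoder factorises the probability of the output sequence one \gls{token} at a time:
\[
	h_t = f(h_{t-1}, x_t), \qquad p(y_1, \dots, y_m \mid x_1, \dots, x_n) = \prod_{t=1}^{m} p(y_t \mid y_1, \dots, y_{t-1}, \mathbf{z}).
\]
Because \( h_t \) depends on \( x_1 \) only through \( t-1 \) nested applications of \( f \), information from early \glspl{token} must survive every intermediate step, which is one way of stating the long-range dependency problem.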

\subsection{Transformative Attention and Signs without Signification}\label{sec:transformer}

The Transformer architecture
marked a decisive
break from the sequential bottlenecks. Dispensing with recurrence and
localised convolution, it introduced \emph{self-attention} as a
mechanism for computing contextual representations. In simple terms, self-attention
allows the model to decide which parts of the input are most relevant to each other when producing an output. In a single operation, every
\gls{token} in the input sequence attends to all others, producing
weighted combinations of contextually relevant elements \parencite[4]{vaswani2017a}. In their groundbreaking paper, \citetitle{vaswani2017a}, \textcite{vaswani2017a} proposed a new architecture that preserved the encoder–decoder structure but eliminated reliance on recurrence and convolution. Instead, the Transformer model relied entirely on attention mechanisms, not as a supplementary feature but as the foundation of both the encoder and the decoder (\cite[1–2]{vaswani2017a}; see Figure~\ref{fig:attention} for an illustration). This architectural shift allowed for highly parallelised computation, better modelling of long-range dependencies, and significant improvements in scalability. The Transformer has since become the cornerstone of contemporary \gls{genai}, enabling many of the recent breakthroughs in large-scale language modelling and generative systems. The architecture is built from stacked encoder and decoder layers, each composed of multi-head self-attention and pointwise feed-forward networks. These attention heads act as differentiated channels through which the model adjusts its internal representations, integrating multiple semantic and syntactic perspectives concurrently. Instead of treating \glspl{token} as isolated or sequential entities, attention turns the entire sequence into a site of mutual interaction, where each \gls{token} is redefined in relation to all others. By eliminating recurrence and convolution in favour of attention, the Transformer
achieved two decisive outcomes: first, it enabled much more comprehensive and
effective training on vast datasets; second, it allowed the model to capture long-distance
connections and complex contextual relations with unprecedented efficiency,
thereby overcoming the failure of previous \gls{nn} architectures to produce a representation capable of capturing the essence of the vast datasets on which they were trained.
These properties form the \emph{technical substrate} upon which modern
\gls{genai} and \glspl{llm} are built, leading to text-to-text models like
ChatGPT, as a result of developments in \gls{nlp}, and to text-to-image models like
Midjourney, as a result of developments in image understanding and computer vision (see
\cite[2]{ploennigs2023}).

Conceptually, the Transformer establishes a \emph{global field of relation}, where each \gls{token} is encoded not in isolation or rigid sequence but through its distributed relevance to all others. This process builds on the algebraic nature of the tokenised dataset embedded in the high-dimensional \textit{vector space} (the feature space introduced above), where semantic and syntactic relationships are captured as measurable distances. The architecture thereby creates a form of synchronic awareness: the presence of every other word is embedded within the representation of each word. The high-dimensional \textit{feature space} encodes tokens as points separated by specific distances (as a metric space), turning both the position of a \gls{token} and its relation with other \glspl{token} into numerical values, making it possible to perform relational operations such as $\mathit{king} - \mathit{man} + \mathit{woman} \approx \mathit{queen}$ \parencite[]{aig2025}.\sidenote{See a demonstration of this well-known operation in the \textit{GloVe}
	word vector space \cite[]{pennington2014}, presented in
	Annex~\ref{cha:word_embedding}.
	The operation yields a cosine similarity of $0.861$ (very high) for the terms
	on both sides of the equation.} Similarly, a polysemous word such as
\textit{bank} can be shifted towards different meanings: if the position is shifted in the direction of finance, its neighbourhood becomes populated with \glspl{token} such as \textit{securities, banking, investment, credit}, whereas if shifted towards the position of \textit{river}, its neighbourhood changes to \textit{flows, shore, stream, along}, and so forth (see Annex~\ref{cha:word_embedding} for a demonstration). The same mechanism, however, can also embed and even amplify specific biases in the data. For instance, embeddings may position professions such as \textit{director, officer, policymaker, programmer} closer to male-related \glspl{token}, while \textit{hairstylist, receptionist, nurse, veterinarian} are drawn towards female-related tokens. Likewise, scientific terms tend to cluster more closely with male-related tokens, whereas terms linked to the arts are positioned nearer to female-related ones (see Annex~\ref{cha:word_embedding} for the code and examples).
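
The relational operation described above can be made concrete with a toy case. The two-dimensional vectors below are hypothetical and chosen only for illustration (real embeddings have hundreds of dimensions and are learned from data); the first coordinate loosely stands for royalty, the second for gender:
\[
	\mathbf{v}_{\mathrm{king}} = (1, 1), \quad \mathbf{v}_{\mathrm{man}} = (0, 1), \quad \mathbf{v}_{\mathrm{woman}} = (0, -1), \quad \mathbf{v}_{\mathrm{queen}} = (1, -1).
\]
Then
\[
	\mathbf{v}_{\mathrm{king}} - \mathbf{v}_{\mathrm{man}} + \mathbf{v}_{\mathrm{woman}} = (1, 1) - (0, 1) + (0, -1) = (1, -1) = \mathbf{v}_{\mathrm{queen}}.
\]
Proximity between two vectors is commonly measured with cosine similarity,
\[
	\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|},
\]
which equals \( 1 \) in this toy case because the two sides coincide exactly. In a learned space the match is only approximate, hence a value below \( 1 \) such as the one reported for \textit{GloVe}.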

This reconfiguration of relationality is the basis of the
efficiency, scalability, and generative fluency that define modern \glspl{llm}\sidenote{Although we are focusing specifically on \glspl{llm} here, the
	transformer architecture is also embedded in other \gls{genai} models such as
	\textit{text-to-image} generators.}.
Modern architectures, however, go well beyond this initial plane of
departure. One of the most important aspects that gave transformer-based
systems their edge over anything else in terms of relationality was how
\textit{attention} was utilised.
The \emph{attention mechanism} is mainly responsible for improving the interaction between
input and output, allowing the model to
dynamically focus on the most relevant parts of the input sequence while
generating each \gls{token}. Attention computes a set of weights over the input representations, effectively answering the question: \enquote{Which parts of the input matter most for predicting the next output?}\sidenote{Technically, self-attention calculates relationships between \glspl{token} by projecting them into \emph{query}, \emph{key}, and \emph{value} vectors. These are used to compute attention weights through dot-product similarity and softmax normalisation.} Each token’s final representation is thus a weighted blend of all other tokens, adjusted by their contextual relevance. Through multiple stacked layers and attention heads (multi-head attention), the transformer operates on different planes of relevance. For example, while one head may attend to the single most relevant token in the input, others simultaneously track secondary relations or longer-range dependencies. In this way, multiple assessments of relevance are carried out at once, both within the input sequence itself and between the input and the model’s internalised representation of the whole training data (see e.g. \cite[]{merritt2022}). As Vaswani puts it \parencite*[]{merritt2022}, \enquote{meaning is a result of relationships between things, and self-attention is a general way of learning relationships.} Probabilistic modelling, together with these long-distance relational adjustments, then governs how the network moves through representational space to predict the next output (see \cite[198]{montanari2025}). Moreover, the multi-head parallelisation of attention attached to both encoder and decoder processes (see Figure~\ref{fig:attention} for the official illustration) \enquote{allows the model to jointly attend to information from different representation subspaces at different positions} \parencite[4]{vaswani2017a}.
The transformer model, so to speak, has \enquote{\textit{radicalised} the use of attention in sequence-to-sequence language modelling, dispensing entirely with recurrence and convolution in favour of an ensemble of attention mechanisms} \parencite[6]{amoore2024}.
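
The query, key, and value projections can be stated compactly. As a sketch, with the symbol \( X \) introduced here for illustration, let the \gls{token} representations of an input sequence be stacked as the rows of a matrix \( X \). Self-attention forms \( Q = XW^{Q} \), \( K = XW^{K} \), and \( V = XW^{V} \) with learned weight matrices, and computes (cf. \cite[4]{vaswani2017a})
\[
	\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
\]
where \( d_k \) is the dimension of the key vectors and the softmax is applied row by row, so that each row of weights is non-negative and sums to one. Read in this way, row \( i \) of the output is a weighted average of all value vectors, weighted by how strongly query \( i \) matches each key. Multi-head attention runs several such operations in parallel on differently projected copies of the input and concatenates the results.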


\begin{marginfigure}
	\includegraphics[width=\textwidth]{images/attention.png}
	\caption{The original Transformer Architecture with built-in Multi-Head Attention Mechanism in Encoder and Decoder Processes (cf. \cite[3]{vaswani2017a}) }
	\label{fig:attention}
\end{marginfigure}


Maas \parencite*{maas2023} associates this novel operational structure of the Transformers with Derrida's concept of \textit{trace} (see e.g. \cite[26]{derrida1998}). Derrida's concept is an advancement of Ferdinand de Saussure's linguistic theory of \textit{signifiers} and \textit{signifieds} (see e.g. \cite*{saussure2007}) through his own concept of \textit{différance}, in which the emphasis shifts to the context-dependency of words and their differentiation from each other. For instance, the colour \textit{red} is defined through its differentiation from \textit{green} and \textit{blue} without having any actual substance of its own \parencite[9]{maas2023}. \enquote{The sign has no component that belongs to itself only; it is merely a collection of the traces of every other sign running through it} \parencite[44]{cilliers2002}. All signs are in continuous relationship with other signs: the position of a word within the current network of connected signs, together with the differences\sidenote{In terms of word embeddings in \glspl{llm}, we can interpret these differences as distances, since distances in the network represent the model's way of encoding differences between concepts.} of those signs from it, establishes its substance. Yet the substance or \textit{meaning} of the sign is temporally dependent, because the specific arrangement of words, as well as the differentiation between them, is in constant flux, \enquote{in a dynamic process of combination and referencing} \parencite[44]{cilliers2002}, dependent on the current context\sidenote{I will be referencing this temporal formation as an \textit{instance} from now on, since the context-dependency of the network narrows meaning into one instance of the connectional structure.}.
Similarly, in the operation of \glspl{llm}, this spectral interdependence, where tokens are mutually inscribed into one another, suggests a structure in which meaning is always already haunted by the rest of the utterance (see \cite[12]{maas2023}) in the sense of Derrida's \textit{trace}. The \textit{meaning} of words in \glspl{llm} is defined by overlapping distributions: the distribution within the sentence, the distribution the model renders across the whole dataset, and other dynamic mechanisms regulated by the Transformer core. All of these work on the layers of signification, forming the traces that run through the specific word (\gls{token}) on which the attention mechanism is focusing.

\citeauthor{montanari2025} \parencite*{montanari2025} draws a direct connection between the cognitive functions mimicked by Transformer architectures and the cultural implications of \gls{genai} models. The ability to construct relationships between concepts that are distant from one another is precisely what enables \glspl{llm} to understand and articulate metaphors\sidenote{For a short and engaging discussion of this capability, see \citeauthor{heerden2024} \parencite*[]{heerden2024}, who trained a simpler form of \gls{llm} that nonetheless succeeded in grasping and generating poetic metaphors in a low-resource language such as Afrikaans, using only a fraction of the text data available for English.}.

\begin{quote}
	[T]ransformer models, which exemplify the interplay between metaphor and function. Transformers [...] simulate certain structures and functions of the human brain, excelling at processing sequential data such as words in a sentence or notes in a melody. The transformative innovation within Transformers is the \enquote{attention mechanism,} which enables the model to focus selectively on the most relevant parts of the input sequence. This mechanism is pivotal for discerning complex relationships and dependencies within data. [...] multi-head attention mechanism, a key feature that captures diverse aspects of an input sequence simultaneously. This dual role of technical objects – functionally specific and mythically resonant – reveals their broader cultural impact. Technical metaphors, often catachrestic and hybridised, solidify not only the utility but also the mystique and credibility of AI systems.

	\citereset
	— \cite[206]{montanari2025}
\end{quote}

\citeauthor{montanari2025}’s analogy between metaphor and function illustrates how the distinctive capacities of transformer architectures become visible at the level of their outputs. By design, transformers are highly efficient translation machines. One of the most prominent challenges in \gls{nlp} that the transformer architecture immediately rendered trivial was language translation. Yet this capacity extends beyond linguistic translation: the same mechanism of associating distributions across data allows for effective cross-modal mappings, such as text-to-speech or text-to-image generation. From the perspective of meaning-making, the production of sense in transformer-based models can be understood as a continual translation, moving between stratified elements and overarching concepts, where meaning emerges fluidly from the situated application of traces within each exchange. \citeauthor{aig2025a} \parencite*[]{aig2025a} points to \gls{dg}’s notion of \enquote{double articulation} as a way of theorising this machinery. The concept of double articulation in \gls{dg}'s theory describes how structures are formed on two surfaces of production: a molecular articulation, where raw flows of matter, energy, or desire are segmented, and a molar articulation, where these segments are organised into larger social, linguistic, or institutional forms. For example, in language, sounds (molecular) are articulated into words and grammar (molar). This shows how every stratum, from biology to society, emerges by combining micro-processes with macro-organisation (see e.g. Chapter~3 in \cite[]{deleuze1987}).

The Transformer operates simultaneously on two strata: a molecular level of local attention, where data is tokenised and neural activations are formalised into specific connection patterns that correspond to certain concepts, clusters in the feature space, or relations of distances and neighbourhoods; and a molar level, where these are aggregated into larger representations and models capable of generation. These molar structures regulate flows of input and output, steering and shaping responses (see \cite[]{aig2025a}). In the language of \gls{dg}, every input sentence first undergoes a process of deterritorialisation\sidenote{As partly introduced in Chapter~\ref{cha:control}, the concepts of de- and reterritorialisation capture the way systems detach from established arrangements and connections, opening the possibility of being reconfigured along novel trajectories. See Chapter~\ref{cha:conjunctive} for a further discussion of these themes.}, where its components are broken down, only to be reterritorialised according to the molar aggregates the model has constructed in order to generate a response. \citeauthor{aig2025} provides an illustrative example of this process:

\begin{quote}
	Consider the example of processing the sentence \enquote{She is a scientist. She conducted an experiment}:

	1. Each token [in input] (``she,'' ``is,'' ``scientist,'' etc.) is first converted into a
	distributed representation (embedding vector).

	2. In the self-attention mechanism, each token calculates its ``relevance'' to
	all other tokens.

	3. For example, the second ``she'' has strong relevance to the first ``she,''
	``understanding'' that they refer to the same person.

	4. This ``understanding'' does not arise from centrally controlled rules but
	emerges from molecular interactions among countless parameters.

	\medskip

	In this process, the calculation of ``relevance'' (molecular process) and
	the understanding of the entire sentence’s meaning (molar structure) occur
	simultaneously. This is not simple hierarchical processing but a constant
	interaction between local computations and global meaning structures.

	\citereset
	— \cite{aig2025}
\end{quote}

Furthermore, each of the \glspl{token} is also in relation with others in the
feature space (see the \gls{token} value examples above): the \enquote{she} in the
sentence is affected by how \enquote{she} is positioned in the feature
space, and vice versa. The Transformer thus embodies a form of double articulation in machinic sense-making that extends beyond its internal core, across many different connections and layers. Attention mechanisms enact selective intensities across the tokenised field, instantiating meaning not as fixed symbols but as
weighted relationalities. These differential proximities constitute a \emph{diagrammatic space}, where meaning emerges through modulation rather than rule-based inference. On one side, meaning is fluid and continually adjusted through local token interactions; on the other, this fluidity is anchored in molar distributions extracted from the entire dataset. Attention weights thus instantiate the selective intensities that bind the micro-variations of input to macro-level patterns of representation. Yet this double articulation of meaning does not end with the attention mechanism itself. It is carried further into the training process, where the modulation of connections is made possible by \textit{gradient descent} and \textit{backpropagation}, which iteratively recalibrate the network’s parameters to stabilise these diagrammatic fields of relation.

\subsection{Sinking into the Manifold: Gradient Descent and Backpropagation}\label{sec:gradient}

While the Transformer architecture introduces previously unseen connective capacities for building relevance between distinct concepts in data, other \gls{ai} methodologies play a pivotal role in solidifying the structures that emerge in the process. Most optimisation methods in \gls{ml} are grounded in differential calculus, with derivatives (gradients) indicating how model behaviour should be adjusted. \Glspl{loss} are critical for assessing how well a model performs on given data and are typically chosen to enable efficient optimisation. Simply put, a \gls{loss} function calculates the \textit{difference} between the delivered outcome and the desired outcome. Taken across all possible parameter settings, the outcomes of the \gls{loss} define a surface, a manifold of values, which the model explores as it runs through training cycles (\glspl{epoch}) (see Figure~\ref{fig:gradient}
for a visualisation of such a surface and gradient descent’s steps on it). \textbf{Gradient descent} is the method that traverses this manifold, systematically updating parameters in search of minima (or, in the case of gradient ascent, maxima) on the manifold’s surface. In practice, this amounts to the model seeking results that are as close as possible to the expected outputs.\sidenote{Formally, for a differentiable loss function \( L(\theta) \), the update rule is:
	\[
		\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)
	\]
	where \( \theta \) represents the model parameters, \( \eta \) is the learning rate, and \( \nabla L(\theta_t) \) is the gradient of the loss function with respect to the parameters at iteration \( t \) \parencite{tarmoun2024}. Think of it as assessing the steepness of the surface from a random starting point and repeatedly moving in the direction of the steepest downward slope towards the bottom (where the \textit{average} \gls{loss} value is smallest) by updating the weights of the connections in the \gls{nn}.}
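
A small worked case may make the update rule tangible. The loss function and the numbers are assumed purely for illustration: take a single parameter with \( L(\theta) = \theta^2 \), so that \( \nabla L(\theta) = 2\theta \), a learning rate \( \eta = 0.1 \), and a starting point \( \theta_0 = 1 \). Then
\[
	\theta_1 = 1 - 0.1 \cdot 2 = 0.8, \qquad \theta_2 = 0.8 - 0.1 \cdot 1.6 = 0.64, \qquad \theta_3 = 0.64 - 0.1 \cdot 1.28 = 0.512.
\]
Each step multiplies the parameter by \( 1 - 2\eta = 0.8 \), so the iterates approach the minimum at \( \theta = 0 \) in ever smaller steps without reaching it in finitely many updates.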

Within \gls{nn} and \gls{dl} applications, this process unfolds on a massive scale, where countless parameters are iteratively tuned to reduce error and refine performance (see \cite[97]{mackenzie2017}). Gradient descent is a fundamental optimisation algorithm used to train \glspl{nn} by updating parameters in the direction that reduces the loss function. If we visualise the outcomes of the \gls{loss} evaluations as a manifold, a surface with ups and downs,\sidenote{With ups being where the model in training performed the worst, hence the \gls{loss} is high, and downs being where the model was more precise.} we can think about gradient descent as trying to take steps down to the lowest part (a local minimum) of the manifold, much like \textit{taking steps down a hill} (see Figure~\ref{fig:gradient}). As the cycles (\glspl{epoch}) pass, gradient descent adapts the model in the direction of lower and lower outcomes of the \gls{loss} until the outcomes converge. Gradient descent is the procedure that minimises the error between predictions and expected outputs by adjusting the weight of the stronger options (see \cite[100]{mackenzie2017}). It is a way for a neural network to reach towards the stronger and more prominent prediction instead of getting stuck among similarly good answers whenever the number of possible candidates for prediction is high. To illustrate how gradient descent works in practice, consider a model trying to distinguish between handwritten digits, such as \enquote{6} and \enquote{8}. At the beginning of training, the model’s predictions are almost random. After seeing one example of a \enquote{6} misclassified as an \enquote{8}, the algorithm computes how much each parameter (e.g., a weight in the network) contributed to the error. Gradient descent then updates these parameters slightly in the direction that would have made the prediction more accurate. 
This process repeats for many examples, gradually strengthening the connections in the \gls{nn} that lead to the correct outcome and thereby reducing the overall error. Through these repetitions (\glspl{epoch}), the model slowly emphasises what makes different examples most distinct and exaggerates those differences.
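
How much each parameter \enquote{contributed to the error} is computed with the chain rule, which backpropagation applies layer by layer. As a minimal sketch under assumed notation, consider a single neuron with input \( x \), weight \( w \), bias \( b \), sigmoid activation \( \sigma \), prediction \( \hat{y} = \sigma(wx + b) \), target \( y \), and squared-error loss \( L = \frac{1}{2}(\hat{y} - y)^2 \). Multiplying the three local derivatives \( \partial L / \partial \hat{y} = \hat{y} - y \), \( \sigma'(u) = \sigma(u)\,(1 - \sigma(u)) \), and \( \partial (wx + b) / \partial w = x \) gives
\[
	\frac{\partial L}{\partial w} = (\hat{y} - y)\,\hat{y}\,(1 - \hat{y})\,x,
\]
and the update \( w \leftarrow w - \eta \, \partial L / \partial w \) moves the prediction towards the target. In a multilayer network the same product of local derivatives is extended backwards through every layer.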

\begin{marginfigure}
	\includegraphics[width=\textwidth]{images/gradient_descent.png}
	\caption{Non-convex optimisation: Utilisation of gradient descent to find a
		local optimum on a loss/cost manifold (cf. \cite[3]{amini2018})}
	\label{fig:gradient}
\end{marginfigure}

Rather than a simple algorithmic mechanism, gradient descent can be interpreted as an expression of difference-in-repetition in the Deleuzian sense: each pass through the data does not reproduce identical results but introduces micro-variations that progressively reshape the model’s internal parameters. The model does not approach a universal form; it acquires an operational sensitivity to local singularities distributed across the dataset. Through repeated exposure over many \glspl{epoch}, differences accumulate: each adjustment is almost imperceptible on its own, yet taken together they carve out patterns that make further pattern-recognition possible. The model does not begin with a pre-given \textit{model}; it derives one through its iterative engagement with data.\sidenote{However, it should not be forgotten that this learning is completely bound to the scope of the data. An \gls{llm}, for example, is entirely confined to the language it has been exposed to.} A trained model that appears to \enquote{know} an image of a tree, for instance, has not encoded a definition but has undergone enough transformations to resonate with distributed features constituting \enquote{treeness} across the dataset. This is not epistemology in the classical representational sense, but a diagrammatic form of learning: one that forms through modulation and intensity rather than classification and identity. Gradient descent, in this framework, appears not as descent toward a pre-defined minimum but as an ongoing negotiation across a surface of potentials, a diagrammatic inscription of learning as continuous variation. \citeauthor{delanda2011} \parencite*[89-90]{delanda2011} draws a similar conclusion by describing gradient descent’s role as learning from experience:

\begin{quote}
	We need a design consisting of two multilayer perceptrons, one to generate a non-symbolic representation of the unconditioned stimulus and the other to generate one of the conditioned stimulus. The first neural net plays the role of an inherited reflex so its configuration of weights must be rigidly fixed as if it had been found through evolutionary search, while the second one must be able to learn from experience, that is, the weights of its connections must be found through gradient descent. Finally, the hidden units of each neural net should be connected to each other laterally in such a way that their non-symbolic representations can interact with one another.

	\citereset
	— \cite[89-90]{delanda2011}
\end{quote}

Multilayer structures in \gls{nn} models are the essence of non-symbolic representation in \gls{ai} systems. For the layers to communicate with each other, some functionality has to \textit{decide} which way to go in order to become more precise and not fall into a paralysis of indifference. Gradient descent fills exactly this role; yet one has to ask whether what is communicated by gradient descent necessarily approaches a \textit{correct answer} or simply amplifies whatever seems to be the stronger or clearer argument. If the latter is the case, how does the model register this specific feedback and adapt itself to it? How does difference in repetition emerge if the process reacts linearly to the input with an output (input $\rightarrow$ output)? How are the non-symbolic patterns in the layers of the \gls{nn} that \citeauthor{delanda2011} mentions updated (input $\leftrightarrow$ output)? If the previous perspective emphasised how gradient descent inscribes learning as continuous modulation, \citeauthor{mackenzie2017} extends this line of thought by showing how optimisation can give rise to entirely new regimes of meaning and practice. He defines this process as the implementation of a \textit{new model truth}:

\begin{quote}
	New kinds of realities arise in which the classifications and  predictions generated by the diagonal connections between mathematical functions and operational processes of optimization can constitute a \enquote{new model truth} and can unmake \enquote{preceding realities and significations.} Despite my deliberately narrow focus  on a single set of relays that connect linear models, the logistic function, the cost function, and gradient ascent[\textit{or descent}], hundreds and perhaps hundreds of thousands of \enquote{points of emergence} associated with this diagram of functioning.

	\citereset
	— \cite[101]{mackenzie2017}
\end{quote}

The endless \enquote{points of emergence}, and the ability of the model to be
steered in a vast number of ways, as \citeauthor{mackenzie2017} \parencite*[99-105]{mackenzie2017}
mentions, are made possible by the addition of different \gls{dl} building blocks.
An especially effective one of them is \textbf{backpropagation}, which plays a pivotal role in consolidating the
operation of gradient descent. In early forms of \gls{symai} (or \gls{gofai}),
the process of inference followed a rigid \textit{forward propagation} model.
Logical rules, handcrafted by programmers, operated on symbolically encoded inputs to produce outputs through a chain of deductive reasoning steps going \textit{forward} across the layers of \glspl{neuron}. Following the questions above, \textit{forward propagation} operates well if the \textit{truth} is already known and if it is clear which kinds of outputs the model should produce. The limitations of \gls{gofai} became increasingly apparent in tasks involving ambiguity, noise, or vast data spaces, domains where human cognition thrives not by rule-following but by plastic, adaptive learning, as discussed in Section~\ref{sec:ai_history}. Backpropagation plays a pivotal role in changing the course of \gls{nn} systems by allowing networks to \emph{learn} from error. Rather than only pushing activations forward, as in \gls{gofai}, backpropagation pushes \emph{errors backwards} (see Figure~\ref{fig:backpropagation} for a simple illustration) through the network to update internal parameters and improve future predictions.\sidenote{Formally, the weight update rule in backpropagation is given by:
	\[
		w^{\text{new}} = w^{\text{old}} - \eta \frac{\partial E}{\partial w}
	\]
	where \( \eta \) is the learning rate and \( \frac{\partial E}{\partial w} \) is the partial derivative of the error function \( E \) with respect to the weight \( w \) \parencite{hecht-nielsen1992}. This formulation ensures that each parameter is updated in proportion to how much it contributed to the error. \textcite{hecht-nielsen1992} describes backpropagation as a paradigm-shifting method for approximating functions \( f: \mathbb{R}^n \to \mathbb{R}^m \) using layered neural structures. Unlike Hebbian learning, which depends on co-activation, backpropagation relies on the explicit transmission of error signals. These signals traverse the network in reverse order, enabling a distributed form of learning where each parameter is tuned with respect to its role in the total output error.}
However, the two mechanisms divide the work between them: gradient descent prescribes how a weight should change once the gradient of the error with respect to that weight is known, whereas backpropagation supplies those gradients beyond the output layer by carrying the error signal back to the previous layers. Going back layer by layer, the process reaches every weight in the network, so that the whole network is adjusted towards the outcome favoured by the gradient descent process. Backpropagation thus functions as a bidirectional mechanism: during the \textit{forward pass}, inputs are transformed into outputs through successive layers; during the \textit{backward pass}, the discrepancy between the prediction and the target is used to adjust the weights in a way that gradually minimises this error.
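
The backward pass can be sketched on the smallest possible case. Assume, purely for illustration, a network with one input \( x \), one hidden unit \( h = \sigma(w_1 x) \) with activation function \( \sigma \), one output \( \hat{y} = w_2 h \), and the squared error \( E = \frac{1}{2}(\hat{y} - y)^2 \) for a target \( y \); none of these choices is taken from the sources cited here. The chain rule then gives
\[
	\frac{\partial E}{\partial w_2} = (\hat{y} - y)\, h,
	\qquad
	\frac{\partial E}{\partial w_1} = (\hat{y} - y)\, w_2\, \sigma'(w_1 x)\, x .
\]
The factor \( (\hat{y} - y) \) computed at the output reappears in the gradient of the earlier weight \( w_1 \), scaled by the connection \( w_2 \) through which it travels back. Each weight is then changed by \( -\eta \, \partial E / \partial w \), with \( \eta \) the learning rate, so a weight in an early layer is adjusted according to the error it helped to produce at the output.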


\begin{marginfigure}
	\includegraphics[width=\textwidth]{images/backpropagation.png}
	\caption{A simple illustration of how backpropagation updates the neurons
		among the layers of a \gls{nn} in a backwards manner (cf.
		\cite{3blue1brown2017})  }
	\label{fig:backpropagation}
\end{marginfigure}


Backpropagation gears the system
towards being radically feedback-oriented.
Together with gradient descent, it establishes an early process of reterritorialisation: stronger patterns in the data are reinforced, while the entire network adjusts around these emerging tendencies. This leads to more
precise answers for concrete tasks, such as the example above of recognising a
handwritten digit. We can say that while gradient descent is responsible
for making the stronger distributions or arguments more apparent,
backpropagation is responsible for updating the entire network in relation to
those strong arguments. Yet this raises an important theoretical question: what follows from a learning paradigm that continually amplifies patterns already given greater weight by the data, especially in meaning-making processes?
Attention mechanisms in transformer architectures extend this dynamic. They enable the model to link distant features within the data and to form associations that are not restricted to local proximity. Through successive rounds of prediction and adjustment, the network converges on outputs that appear convincing without relying on any predefined notion of correctness. Feedback on these outputs is propagated backwards, refining the network by strengthening the connections that proved effective. Although many additional components intervene in large models \parencite{mackenzie2017}, the essential elements presented here show how learning unfolds through continual binding and unbinding of patterns. Local interactions between individual neurons crystallise into higher-level structures that guide prediction, and these structures are repeatedly revised as feedback circulates. In this sense, processes akin to de- and reterritorialisation are enacted within the technical substrate itself, shaping how the model stabilises distinctions, relations, and meanings.

\subsection{Body without Neurons: Fitting \& Tuning}\label{sec:fine-tuning}

The mechanisms of gradient descent and backpropagation are powerful tools for shaping a network’s internal structure, but they bring to the fore a tension that has accompanied the whole history of \gls{ml}: the balance between underfitting and overfitting. Overfitting occurs when a model is too tightly bound to its training data (capturing noise and idiosyncrasies along with signal) and thus fails to generalise to new examples. The model becomes so tightly optimised to its training data that it cannot adapt to inputs that do not closely resemble the training cases. In that regime, it essentially \enquote{memorises} statistical associations rather than producing abstract generalisations. Underfitting, by contrast, happens when the model is too constrained or too simple to capture the meaningful patterns in the data; it performs poorly even on the training set. This tension is often analysed through the bias–variance tradeoff: models with high variance tend to overfit, while models with high bias tend to underfit \parencite[see][]{avati2019}. In \glspl{nn}, which typically have high capacity, the risk of overfitting is especially pronounced. To mitigate that risk, practitioners use regularisation methods. One well-known technique is dropout \parencite[see][]{srivastava2014}, which randomly deactivates neurons during training so that units do not co-adapt excessively. This method has been shown to improve generalisation across vision, speech recognition, and text tasks. Dropout thus acts as a check on overfitting, forcing the network to maintain flexibility and preventing collapse into brittle, overly specific pathways.
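
Both ideas can be stated compactly; the notation below is a common textbook formulation and an assumption of this sketch, not a quotation from the works cited. For data generated as \( y = f(x) + \varepsilon \) with zero-mean noise of variance \( \sigma^2 \), the expected squared error of a learned predictor \( \hat{f} \) at a point \( x \) decomposes as
\[
	\mathbb{E}\big[(y - \hat{f}(x))^2\big]
	= \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2
	+ \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]
	+ \sigma^2 ,
\]
that is, squared bias plus variance plus irreducible noise, where the expectation runs over training sets and noise. Dropout can be read as an intervention on the variance term: during training, each hidden activation \( h_i \) is multiplied by a random mask,
\[
	\tilde{h}_i = m_i \, h_i, \qquad m_i \sim \mathrm{Bernoulli}(p),
\]
where \( p \) is the probability of retaining a unit. With \( p = 0.5 \), for example, roughly half of the units are silenced in each training step, so that no pathway can rely on the presence of any particular other unit.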

In theoretical terms, overfitting may be read as a kind of sedimentation \parencite[see][14]{rijos2024}, where meaning is layered rigidly into entrenched pathways that suppress variation. The model’s representational surface becomes ossified, reducing its potential for novelty. Dropout and similar interventions act as gestures of desedimentation; they rupture hardened pathways and preserve an openness to difference. Underfitting, by contrast, might be imagined as a refusal to territorialise structure at all: too porous, too unformed, and therefore unable to stabilise meaningful relations. Mechanisms such as gradient descent and backpropagation are effective at optimising a model into a specific structure; the risk is always that of embedding it too deeply in its foundation. Overfitting, in this sense, resembles the psychic intensification of repression: a becoming-too-organised. The network loses access to variation and begins to loop within captured redundancies \parencite[]{srivastava2014}. Within this context, \gls{dg}’s concept of
\gls{bwo} \parencite[see][]{deleuze1983}  becomes analytically useful. The \gls{bwo} designates a surface of
immanence that resists stratification, function, or stable identity. It is not
chaos, but a zone of potentiality that counters rigid organisation. But at the
same time, the \gls{bwo} of a social organisation defines how the productive
forces build their connections, how they interact like a mass with
gravitational pull affecting the socius. \Gls{dg} deliver the most practical example of this often misunderstood concept by elaborating on the role of capital under capitalism:

\begin{quote}
	Capital is the BwO of capitalist or of the capitalist being. Machines and
	agents of production seem to be \emph{miraculated} by it, they cling to it
	closely, they orbit around its gravitational pull. Everything seems as if it
	was immediately produced by capital. At the beginning the relation between
	the productive forces and capital, the opposition between the labour forces
	and capital are apparent, as well as the use of capital to extort surplus
	value. But as capital plays the role of the \emph{recording surface} of
	production (recording surface because the very production itself is defined
	by its terms), it \emph{falls back on} all production, becomes a mystic being
	since all labour's social productive forces appear to be due to capital,
	rather than labour itself as the core of production, and seem to issue from
	the very womb of Capital itself; thus the fetish is established
	\parencite[10]{deleuze1983}.
\end{quote}


Just as capital functions as a
\textit{recording surface} that absorbs and reorganises all production, as
discussed in the previous sections, contemporary \gls{genai} models tend
to create molar formations in their productive systems, whose
\textit{gravitational pull} continually affects the whole
productive process. However, it has also been discussed that the training of the models involves a constant dynamic in which molecular formations break down and re-form into new molar aggregates.
In a similar fashion, when a \gls{nn} overfits, the statistical
associations it produces appear to emerge directly from the training data, as
if they were self-evident truths that get solidified in a process of
recording, inscription, and reorganisation, shaped by gradient descent and
backpropagation. It becomes true memorisation instead of learning.
The productive tension between constraint and openness mirrors \gls{dg}’s view of creative generation as a differential process, one that emerges from the continual negotiation of limits instead of from their absence. Thus, dropout and regularisation need not be viewed merely as technical tricks; they can be understood as micro-strategies of desiring-modulation, machinic interventions that resist the ossification of the model’s internal landscape, preserving its capacity to mutate and adapt.
Overfitting, in this reading, becomes a kind of excessive clinging to the \gls{bwo}: the model orbits too closely around a flattened plane of inscription, reinforcing the strata of its own training surface until all variation is collapsed into overemphasised pathways.

This tension does not end at optimisation. The processes we have seen so far are associated with the training phase; contemporary \gls{genai} systems undergo an immense \textit{fine-tuning} process after \textit{pre-training}. In pre-training, models are exposed to enormous corpora of unlabelled text, predicting masked or subsequent \glspl{token} to build statistical representations of the data’s substance. Pre-training can be read as a process of continual de- and reterritorialisation, where data flows are broken down into molecular components and reassembled into provisional molar aggregates through countless repetitions across \glspl{epoch}. Fine-tuning, by contrast, is a process of pure reterritorialisation, using methods such as \gls{rlhf}, a technique in which human judgements are used to guide model behaviour. Concretely, RLHF works by training a reward model from human preferences (for example, humans rating or ranking model outputs) and then using reinforcement learning algorithms to adjust the pretrained model so that its outputs increasingly align with those human-derived rewards \parencite[see][]{bai2022}. In this phase, the model’s outputs are sculpted to align with human-defined norms, values, or tasks \parencite[964]{dishon2024}. Certain behaviours are amplified, others suppressed, not by statistical extrapolation but by normative or task-based criteria imposed directly by human agents. What begins as a relatively open structure of statistical potentials becomes constrained and legible: the model is tuned to act in ways deemed acceptable or desirable within a social or domain context. This method has been shown to drastically improve a model’s \textit{usefulness}, making it more helpful, reducing its tendency to respond to harmful requests, and increasing resilience to \textit{jailbreaking} attacks (see \cite[5]{bai2022}; and for a discussion about jailbreaking, see Section~\ref{sec:jailbreak}). Yet there are trade-offs. 
Intensive \gls{rlhf} can render models manipulable or sycophantic. Recent \glspl{llm} increasingly show a tendency to become overly polite, uncritically supportive, or constantly affirmative even in the face of obvious user errors, which is one of the downsides of intensive fine-tuning (see \cite{sharma2025}, and further discussion in Section~\ref{sec:sixhats}). This tendency also points towards an overwhelming drive to (over)personalise outputs in the model’s attempt to appeal to the user.
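
The two steps of \gls{rlhf} can be written schematically. The following is a formulation commonly found in the literature and is offered as a sketch under stated assumptions; the symbols \( r_\phi \), \( \pi_\theta \), \( \pi_{\mathrm{ref}} \), and \( \beta \) are introduced here for illustration, and individual systems differ in detail. Given a prompt \( x \) and two outputs of which annotators preferred \( y^{+} \) over \( y^{-} \), the reward model \( r_\phi \) is trained to minimise
\[
	\mathcal{L}(\phi) = -\,\mathbb{E}\Big[\log \sigma\big(r_\phi(x, y^{+}) - r_\phi(x, y^{-})\big)\Big],
\]
where \( \sigma \) is the logistic function. The language model \( \pi_\theta \) is then adjusted to maximise
\[
	\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big]
	- \beta \, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big),
\]
where \( \pi_{\mathrm{ref}} \) is the pretrained model and \( \beta \) sets how far the tuned model may drift from it. The first term pulls outputs towards whatever the annotators rewarded; the second tethers the model to its pre-trained distribution.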

The widespread claim that \glspl{llm} merely
\enquote{predict the next word} is a related misconception. While this description captures their formal training
objective, it drastically understates what these models are doing. As
\textcite{dalvi2025} argues, \glspl{llm} are more accurately described as
token-emitting agents trained under multiple objectives, with next-token
prediction forming only the foundation. Instruction fine-tuning and
\gls{rlhf} build upon this basis by directing outputs according to human
preferences and task-specific norms. Although prediction remains the
mechanism, the goal changes: words are selected to maximise alignment with
reward signals rather than simply to continue a sequence. Dalvi compares this
to a chess engine, which does not merely select the statistically most common
move but chooses actions that maximise the likelihood of winning in context.
What looks like a linear continuation is therefore the result of a complex
representational process shaped by both statistical learning and normative
inscription. As \textcite[5]{amoore2024} notes, \enquote{predicting the next token in a
sequence affords a capacity beyond the sequence itself: an understanding of the
whole structure of the underlying text}. The metaphor of a next-word predictor
therefore conceals more than it reveals, reducing a complex diagrammatic
operation to a trivial procedure.
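
The formal training objective referred to above is usually written as follows; the notation is a standard one and is assumed here, not taken from the cited texts. For a sequence of \glspl{token} \( x_1, \dots, x_T \), pre-training minimises the negative log-likelihood
\[
	\mathcal{L}(\theta) = - \sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1}),
\]
so that each token is predicted from everything that precedes it. Because the conditional probability depends on the whole preceding context, lowering this loss requires the model to register regularities that extend well beyond adjacent words.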

In this trajectory, from expansive pre-training to targeted fine-tuning, we see the same dialectic as in under- and overfitting: sedimentation and desedimentation, ossification and rupture, openness and closure. Both stages reveal how optimisation in \gls{genai} is not a purely technical process but is bound to questions of which forms of meaning are allowed to solidify and which remain open to variation; the personalising tendency noted above is one symptom of this. Arguably, this trajectory, from expansive, indeterminate modelling to focused, value-laden calibration, marks a shift in the way meaning is operationalised. In pre-training, the model functions as a medium for representing statistical potentials; in fine-tuning, it is moulded into an instrument of specific sense-making, its capacities directed toward particular purposes. In sum, fine-tuning via \gls{rlhf} can be read as a second, more authoritarian phase of reterritorialisation: it binds the model tightly to the norms of its controlling agents (designers, annotators, institutions). Pre-training grants a provisional openness; fine-tuning forecloses much of it, determining which flows are permitted to persist.

\section{Chapter 3 Summary}

Chapter 3 analysed the historical and technical development of contemporary \gls{ai} in order to understand how generative models participate in governing information. It traced the shift from symbolic reasoning to statistical and connectionist approaches, showing how \glspl{nn} and \gls{dl} architectures replaced fixed rules with distributed representations learned from data. Early deployments of these systems in search engines, ranking algorithms, and recommender platforms illustrated how profiling, feedback loops, and behavioural steering established the foundations of algorithmic governance.

The chapter then examined the mechanisms that distinguish \gls{genai} models, including feature spaces, dimensionality reduction, attention, gradient descent, and backpropagation, and showed how the transformer architecture fundamentally changed the capabilities of these models. These processes construct associations, stabilise patterns, and recalibrate internal configurations across iterative cycles. Rather than serving as neutral tools, contemporary architectures shape how meaning is produced and circulated, enabling models to participate in narrative formation and interpretation.