An Intuitive Logic for Understanding Autoregressive Language Models
Transformer-based language models have shown a stunning range of capabilities but largely remain black boxes. Understanding these models is hard because they employ complex non-linear interactions in densely connected layers and operate in high-dimensional spaces. In this article, we address the interpretability of large autoregressive language models with a principled approach inspired by basic logic. First, we show that classical mathematical logic does not capture the reasoning of these models, and we propose an intuitive logic, which is notably asymmetric and redefines the classical logical operators. We then localize the activated areas associated with conjunction, disjunction, negation, adversative conjunctions, and conditional constructions.
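The abstract does not spell out the localization procedure, so the sketch below is only one plausible reading of it, written against the Hugging Face `transformers` API: it records GPT-2 XL's per-layer hidden states at the position of each logical connective ("and", "or", "not", "but", "if") and reports where that token stands out most relative to the other tokens in the sentence. The probe sentences, the `operator_salience` helper, and the peak-norm heuristic are illustrative assumptions, not the paper's actual method.

```python
import torch
from transformers import GPT2TokenizerFast, GPT2Model

# "gpt2-xl" matches the model named in the abstract; swap in "gpt2" for a quick smoke test.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2Model.from_pretrained("gpt2-xl", output_hidden_states=True)
model.eval()

# Illustrative probe sentences, one per logical construction.
PROBES = {
    "conjunction": ("and", "The door was open and the light was on."),
    "disjunction": ("or",  "The door was open or the light was on."),
    "negation":    ("not", "The door was not open."),
    "adversative": ("but", "The door was open but the light was off."),
    "conditional": ("if",  "The light was on if the door was open."),
}

def operator_salience(sentence, operator_word):
    """Per-layer ratio of the operator token's activation norm to the mean
    token norm: a crude 'where does this operator stand out' score."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # GPT-2 BPE encodes mid-sentence words with a leading space.
    op_id = tokenizer.encode(" " + operator_word)[0]
    pos = (enc["input_ids"][0] == op_id).nonzero()[0].item()
    scores = []
    for h in out.hidden_states:  # (n_layers + 1) tensors of shape [1, seq, d_model]
        scores.append((h[0, pos].norm() / h[0].norm(dim=-1).mean()).item())
    return scores

for name, (word, sentence) in PROBES.items():
    scores = operator_salience(sentence, word)
    peak = max(range(len(scores)), key=scores.__getitem__)
    print(f"{name:12s} most salient at layer {peak} (0 = embeddings)")
```

Comparing the five salience profiles against each other, rather than reading any single peak in isolation, is what would let such a probe suggest distinct activated areas per operator.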
From the localization results, we obtain important topological information about the network, which leads us to formulate a conjecture about the mechanisms underlying the intuitive logic in GPT-2 XL. We test the conjecture through model editing and conclude by laying the foundations for a connectomics of GPT.
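The editing experiment is likewise not specified here. As a hedged illustration of how such a test might look, the following sketch zero-ablates the MLP output of a single layer using a standard PyTorch forward hook and compares the model's next-token distribution before and after the edit. The layer index `LAYER`, the prompt, and the probe token are placeholders for demonstration only; the paper's localized layers are not given in this abstract.

```python
import torch
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

LAYER = 20  # placeholder index, not a layer identified by the paper

def next_token_probs(prompt):
    """Probability distribution over the next token for a prompt."""
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0, -1]
    return torch.softmax(logits, dim=-1)

def zero_mlp(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output,
    # so this ablates the chosen layer's MLP contribution entirely.
    return torch.zeros_like(output)

prompt = "The door was open and the light was"
probe = tokenizer.encode(" on")[0]

before = next_token_probs(prompt)
handle = model.transformer.h[LAYER].mlp.register_forward_hook(zero_mlp)
after = next_token_probs(prompt)
handle.remove()  # restore the unedited model

print(f"P(' on') before: {before[probe]:.4f}  after ablation: {after[probe]:.4f}")
```

If a conjecture ties a logical operator to particular layers, an edit of this kind should selectively degrade continuations that depend on that operator while leaving unrelated prompts largely unchanged.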