Untangling homonym representations in BERT, Part 2: Phenomenology

In this post, I’ll use the tools from the previous post to examine BERT-Base, an open-source deep transformer model that performs well on natural language processing benchmarks (see the first link). We will find that in deeper layers, this model has learned to represent homonyms in such a way that their two senses can be easily discriminated. This post covers just the phenomenology: to what extent are the representations of two homonyms disentangled from each other in each layer? In the next post, we will dive into how the representations come to be disentangled.
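
As a concrete illustration of the setup, here is a minimal sketch of how one might extract per-layer representations of a homonym token from BERT-Base, assuming the HuggingFace `transformers` library. The sentences and the choice of the homonym "bank" are my own illustrative examples, not necessarily the data used in the post.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load BERT-Base and ask it to return the hidden states of every layer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# Two sentences using the homonym "bank" in its two senses.
sentences = [
    "She deposited the check at the bank this morning.",
    "They had a picnic on the bank of the river.",
]

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        outputs = model(**inputs)
        # hidden_states is a tuple: the embedding layer plus one tensor per
        # transformer layer, each of shape (batch, sequence_length, 768).
        hidden_states = outputs.hidden_states
        # Locate the position of "bank" in the tokenized input.
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        position = tokens.index("bank")
        # Collect the representation of "bank" at every layer.
        per_layer = [layer[0, position] for layer in hidden_states]
        print(f"{sentence!r}: {len(per_layer)} vectors of dim {per_layer[0].shape[0]}")
```

Collecting these per-layer vectors across many sentences for each sense gives two point clouds per layer, which is what the disentanglement analysis operates on.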

Untangling homonym representations in BERT, Part 1: Measuring untangling

My larger goal is to explain the circuits by which a transformer trained for natural language processing parses ambiguous language, in particular homonyms. First, it is important to think about what it might mean to parse a word whose definition is ambiguous. Two words with the same spelling are necessarily similar in their input representation. If they are to be processed as having one of two distinct meanings (for example, based on their surrounding context), then across the layers of the network, this one cluster of words might come to be represented as two separate clusters. If this model is correct, representations of two instances of a homonym should share a cluster in a later layer of the network if and only if they share the same underlying definition. But what should it mean to share a cluster? In this post, I explain two related metrics by which we can evaluate how differently two sets of points are represented. To build intuition, we first look at a more visual example: parsing images based on what objects are in them.
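
The excerpt doesn’t spell out the two metrics here, so as a stand-in illustration, here is a sketch of two simple ways one might quantify how separated two sets of points are: a Fisher-style ratio of between-cluster distance to within-cluster spread, and the cross-validated accuracy of a linear probe. The function names and the choice of metrics are mine, not necessarily those the post uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def separation_ratio(a: np.ndarray, b: np.ndarray) -> float:
    """Distance between cluster means divided by mean within-cluster spread.

    a, b: arrays of shape (n_points, dim), one per putative cluster.
    Larger values mean the two clusters are more cleanly separated.
    """
    between = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    within = 0.5 * (
        np.linalg.norm(a - a.mean(axis=0), axis=1).mean()
        + np.linalg.norm(b - b.mean(axis=0), axis=1).mean()
    )
    return between / within

def probe_accuracy(a: np.ndarray, b: np.ndarray) -> float:
    """Cross-validated accuracy of a linear classifier telling a from b."""
    X = np.vstack([a, b])
    y = np.concatenate([np.zeros(len(a)), np.ones(len(b))])
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Example: two Gaussian clouds standing in for the two senses of a homonym.
rng = np.random.default_rng(0)
sense_1 = rng.normal(0.0, 1.0, size=(100, 768))
sense_2 = rng.normal(0.5, 1.0, size=(100, 768))
print(separation_ratio(sense_1, sense_2), probe_accuracy(sense_1, sense_2))
```

Applying metrics like these layer by layer is what turns the clustering intuition above into a quantitative picture of untangling.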

Untangling homonym representations in BERT

I’m a systems neuroscientist by training, just getting my feet wet in studying the circuit mechanisms by which transformers represent sequences of inputs in a way that supports “understanding,” as measured by NLP task benchmarks, excited tweets, etc.
