GPT is only half of the AI language revolution

Don't be dissuaded from AI on the premise that content generation is all it can offer.

Even if you somehow managed to avoid the talk of OpenAI and GPT longer than the rest of the world, by now every one of us has been met with a flood of content from ChatGPT or generative image models like Stable Diffusion. These creations range from amazing to weird to concerning. And if your default stance is to doubt the staying power of such content, that counts as healthy skepticism.

But while the leap in generative capabilities stirs up a buzz of novelties almost every week, content generation is only the visible surface of a broader revolution in neural network capabilities, and its applications are not restricted to flooding the world with more cheap content.

On the contrary, the key neural network innovation most directly responsible for the recent surge – the transformer – has developed from its origins along two related but divergent paths. And just as its GPT side can lead us towards a world of increased noise, the other half is where you may find the most potential for applications that cut through the problem of information overload.

So before you write off the possibilities of AI for your own domain, step away from the gloss of GPT and take a look at the other half of this technology, which grows from the same root innovations. As you'll see below, that is where the construction of a successful AI-driven application begins, even if something like GPT ends up as one part of your final pipeline.

A brief history of the transformer

In 2017, a group of Google researchers published a machine translation paper called "Attention is all you need," which could easily have been mistaken for a minor incremental improvement. Reading the carefully moderated claims across its fewer than a dozen pages today, it seems likely that the authors did not anticipate the revolution it would inaugurate.

Without getting lost in the details, the paper proposed a new kind of translation architecture to supersede prior state-of-the-art techniques. By that time, machine translation models invariably followed the structure of having an encoder and a decoder, where the encoder takes the source language text and deeply encodes it into a series of vector representations, while the decoder unwinds that dense context into a generated text in the target language.

Or, in more functional terms, a translation encoder would condense the meaning of a source sentence into a compressed space like a wound ball of yarn, while the decoder would unwind that ball by pulling out a thread in the target language one word at a time, until the whole content is exhausted.
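
You can still see this full encoder-decoder arrangement at work through the Hugging Face transformers library. The sketch below is only an illustration of the pattern, and the model name is just one publicly available translation model, not anything referenced in this article:

```python
from transformers import pipeline

# A full encoder-decoder translation model: the encoder condenses the English
# sentence into dense vectors, and the decoder unwinds them into French.
translate = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
print(translate("The meeting notes are stored in our team workspace."))
# e.g. [{'translation_text': '...'}]
```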

An encoder weaves information like a tight ball of yarn, whereas a decoder unravels that information. Image generated with AI.

The new transformer model retained this concept of an encoder and a decoder, but introduced a different way of propagating information across each of these halves. Instead of the standard recurrent networks, which slide across a text one word at a time and build a running context memory piecewise (a design with serious limitations for longer sequences), the authors substituted an attention mechanism by which every single word in the text can query every other word for information in an asymmetric manner. A word projects itself onto a kind of map in one way, projects all the other words in a complementary way, sees where they line up, and then extracts content from the other words according to this map.
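
As a rough illustration of that idea (a minimal sketch, not the paper's full multi-head implementation), scaled dot-product attention boils down to a few lines of NumPy:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each row of Q asks a question, each row of K describes a word, and the
    # resulting weights decide how much of each row of V to extract.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the whole sequence
    return weights @ V

# Toy example: 4 "words", each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# In a real transformer, Q, K and V are separate learned projections of x,
# and many attention heads and layers are stacked on top of one another.
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (4, 8): every word now carries context from every other word
```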

Since these attention mappings happen in parallel across the whole sequence, they have no boundaries and can connect concepts across the entire length of the source text, up to the model's designed size. By piling many layers of attention on top of one another, information paths are woven through the text so that the representation of any single word ends up receiving long strands of contextualized meaning.

Both halves of the translation network use the attention pattern, keeping the roles described above where the encoder condenses meaning while the decoder unwinds it into a new sequence. In practice, this approach quickly proved to solve long-standing problems with translation, both in the representation of input text and the generation of new sentences.  

Splitting the halves - a tale of two transformers

Once separated from the machine translation task and applied to general language modeling tasks, transformer networks no longer need both an encoder and a decoder, and in fact either half can be trained as its own model in isolation using any available large corpus of text, without the need for labels or parallel sentences.

The encoder is primarily trained by masking words in the middle of content, with the training goal of producing deep enough connections across the words that the missing words can be inferred. The decoder, on the other hand, is trained by threading forward across a text and only predicting the next word. With the advent of distributed computation, a team with the resources of OpenAI can train either of these models on billions of texts without supervision.
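
You can see the two training objectives side by side with the Hugging Face transformers library. This is just a quick sketch; the model names are common public checkpoints chosen for illustration:

```python
from transformers import pipeline

# Encoder-style objective: recover a masked word from its full two-sided context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The meeting notes are stored in our team [MASK]."))

# Decoder-style objective: keep predicting the next word from left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("The meeting notes are stored in", max_new_tokens=10))
```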

Of course, GPT (Generative Pre-trained Transformer) is only a decoder, so its sole capability is to take any sequence and keep pulling it forward, weaving all the contextual strands of its words into the next likely completion. It does this to great effect, since comprehending a long sequence already implies an implicit knowledge of whatever concepts it contains.

But an encoder trained alone is just as powerful for different purposes. BERT (Bidirectional Encoder Representations from Transformers) and its many variations (RoBERTa, etc.) are the flagship models on the encoder side. Instead of producing more text, these networks deeply embed a given body of content into a vector representation. The result is not more text but a vector in an embedding space, which can be used for many different kinds of downstream tasks.

BERT vs. GPT: not better, just different.

Why serious AI applications should start with encoders

Most of the uses of GPT today revolve around little more than "prompt engineering," which is a glorified term for wrapping the model with a bit of prepended text to give it direction for a task. Even in paid products that announce AI-powered features, this can often be the case. We have seen workspace and writing tools announce GPT integrations that, in the end, are little more than plugging prompts into the model and pushing the output into the existing application.

We didn't want to follow that path at Slite, because we regard the more pressing challenge to be the navigation and clarification of too much textual content, rather than the generation of even more. Our users already bring more than enough text to the table, but they face the natural problem of organization and navigation. So we set out to build a feature for live question-answering across entire knowledge bases. Even though any kind of automated Q&A will immediately call to mind services like ChatGPT, the reality is that generative models could only serve as one small part of the pipeline.

Why is this the case, when services like ChatGPT appear to be complete and fully versatile across any domain? Simply put, the vast baked-in knowledge that generative models have accumulated across billions of texts is your biggest enemy when applying this kind of tech to a strictly confined domain like a client's documents. What you want is for AI to power insights into a very specific problem or set of texts--not for it to ramble on with the general background knowledge that it implicitly draws upon at every moment. No amount of mere prompt-engineering can alter this situation.

When your AI model has access to too much knowledge, it can be hard to find the thing you need most.

The consequence is that even once we decided to use GPT to generate the final readable summary response to a question, the pivotal problem remained putting constraints on that generation by feeding just the right information into it. Without a clear, isolated context free from excessive noise, the output gets injected with all kinds of confident-sounding but entirely incorrect statements.

What does it mean in practice to build an AI-driven pipeline that can intelligently process data, in such a way that GPT is left only as a small final component? Thinking back to the origins of transformers in translation tasks, we need to rely heavily on encoders before letting a generative decoder roll out a response. Otherwise, it may pull that reply out of thin air.

Encoding transformers

BERT and similar models use a transformer to take in text and output an embedding (or a classification, which would be derived from embeddings). The simplest way to understand embeddings is to see them as high-level representations of similarity relationships, where each vector is a point in space.

For instance, when a product like Spotify recommends related artists to you, they are searching in an embedding space for points that are close to the artist that you are listening to. Given that it is possible to combine, add, subtract, or average different vectors, additional information on artists you also like or even dislike may be used to alter the location of that search.
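
That kind of lookup amounts to simple vector arithmetic. Here is a minimal sketch with made-up four-dimensional vectors standing in for artist embeddings (real embedding spaces have hundreds of components):

```python
import numpy as np

def cosine(a, b):
    # Similarity between two points in the embedding space.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Imaginary embeddings for three artists.
artists = {
    "artist_a": np.array([0.9, 0.1, 0.0, 0.2]),
    "artist_b": np.array([0.8, 0.2, 0.1, 0.1]),
    "artist_c": np.array([0.0, 0.9, 0.8, 0.1]),
}

# A taste profile can be the average of several liked artists' vectors.
profile = (artists["artist_a"] + artists["artist_b"]) / 2

ranked = sorted(artists, key=lambda name: cosine(profile, artists[name]), reverse=True)
print(ranked)  # the closest points in the space are the best recommendations
```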

In Spotify's algorithm, music goes in, spatial coordinates go out

Since vector embeddings in language tasks now typically have well over 500 components, the range of emergent similarities and relationships that this high-dimensional space can represent between items is vast. The key choice with transformer encoders is to select (or later, fine-tune) a model that represents texts in an embedding space in a way that suits your purposes. Most of the default models you will encounter are trained for general use, so they may not draw the relationships you actually need.

For a relevant example, when trying to pair user-submitted questions with potentially relevant documents in a knowledge base, you will not get great results from a typical model, because questions and statements are very different uses of language: their dissimilarities outweigh their relationships in the embedding space. However, you can readily find well-trained models built for the specific task of aligning questions and statements, which encode in such a way that close spatial relationships emerge between questions and the sentences that might answer them.
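
With the Sentence Transformers library mentioned below, using such a model is only a few lines. The model name here is one publicly available example of a question-passage retrieval model, not necessarily the one we use:

```python
from sentence_transformers import SentenceTransformer, util

# A model trained specifically to place questions near the passages that answer them.
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

question = "How do I request time off?"
passages = [
    "Vacation requests are submitted through the HR portal two weeks in advance.",
    "Our brand colors are navy and off-white.",
]

q_emb = model.encode(question, convert_to_tensor=True)
p_emb = model.encode(passages, convert_to_tensor=True)

scores = util.cos_sim(q_emb, p_emb)[0]
best = int(scores.argmax())
print(passages[best], float(scores[best]))  # the passage closest to the question
```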

One of the engineering advantages of vector embeddings is that you can store them outside of the model and then query this space efficiently for similarity, using a variety of available tools–and this will undoubtedly form the core part of any real-world application.
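
One common way to do this is a dedicated vector index. As a sketch only, here is how stored embeddings can be queried with FAISS (one of several available tools; any vector store follows the same pattern):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

documents = [
    "Vacation requests are submitted through the HR portal two weeks in advance.",
    "Our brand colors are navy and off-white.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

# Store the vectors outside the model; inner product on normalized vectors
# is cosine similarity.
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(np.asarray(doc_vectors, dtype="float32"))

query = model.encode(["How do I request time off?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 2)
print(ids[0], scores[0])  # the closest stored passages and their similarities
```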

The capabilities of encoders go well beyond external queries, however. Drawing upon the advantage of transformers to track complex interrelations across words and sentences within the model, you can use a cross-encoding approach to send two passages or pieces of text through the model at the same time. Instead of scoring an overall similarity, this can evaluate the precise kind of relationship between the two passages in a more fine-grained way. To take one common comparison task, a model can judge whether the content and claims of one sentence imply, contradict, or are neutral in relation to another.
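
Sentence Transformers exposes this through its CrossEncoder class. The model name below is one published NLI cross-encoder chosen for illustration, and the label order is an assumption to verify against its model card:

```python
from sentence_transformers import CrossEncoder

# A cross-encoder reads both passages through the model at once, so it can judge
# the relationship between them rather than just an overall similarity.
model = CrossEncoder("cross-encoder/nli-deberta-v3-base")

premise = "Vacation requests must be submitted two weeks in advance."
hypothesis = "You can take time off without telling anyone beforehand."

scores = model.predict([(premise, hypothesis)])
labels = ["contradiction", "entailment", "neutral"]  # assumed order; check the model card
print(dict(zip(labels, scores[0])))
```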

In comparison to mere prompt engineering, or to relying only on a GPT-like model to handle your task, the additional control you gain by using encoders to meaningfully filter or evaluate existing content is dramatic. The simple rule is that a generative transformer is only as good as the content you feed into it at any given moment, so for something like a Q&A system, the steps you take before GPT to extract and filter information from your documents will carry far more weight for the quality of your output.

Applying encoder technology to a team knowledge base

For Slite's Ask feature–which allows users to put any question to their entire knowledge base–we built a language-processing pipeline that relies on all of the encoder capabilities above. The result is that we search, refine, and process text long before we submit anything to a generative model. The role of that generative model is therefore tightly constrained: it simply compresses a short summary out of the exact excerpts of content that we have already determined to be likely answers to the question. And thanks to the encoder pipeline's careful filtering and scoring of text excerpts, we are then able to cite the sources used for the answer with confidence, or even reject GPT's answer if it could not have been derived from the given input.
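
To make the shape of such a pipeline concrete, here is a simplified, hypothetical sketch, not Slite's actual implementation. The components (`vector_index`, `bi_encoder`, `cross_encoder`, `call_llm`) stand in for the pieces discussed above:

```python
# Hypothetical encoder-first Q&A pipeline: retrieve, re-score, then (and only
# then) let a generative model summarize the filtered excerpts.

def answer_question(question, vector_index, bi_encoder, cross_encoder, call_llm):
    # 1. Embed the question and retrieve candidate passages from the external index.
    q_vec = bi_encoder.encode(question)
    candidates = vector_index.search(q_vec, top_k=20)

    # 2. Re-score each (question, passage) pair with a cross-encoder and keep the best.
    scored = cross_encoder.predict([(question, p.text) for p in candidates])
    ranked = sorted(zip(candidates, scored), key=lambda pair: pair[1], reverse=True)
    top_passages = [p for p, score in ranked[:3] if score > 0.5]
    if not top_passages:
        return None  # nothing relevant: better to refuse than to let the model guess

    # 3. Hand only the filtered excerpts to the generative model, constrained to
    #    summarize them, and keep the passages so the answer can cite its sources.
    context = "\n".join(p.text for p in top_passages)
    answer = call_llm(
        f"Answer the question using only this context:\n{context}\n\nQ: {question}"
    )
    return answer, top_passages
```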

Ask searches, synthesizes, and cites sources, all from your team knowledge base.

A couple of notable open-source tools that we relied upon and recommend:

  • Sentence Transformers - an excellent, all-inclusive Python library for using BERT-like models to embed or compare texts
  • Hugging Face - an indispensable resource for searching, downloading, or hosting transformer-based models

The key takeaway: machine learning is not just generation

The same innovations that led to impressive services like ChatGPT also power generative models in other domains, including images, music, video, and speech. The potential uses for these abilities in new services and products have still barely been tapped.

In all of these cases, however, the task when planning a working use case is to look beyond content creation alone: think carefully about what kind of data you are working with, and use the encoding capabilities of AI to filter, categorize, or otherwise manipulate your domain.
