Transformers: More than meets the GenAI

2025-08-11


GenAI

LLMs! Roll out! If you didn’t yearn for an article about LLMs that doubles as a throwback to cartoons from the 1980s, this article isn’t for you. For the rest of you, welcome. I couldn’t believe the phrase “more than meets the GenAI” wasn’t on the interwebs yet! And yes, I still write many of my article titles without the help of GenAI. Which leads me to our real topic: transformers.

As computer scientists, we’ve been obsessed with human-like interaction for decades. The first chatbots date back to the 1960s with ELIZA, a rule-based program. Approaches improved over time, moving from statistical methods to text analytics and, eventually, to Recurrent Neural Networks (RNNs), a type of neural network designed to recognize sequences and patterns in data.

RNNs were particularly useful when the preceding information was necessary for predicting future outputs. The core challenge of these otherwise remarkable networks is that RNNs process each token from first to last, meaning they can’t operate in parallel. That lack of parallelization significantly limits an RNN’s ability to crunch massive amounts of data and produce multiple predictions. In 1997, Long Short-Term Memory (LSTM) networks eased the memory limitations, giving AI a boost in making sense of long sequences, but parallelization remained a challenge.
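To see why RNNs resist parallelization, here is a minimal sketch of a single-layer RNN forward pass in NumPy. The weights and token embeddings are made up for illustration; the point is the loop: each hidden state depends on the previous one, so step t cannot start until step t−1 finishes.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 4, 3
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
W_x = rng.normal(size=(hidden_size, embed_size))   # input-to-hidden weights

tokens = rng.normal(size=(5, embed_size))  # 5 toy token embeddings

# Sequential by construction: h at step t is a function of h at step t-1,
# so the loop cannot be parallelized across tokens.
h = np.zeros(hidden_size)
for x in tokens:
    h = np.tanh(W_h @ h + W_x @ x)

print(h.shape)  # one final hidden state after walking the whole sequence
```

The `tanh` keeps the state bounded, but information from early tokens still fades as it is squeezed through every intermediate step, which is exactly the memory problem LSTMs were designed to ease.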

In 2011, AI started to leap forward with the Google Brain team, which was leveraging big data for deep learning. In 2017, the team at Google published a paper titled “Attention Is All You Need”, introducing the LLM powerhouse: the transformer. A transformer is a deep learning model architecture that processes sequential data, such as natural language. In cartoon terms, it’s basically the energon cubes that power everything. The key to transformers is the “self-attention” mechanism, which allows the model to weigh the importance of different words in a sentence, and it has since become the foundation for most modern large language models (LLMs). The transformer’s other key innovation is its ability to process all parts of a sequence simultaneously, a major departure from earlier models like RNNs that processed data one token at a time. This parallelization makes training run at blazing speed and enables the models to capture relationships between words even when they are far apart in a sentence.
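Here is a minimal sketch of scaled dot-product self-attention in NumPy (sizes and weights are illustrative, and real transformers add multiple heads, masking, and learned projections). Unlike the RNN loop, every token’s output is computed from the whole sequence in one batch of matrix multiplies:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over an entire sequence at once."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Every token scores against every other token simultaneously.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of all positions, near or far.
    return weights @ V

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(6, d))  # 6 toy token embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # one updated vector per token, computed in parallel
```

Because the attention weights connect every position to every other position directly, a word at the start of a sentence can influence a word at the end without being relayed through dozens of intermediate hidden states.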

By 2018, the AI race was moving at the pace COVID would later spread! BERT from Google and GPT from OpenAI arrived, Hugging Face made models widely accessible, and Meta later followed with LLaMA. Each new generation of model pushed the number of parameters it crunches into the billions.

I know what you’re thinking: why didn’t everything get named after one of the Transformers from the cartoon? Well, I hate to tell you, but the people working on this didn’t have enough of a sense of humor, and unfortunately we’re left with Hugging Face, LangChain, Copilot, Claude, GPT, and a bunch of other terms I have yet to associate with 1980s cartoons. Maybe I’ll build my own library and call it ThunderCats, or a bot named Starscream.