Big, bigger, giant. The rise of giant AI models

The evolution of language models in NLP (Natural Language Processing) has led to huge leaps since about 2019, both in the accuracy of these models on specific tasks and in the number and scope of their capabilities. The GPT-2 and GPT-3 language models, released with much media attention by OpenAI, are now available for commercial use and have remarkable capabilities in type, scope and accuracy, which I will discuss in another blog post. In the case of GPT-3, this was achieved by training a model with 175 billion parameters on a data set of 570 GB. These are jaw-dropping values.

The larger the models, the higher the cost

However, the costs of training these models are also gigantic: taking only the stated compute costs 1 for a single complete training run, the total for training GPT-3 comes to about 10 million USD 2, 3. On top of that come costs for preliminary experiments, storage, and infrastructure for deployment, which are likely of a similar magnitude. Over the past few years the trend towards ever larger models has been remarkably consistent, adding roughly an order of magnitude per year, i.e. the models are about 10x larger than the year before.
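As a plausibility check, the back-of-the-envelope estimate from footnotes 1 and 2 can be reproduced in a few lines of Python; the FLOP count, the GPU throughput and the price per GPU-hour are the assumed values stated there, not measured figures:

    # Rough estimate of the GPT-3 training cost from the values in footnotes 1 and 2.
    total_flops = 3.14e23        # assumed total training compute for GPT-3 (FLOPs)
    gpu_flops_per_s = 7e12       # assumed sustained throughput of one NVIDIA V100 (7 TFLOPS)
    usd_per_gpu_hour = 1.0       # assumed cloud price per V100-hour

    gpu_hours = total_flops / gpu_flops_per_s / 3600
    cost_usd = gpu_hours * usd_per_gpu_hour
    print(f"{gpu_hours:.2e} GPU-hours -> approx. {cost_usd / 1e6:.0f} million USD")
    # -> 1.25e+07 GPU-hours -> approx. 12 million USD, i.e. on the order of 10 million USD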

Figure: Size of NLP models from 2018 to 2022. Parameter counts are plotted logarithmically in units of billions. The red line shows the average growth: roughly 10-20x larger models per year 2.

OpenAI's next model, GPT-4, is rumored to have about 100 trillion parameters (100 × 10^12). For comparison, the human brain has about 100 billion neurons (100 × 10^9), a factor of 1000 less. The theoretical justification for this gigantism comes from studies that show a clear scaling behavior between model size and performance 4. According to these studies the so-called loss, a measure of the prediction error of the models, falls off as a power law in model size: every tenfold increase in parameters reduces it by a roughly constant factor. However, this only works if the computing power and the amount of training data are scaled up as well.
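The power-law form reported in these studies can be written down in a few lines; the constants below are the approximate values quoted in the scaling-law study of footnote 4 and serve only to illustrate the trend, not as fitted results:

    # Illustrative sketch of the model-size scaling law L(N) = (N_c / N) ** alpha.
    N_C = 8.8e13    # approximate critical parameter count from the cited study
    ALPHA = 0.076   # approximate scaling exponent for model size

    def loss(n_params: float) -> float:
        """Predicted cross-entropy loss for a model with n_params parameters."""
        return (N_C / n_params) ** ALPHA

    for n in (1.5e9, 1.75e11, 1e14):   # GPT-2, GPT-3, a hypothetical 100-trillion model
        print(f"{n:.1e} parameters -> loss approx. {loss(n):.2f}")

Each additional factor of 10 in model size shrinks the predicted loss by the same constant factor, which is exactly why the race for ever larger models only pays off as long as data and compute grow along with it.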

In addition to the enormous amounts of energy required to train these models and the associated CO2 footprint, which is taking on worrying proportions, there are direct economic consequences: not only can smaller companies not afford the cost of training such models, but even larger corporations are likely to balk at costs of 10 million USD today, or 100 million USD and more in the future. Not to mention the infrastructure and staffing such an endeavor requires.

Monopoly position of the big players

This has a direct impact on availability: while models up to around the end of 2019 are open source and can be accessed freely via specialized providers, this no longer applies to the larger models from around the end of 2020 onwards (the appearance of GPT-3). OpenAI, for example, offers a commercial API and only grants access through an approval process. On the one hand this is convenient for developing applications with these NLP models, since the work of hosting and administration disappears; on the other hand the barrier to entry for competitors in this market is so high that essentially only the very largest AI companies take part: Microsoft with OpenAI, Google with DeepMind, and Alibaba.

The consequences of these monopoly positions of the leading AI companies are, as with every monopoly, take-it-or-leave-it pricing models and rigid business practices. Yet the capabilities of the current large language models such as GPT-3 and Megatron-Turing NLG are already so impressive that it is foreseeable that within 10 years practically every company will need access to such models for a wide variety of applications. Another problem is that the American or Chinese origin of the models introduces a strong bias: on the one hand this shows up clearly in the fact that English or Chinese is the language the models work best in; on the other hand the training datasets carry the cultural tendencies of these regions with them, so it is to be expected that other parts of the world will be underrepresented and continue to fall behind.

What can be done?

In my opinion it is important to keep a careful eye on this development and to play a more active role in shaping AI in Europe. In any case, a much greater effort is needed to avoid long-term dependence on monopolistic AI providers. One conceivable approach would be to involve national computing centers or research alliances that, together with companies, train and commercialize their own models and form a counterweight to the American and Chinese corporations. The next 10 years will be decisive here.

1 See here, section D, as well as compute costs per GPU, e.g. on Google Cloud approx. 1 USD/hour for an NVIDIA V100.
2 Calculation approach: one V100 delivers 7 TFLOPS = 7 × 10^12 FLOP/s; training GPT-3 requires about 3.14 × 10^23 FLOPs => 3.14 × 10^23 / (7 × 10^12) s ≈ 4.5 × 10^10 s ≈ 1.2 × 10^7 GPU-hours ≈ 10 million USD at 1 USD/hour. Details of the calculation and research of the parameters here.
3 See also here for a comparison graph with older data.
4 See arXiv and DeepMind.

What is Quantum Computing good for?

When it comes to quantum computing (QC), after the quite real breakthroughs in hardware and some spectacular announcements under headlines like "Quantum Supremacy", the usual hype cycle has set in, with a phase of vague and exaggerated expectations. I would like to briefly outline why the enormous effort in this area is being made and what realistic expectations lie behind it.

To understand the fundamental differences between QC and classical computing (CC), we first need to take a step back and ask what basis each computing paradigm operates on. For CC, the basis is the universal Turing machine, expressed in the ubiquitous von Neumann architecture. This may sound a bit outlandish, but in principle it is easy to understand: a universal Turing machine captures the fact that any algorithm that is (classically) expressible at all (Turing machine) can be programmed into a single classical computer (universal).

The vast majority of "algorithms" implemented in practice are simple sequences of actions that react to external events such as mouse clicks on a web page, transactions in a web store or messages from other computers on the network. A very small, but important, number of programs do what is generally associated with the word algorithm: perform arithmetic operations to solve a mathematical problem. The Turing machine is the natural mental model for programming these problems, and it is the reason programming languages have the constructs we are used to: loops, branches, elementary arithmetic operations, etc.

What is the computing paradigm for a quantum computer?

A quantum computer is built up of quantum states that can be entangled with each other and evolved via quantum gates. This also sounds a bit off the wall, but it simply means that a quantum computer is prepared in an initial (quantum) state, which evolves in time and is measured at the end. The paradigm for a quantum computer is therefore the Schrödinger equation, the fundamental equation of quantum mechanics. Even without understanding the details, it should be clear that everyday problems are difficult to squeeze into the formalism of quantum mechanics, and that this effort probably brings no benefit: quantum mechanics is simply not the appropriate mental model for most ("everyday") problems, nor is it more efficient at solving them.
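To make the "prepare a state, evolve it with gates, measure at the end" paradigm concrete, here is a minimal single-qubit sketch in plain NumPy; it assumes nothing beyond NumPy itself and is of course no substitute for a real quantum-computing framework:

    import numpy as np

    # The computing paradigm in miniature: prepare a state, apply a unitary gate, measure.
    ket0 = np.array([1.0, 0.0], dtype=complex)            # initial state |0>

    H = np.array([[1, 1],
                  [1, -1]], dtype=complex) / np.sqrt(2)   # Hadamard gate, one unitary evolution step

    state = H @ ket0                                      # evolve the state
    probabilities = np.abs(state) ** 2                    # Born rule: measurement probabilities
    outcome = np.random.choice([0, 1], p=probabilities)   # one simulated measurement

    print(probabilities, "-> measured:", outcome)         # [0.5 0.5] -> measured: 0 or 1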

So what can you do with it?

The answer is very simple: QC is essentially a method for computing quantum systems. This sounds almost redundant, but it means that a quantum computer is a universal machine for calculating quantum systems. This vision, formulated by Richard Feynman back in 1981, still guides the logic of research today. It is therefore not surprising that publications dealing with applications are located either in quantum chemistry or in fundamental physics research [5][6].

Why does this matter?

Because the classical computer is very inefficient at calculating or simulating quantum systems. This inefficiency stems from the mathematical structure of quantum mechanics itself and will not be removed by classical algorithms, no matter how good they become. Beyond basic research, QC is also likely to become important for the hardware of classical computers, where miniaturization is pushing the design of transistors on chips to the limits of classical theories of electricity.
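The inefficiency can be made tangible with a simple counting argument: a general state of n qubits requires 2^n complex amplitudes, so the memory of a naive classical simulator grows exponentially. A small sketch:

    # Memory needed to store a full n-qubit statevector, at 16 bytes per complex amplitude.
    BYTES_PER_AMPLITUDE = 16  # complex128

    for n_qubits in (10, 30, 50):
        n_bytes = (2 ** n_qubits) * BYTES_PER_AMPLITUDE
        print(f"{n_qubits} qubits: 2^{n_qubits} amplitudes -> {n_bytes / 2**30:.2e} GiB")
    # 10 qubits fit in a few kilobytes, 30 qubits already need about 16 GiB,
    # and 50 qubits need roughly 16 million GiB, far beyond any classical machine.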

Besides this, there are a number of interesting connections to number theory and various other problems, which for now can be classified as interesting curiosities. The connection to number theory alone could, however, have a significant impact: for historical reasons almost all practical asymmetric encryption schemes rely on the assumption (there is no proof) that prime factorization cannot be solved efficiently with classical algorithms. Quantum computers can in principle do exactly that (Shor's algorithm), but the hardware is still far from being able to do so at relevant scales.
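To see why factoring is treated as the hard problem here, it helps to look at the naive classical approach; its running time grows with the square root of N, i.e. exponentially in the number of digits (the best known classical algorithms are better, but still super-polynomial):

    def trial_division(n: int) -> list[int]:
        """Naive classical factoring: try every candidate divisor up to sqrt(n).
        The work grows exponentially with the number of digits of n, which is why
        this approach is hopeless for the ~600-digit moduli used in practice."""
        factors, d = [], 2
        while d * d <= n:
            while n % d == 0:
                factors.append(d)
                n //= d
            d += 1
        if n > 1:
            factors.append(n)
        return factors

    print(trial_division(101 * 103))   # -> [101, 103]; real RSA moduli are ~600 digits long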

AI – Where we are in the Hype Cycle and how it continues

While the Artificial Intelligence Index shows that the number of research articles and conferences in the field of AI continues to grow, the media are slowly showing some fatigue in the face of the hype. So it's time to take stock: What has been achieved? What is practically possible? And what is the way forward?

What has been achieved?

In 2018 and 2019 the previously developed methods for applying neural networks (this is how I define AI here) were further refined and perfected. Whereas the focus was initially on methods for image classification and processing (2012-2016, ImageNet competition) and then on audio (2015-2017, launch of Alexa and other voice assistants), 2019 brought major advances in text processing and generation (NLP = natural language processing). Overall, the available technologies have been further improved and combined with great effort, especially by the major players (Google, Facebook, OpenAI, Microsoft).

What is practically possible?

The use of AI is still essentially limited to four areas of application:

  • Images: image recognition and segmentation
  • Audio: conversion from speech to text and vice versa
  • NLP: text processing and generation
  • Labeled data: prediction of a label (e.g. a price) from a set of features (see the sketch below)
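The fourth point, prediction from labeled data, is the least spectacular but by far the most widespread in practice. A minimal sketch, assuming scikit-learn is available and using made-up example data (area, rooms and year built as features, price as label):

    from sklearn.ensemble import RandomForestRegressor

    # Made-up example: predict a price (the label) from a handful of features.
    X = [[70, 3, 1995],    # living area in m^2, number of rooms, year built
         [120, 5, 2010],
         [45, 2, 1980],
         [95, 4, 2001]]
    y = [210_000, 480_000, 150_000, 330_000]   # observed prices, i.e. the labels

    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    print(model.predict([[80, 3, 2005]]))      # predicted price for an unseen flat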

This list is surprisingly short, measured against the attention AI receives in the media. The most impressive successes of AI, however, come from combining these techniques: voice assistants, for example, use audio to convert the input into text, NLP to recognize the intention of the text, and huge amounts of labeled data, i.e. previous evaluations of similar utterances, to predict the speaker's wish.

The decisive factors for the development of precisely these AI application fields were:

  1. the existence of large quantities of freely available benchmark data sets (data sets for machine learning) on which algorithms could be developed and compared
  2. a large community of researchers who jointly agreed on these benchmark data sets and compared their algorithms in public competitions (GLUE, Benchmarks AI, machine translation, etc.)
  3. the free availability of the resulting models, which serve as the starting point for practical applications (e.g. TensorFlow Hub)

Based on these prerequisites one can quickly assess how realistic some marketing fantasies are. For the frequently touted application field of predictive maintenance, for example, there are neither benchmark data sets nor a community of researchers, and accordingly there are no models.

What’s next?

On the one hand, it is foreseeable that development in AI will initially continue in the above-mentioned fields of application and expand into adjacent areas. On the other hand, areas are emerging which, like the fields above, will be driven forward with large amounts of public and private money (OpenAI and DeepMind, for example, are backed by Elon Musk and Google respectively with investments in the billions). Autonomous driving is certainly an example of large investments in this area, but so is IoT. In total, I see the following areas developing strongly in 2020-2022:

  • The combination of reinforcement learning with the other AI areas for faster learning of models
  • A further strengthening in the area of autonomous driving resulting from the application and combination of AI and reinforcement learning
  • Breakthroughs in the generalization of the knowledge gained from image processing to 3D (Geometric Deep Learning and Graph Networks)
  • A fusion of traditional methods from statistics with neural networks
  • IoT time series (see below)

I see a big change coming with the rise of IoT and the associated sensor technology and data. By their very nature, IoT data are time series that must be filtered, combined, smoothed and enriched before they can be evaluated. Relatively little specific work has been done for this purpose to date, so the years 2020-2022 could hold some surprising twists and breakthroughs here. German industry in particular, which has benefited rather little from the first wave of AI developments, should find a promising field of application here.
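To give an impression of what "filtered, combined, smoothed and enriched" means in practice, here is a minimal pandas sketch on a synthetic sensor series; the sensor values are made up, and the steps (resampling to a regular grid, rolling-window smoothing, adding a time feature) merely stand in for a real preprocessing pipeline:

    import numpy as np
    import pandas as pd

    # Synthetic IoT sensor readings: a raw 37-second sampling grid with measurement noise.
    rng = np.random.default_rng(0)
    timestamps = pd.date_range("2022-01-01", periods=500, freq="37s")
    raw = pd.Series(20 + np.sin(np.arange(500) / 50) + rng.normal(0, 0.5, 500),
                    index=timestamps, name="temperature")

    # Typical preprocessing steps for IoT time series:
    regular = raw.resample("1min").mean().interpolate()                   # align to a regular 1-minute grid
    smoothed = regular.rolling("10min").mean()                            # smooth with a rolling window
    enriched = smoothed.to_frame().assign(hour=lambda df: df.index.hour)  # enrich with a time feature

    print(enriched.head())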