# Complexity & Information Theory

This article explains the basic concepts of algorithmic information theory, sometimes known as Kolmogorov complexity.

## Introduction

Many alternative mathematical frameworks exist for quantifying the notions of complexity or information, including the algorithmic information theory discussed here, the classical information theory of Shannon and Weaver (1949), logical depth (originally discussed in print by Chaitin 1977; also see Bennett 1982, 1987a, 1990), and others. Most share formal similarities with the others as well as with Bayesianism. For additional information and on-line tutorials about algorithmic information theory, visit the home page of my colleague Greg Chaitin, one of its inventors.

The central idea of algorithmic information theory is to quantify the information content of an object or bit string in terms of its shortest description. In other words, if an object can be described easily in a short space, it is of low complexity or information content, while if describing it takes more space, then it is of higher complexity or information content. It can be useful to think of the ‘shortest description’ as a kind of ‘self-extracting archive’ of the target string, so that a string’s information content is the size of it once it has been compressed into a self-extracting archive. One of many interesting features of algorithmic information theory is that it turns out that in the space of all possible strings, hardly any strings can actually be compressed at all.

## Basic Definitions of Information Theory

We define the algorithmic information content I(x) of an object (or bit string) x as the length of the shortest self delimiting program(s) for a Universal Turing Machine U which generates a description of that object for a given level of precision. (Self-delimiting programs, or ‘instantaneous code’, aren’t allowed to rely on a blank marker on the Turing Machine tape to signify the end of the program and thus must include within themselves an indication of their own length.) An infinite class of different algorithms can generate any finite bit string, but we are interested in a minimal program, one which is not itself the output of another which is still shorter. Notice that the choice of a particular U cannot introduce more than O(1) difference in the numerical value of I(x), since any prefix s which makes U simulate another computer is of constant length.

The framework of algorithmic information theory was developed independently in the 1960s by R.J. Solomonoff (1964; first discussed in print by Minsky 1962) for quantifying the power of scientific theories; by A.N. Kolmogorov (1965), then of the Soviet Academy of Sciences; and by Gregory J. Chaitin (1966, 1977) while he was an 18-year old undergraduate at the City University of New York. The account here primarily follows Chaitin’s conventions.

Consider two strings:

1111111111111111111111111111111111111111

0111011010010100101000110110111010010100

Intuitively, the first string, forty ‘1’ digits, contains less information than the second, an apparently patternless batch of ‘1’ and ‘0’. Alternatively, the first doesn’t seem as random as the second. The algorithmic information content measure preserves this intuition, because the internal pattern of the first means it can be specified succinctly with a program instructing a computer just to “print ‘1’ forty times”. The string’s algorithmic information content, or aglorithmic complexity, is bounded from above by the length of the binary sequence specifying ‘forty’, together with the code which tells the computer to print a digit a certain number of times. Equivalently, the bulk of its information content comes from its length; the sequence making up that length, being a simple ‘1’, is comparatively unimportant. In general, since it takes about log2(n) binary digits to specify n, the complexity of such a regular string xn, of length n, satisfies:

I(xn) = I(n) + O(1) = O(log2(n))

Because log2(n) grows much more slowly than n, programs for generating such regular strings grow much more slowly than the strings themselves. Comparatively little separates a program which instructs “print ‘1’ forty times” and another which instructs “print ‘1’ four hundred times”. Put differently, strings of this first type are highly compressible. The underlying pattern makes compressing the string into a shorter space very easy. Of course, many other strings are also highly compressible by virtue of slightly trickier patterns. For instance, the binary expansion of pi is of paradigmatically low algorithmic information content because we can specify a program for generating its digits concisely. The same is true for the many other cryptoregular strings such as the binary expansion of the square root of 2.

But the shortest program for calculating the second string above, the apparently patternless set of characters, seems to be one which just prints it out directly:

“Print ‘0111011010010100101000110110111010010100’.”

The absence of organisation which can be easily specified makes the string virtually incompressible. Indeed, almost all strings share this property, and in general their information content, again where xn denotes a string of length n, satisfies:

I(xn) = n + I(n) + O(1) = n + O(log2(n))

The information content of such an incompressible string comes from its length as well as the actual sequence making up that length. (Recall that self delimiting programs contain within themselves information about their own length.) A string is random, or algorithmically complex, when its information content satisfies the above — i.e., when the minimal program for generating it isn’t much shorter than the string itself, so its information content approximately equals its length. (Again, log2(n) grows much more slowly than n, so the length of xn comes to dominate I(xn).) Randomness is normally a vague concept, but by replacing the word ‘approximately’ with a specific value, algorithmic information theory offers a quantitative choice as to when a string should be dubbed random. It’s clear, incidentally, that any minimal program is also random. Why? Suppose minimal program p generates x. If p weren’t random, then by definition some substantially shorter program r generates p. The string x could therefore be generated by a program specifying “calculate p from r, and then use p to calculate x“. Since that program is only slightly larger than r and is thus shorter than p, p cannot have been a minimal program.

The notion of randomness can be extended to random strings: an infinite string is random when its finite initial segments xn of length n are random.

It’s easy to see that most strings have near maximal algorithmic information content (equivalently, most strings are random). Basic cardinality considerations illustrate why. For example, there are only 2n – 20 – 2 programs with length in the range [1…n – 20], each of which generates at most one string of length n. So the proportion of strings of length n with complexity less than n – 20 is less than 2n – 20/2n , or roughly one in a million: fewer than one in a million strings can be compressed by twenty bits or more. In general, the proportion of strings of length n with information content less than n – c decays exponentially with increasing c. Another of these short introductory papers explains why, despite the fact that algorithmically complex strings outnumber their simpler cousing by a wide margin, proving that given strings are random is generally impossible.

## Joint, Relative, and Mutual Information

Algorithmic information theory extends easily to relationships between two objects.

We define the joint information I(x, y) of two objects x and y as the length of the minimal program which computes x and y together. Joint information is bounded from above by the sum (plus a constant) of the individual information contents. The conditional, or relative information content I(x | y) is the size of the minimal program to calculate x given a minimal program for generating y. Since a program can always, at the least, just calculate x while ignoring y, the relative information content of an object x with respect to anything else is bounded from above by I(x). Conditional information content offers another angle on randomness: maximally complex strings have I(xn | n) close to n, while for highly ordered strings the same quantity approaches zero. The two measures are related by the fact that joint information content is equivalent to the sum of the lengths of the programs to compute first one string and then the other from it:

I(x, y) = I(x) + I(y | x) + O(1) = I(y) + I(x | y) + O(1)

Making this work as above depends on our stipulating that we are given a minimal program for generating y. The snag is that without some change, the error bounded above with O(1) becomes unbounded. Why? Suppose, as above, that

I(x, y) = I(x) + I(y | x) + O(1)

and, instead of y, ask about the joint information content of a string and its information content:

I(x, I(x)) = I(x) + I(I(x) | x) + O(1)

But

I(x, I(x)) = I(x) + O(1)

This is because a minimal program for producing x already includes within it a specification of its own length; thus the length of the minimal program which outputs both x and its information content is only slightly longer than a program for producing x alone. Moreover, clearly

I(I(x) | x) <> O(1)

In fact, without the special stipulation, I(I(x) | x) is unbounded (Chaitin 1975). That the size of the minimal program to calculate I(x) from x is unbounded connects nicely with the material on incompleteness.

The degree to which I(x, y) < I(x) + I(y) suggests a measure of mutual information I(x : y), defined as the extent to which knowledge of one object helps in the calculation of the other — i.e., the extent to which they can be calculated together more succinctly than separately. The relationship is symmetrical:

I(x : y) = I(x) I(x | y) + O(1) = I(y) – I(y | x) + O(1)

= I(x) + I(y) – I(x, y) + O(1)

= I(x) + I(y) – I(y, x) + O(1)

Like many of Chaitin’s modern definitions for algorithmic information theory, a slightly different ensemble version of mutual information content appears in the original classical information theory. While it doesn’t feature in the present discussion, Gács and Körner (1973) also suggest an interesting related measure of common information, the largest information content of a string which can be calculated from each of two strings by programs bounded in length by a given value.

Finally, two strings are algorithmically independent when their mutual information is as small as possible. My book Mind Out of Matter sets out some ways in which we might extend basic information theory further to ground the troublesome notion of representation.

## Completeness and Halting

Among other applications, algorithmic information theory offers powerful tools for understanding the fundamental limits of mathematical reasoning and computation. See the other short paper on classical computability, completeness, and halting.