Information Theoretically Attractive General Measures of Processing Capability — And Why They Are Undecidable

This 1997 paper attempts to make precise some possible ways of quantifying an organism’s, computer’s, or other system’s general processing power.

Picturing the system for convenience as a black box situated between an input stream and an output stream, general processing capability is taken to be reflected by the system’s capacity either to extract information content from its input stream or to add information to that input stream, where these capacities are subsequently expressed in the system’s output stream. While such a measure may be made precise and rigorous in information theoretic terms, an unfortunate side effect is that the statements made possible by the measure are generally undecidable. Furthermore, although upper bounds for the different information contents underlying this type of measure may be obtained empirically, the distance from empirically established upper bounds to the real upper bounds remains formally undecidable.

1. Preliminary note on algorithmic information theory

Recall that the algorithmic information content, or algorithmic complexity, I(x) of an object (or string) is defined as the length of the minimal self delimiting program for a Universal Turing Machine U which generates a description of that object for a given level of precision. Note that the choice of a particular U never introduces more than O(1) difference in the value of I(x), since a program prefix s which makes U simulate some other particular Turing Machine is of constant length. (For any other computer M, a program p for M can be prefixed with a fixed length segment of code s for U such that sp causes U to perform the same computation as p for M. The fixed length segment simply tells U how to simulate the behaviour of M.)

The principal advantage of the algorithmic complexity measure is its immunity to the arbitrary variation in a string’s length which may be introduced by altering the method used to encode, compress, or expand the string. For instance, a rendition of the Bible’s text re-written to include 4 billion extra blank spaces between each character will have only a minuscule amount of additional information content, relative to the normal Bible text.

2. Precise general measures of processing capability

On first glance, it seems helpful to measure a system’s general processing capability in terms of its ability to extract information from its inputs or its ability to take those inputs and add information to them. One crude way of making this precise is simply to measure the algorithmic information content of the system itself; although this is almost certain to over-estimate the system’s capability, since one would normally expect the system to be architecturally unable to reflect that full information content in its outputs, it does provide an absolute upper bound.

In addition to a lack of availability for manipulation of inputs, however, the system’s absolute information content may also not be the right information content to be computationally valuable: even if it is available, the information may not underwrite any ability to process the system’s inputs in a useful fashion. (For instance, a database of average skirt lengths from 1920 to the present may be singularly unhelpful if a system must process tax returns.) With this weakness in mind, the rough general measure may be made more sophisticated, at the cost of relativising it to a particular input domain, by appealing to the formal measure of mutual information content between two objects x and y. Mutual information content I(x : y) is defined as the extent to which knowledge of one object helps in the calculation of the other. This is equivalent to the extent to which they can be calculated together more succinctly than separately. Since the relationship is symmetrical,

I(x : y) = I(x) – I(x | y) + O(1) = I(y) – I(y | x) + O(1)

= I(x) + I(y) – I(x, y) + O(1)

= I(x) + I(y) – I(y, x) + O(1)

Here, I(x | y), the conditional information content, is defined as the size of the minimal program to calculate x from a minimal description of y, while I(x, y) is the joint information content of two objects x and y, defined as the length of the minimal program for computing x and y together.

Mutual information content may be exploited in many different ways to create measures which more appropriately reflect the information a system may extract from its input stream or the amount of information it may add to it. For instance, one might consider the three mutual information relations between inputs, outputs, and the system itself, as well as the relationships between them. Just as one example (and ignoring O(1) adjustments for the moment), the ratio between, on one hand, a system’s mutual information content with some input domain and, on the other, its overall information content, quantifies the proportion of the system’s information content which is information about its environment (where that environment is understood to be reflected by the input domain). Likewise, one crude reflection of the amount of information a system adds to its input stream is the conditional information relation between a given input domain and the corresponding output range. Many other varieties of measures may be concocted atop the general framework outlined here, with particular examples tuned to reflect on an absolute scale each significant component of a system’s overall ‘power’ to manipulate information.

Information theoretic measures of this type carry attractive advantages in terms of logical structure, precision, and absoluteness. They may be understood as the underlying formal substance of loosely specified goals such as measuring how efficient a system is at compressing information, how well it extracts patterns from background noise, with what degree of complexity it can transform its inputs, and so on.

3. The little problem of decidability

3.1 Hard limits of computation and proof

I have taken no great care to describe in detail the range of available information theoretic measures of processing power, however, because it turns out that their advantages come at a price-one which for many purposes is too high. Namely, they are in the general case formally undecidable.

Recall that Chaitin’s generalisations of Gödel’s seminal work on incompleteness show that the algorithmic information content n of a particular string cannot be proven within any formal system unless that formal system itself contains axioms (actually, the right axioms) with information content greater than or equal to n c, where c is a fixed constant. Alternatively, it is not possible to prove with n bits of axioms any string’s information content that is greater than a fixed number of bits above n.

Of course, we may establish empirically an upper bound on the information content of an object simply by coming up with any old description of it. The minimal self delimiting program for generating a description of such an object obviously cannot be any longer than a self delimiting program which just prints out that description directly. As it happens, considered in the space of all possible strings, most strings cannot be generated any more succinctly than simply printing them out directly; most strings, where xn denotes a string of length n, satisfy:

I(xn) = n + I(n) + O(1)

= n + O(log2(n))

We say a string is random, or algorithmically complex, when its information content satisfies the above-i.e., when the minimal program for generating it isn’t substantially shorter than the string itself, when its information content approximately equals its length. (This last holds because log2(n) grows much more slowly than n, so the length of xn comes to dominate I(xn).) The dominance of random strings might be made intuitively more obvious by considering the basic cardinality relationship between strings of a given length and strings of a shorter length. For instance, fewer than one in a million strings can be compressed by twenty bits or more, because the set of strings twenty bits shorter than a given length x is roughly only one millionth the size of the set of strings of length x.

Given that random strings enjoy such a commanding majority in the space of all strings, it might seem it should be easy to exhibit a provably random string of any specified length-but of course this is impossible. (Note that in general, the problem of determining a string’s information content is formally equivalent to the halting problem.) Likewise, it might seem that empirically obtained upper bounds on information content should be very close to real information content-or at least that they should differ from them by only an easily estimated amount. If we knew in advance that we were concerned only with genuinely random strings, this would be trivially true.

But the trouble is that we are not. When it comes to objects whose information processing capability we might like to measure, we are nearly always concerned with highly ordered objects either designed by human beings or shaped by the course of evolution. And, unfortunately, the function a(n), defined as the largest natural number with information content less than or equal to n, grows more rapidly than any computable function. (This is the information theoretic equivalent of S(n), the busy beaver function.) In other words, the relationship between a nonrandom string’s length and its information content can vary astronomically. This is one of Nature’s little jokes. It means that with only n bits of axioms, calculating the set of all strings of complexity less than n bits requires extremely long proofs-on the order of a(n). The alternative extreme is to assume all that we want to prove, which, not surprisingly, gives more concise proofs, but the number of bits of axioms we must assume can easily outstrip the number of particles in the cosmos. The upshot is that empirically obtained upper bounds on information content are useful primarily for showing when a string is nonrandom and not for offering much glimpse into real information content.

3.2 Extrapolating to other formal frameworks

Note that any measure of general processing capability which claims to offer the benefits of measures described here (in terms of absoluteness, precision, etc.) must also be formally undecidable in the general case. Why? If a candidate measure were decidable and bore such a straightforward resemblance to the measures described here, it would offer a decidable method of evaluating undecidable propositions, which is impossible. Alternatively, no measure which is itself decidable can bear a computable relationship to the measures described here.

If someone claims to offer a decidable measure which reflects general computational power of the sort outlined here, we can be certain that either 1) their measure is subject to wild deviations caused by variations in input or output encoding or 2) they are a few bits short of a full string.

4. The empiricist’s objection

Consider the following objection from our mate the hard headed empiricist, (who also happens to be an avid fan of pattern recognition technology):

But wait, this is all too counterintuitive! Surely if I measure a system’s pattern recognition capability, I’m measuring it’s ability to extract information from its inputs; a system which more quickly, more accurately, or more generally recognises patterns surely is more powerful than another which is slower, less accurate, or can recognise fewer patterns. What’s wrong with simply choosing a big set of patterns (facial images, say) and then ranking systems based on their ability to recognise those patterns? This would be real data, not just some logical gymnastics! Furthermore, what if I measure a system’s ability over a whole range of tests like this, including visual pattern recognition, auditory pattern recognition, and so on? Couldn’t I thereby build up an empirically grounded measure of general processing capability which was independent of all this formal fishiness?

Much is attractive about this empirical line, but such a measure will remain firmly anchored to the particular patterns offered in the input domain, and it will be impossible to extrapolate between systems which operate on different patterns. (And worse, there are no general purpose pattern recognizers. There are only particular systems which either evolved or were manually designed to recognise particular patterns.) Why? We have no way of measuring the absolute information content of the input patterns, so we have no way of quantifying the difficulty of the problem! Claims (either relative or absolute) about the general computational difficulty of problem sets cannot be backed up, because we cannot measure their algorithmic complexity. The empiricist has simply pushed the general problem of decidability back one stage from the system itself to the problem set-but it has not gone away, and the move takes the game not one bit closer to a decidable finish. As in the case of general measures of intelligence (see the short companion paper ‘”Intelligence” is a Bad Word’), we can measure capability with respect to particular problem niches, but we cannot extrapolate or compare between them.

Many objections to the line of thought explored in this paper are like the above: verbally, they seem very convincing, but analysed information theoretically, the scenario described turns out either to be impossible or something markedly different than what was promised. Some may find it unfortunate or disappointing that problems of incompleteness bear on all formal systems of reasonable power, but either way it is an impenetrable fact of life.

5. Discussion and alternatives

I have argued above that any formal measure of general processing power is either subject to wild deviations due to variations in the encoding of inputs or outputs, or it is in the general case formally undecidable. The conclusion I would like to draw from this is not that formal measures of processing power are entirely worthless, but simply that their effective use must be informed by a firm grasp on the mechanics of incompleteness. The hard limits dictated by decidability considerations strongly constrain the range of sound arguments which may be constructed with such formal measures; it is remarkably easy to generate apparently convincing but entirely fallacious arguments in this area if these limits are not kept in mind.

I have previously spent some time exploring formal measures related to those described here and have developed a measure which I believe does a better job of quantifying the complexity of an input/output process than those outlined here do of quantifying processing capability. The measure, called functional logical depth (an extension of Charles Bennett’s ordinary logical depth, which is in turn an extension of Greg Chaitin’s information content), links process complexity to the length of time taken by a concise program to mimic the behaviour of a physical system’s input/output relationship over a specified input domain and output range. I have applied the measure to ground a re-construction of the problematic notion of functional decomposition, allowing us to speak precisely about a system’s internal organisation into areas with differing functional responsibilities. Like other such measures, however, functional logical depth is undecidable; no one will build a functional logical depth meter any more than they will build an automatic halting problem solver.

This article was originally published by on and was last reviewed or updated by Dr Greg Mulhauser on .

Mulhauser Consulting, Ltd.: Registered in England, no. 4455464. Mulhauser Consulting does not provide investment advice. No warranty or representation, either expressed or implied, is given with respect to the accuracy, completeness, or suitability for purpose of any view or statement expressed on this site.

Copyright © 1999-2021. All Rights Reserved.