
General Conclusion

A number of social sciences, as we have seen, were born at the same time as probability and now routinely use its concepts. These play an essential role in population sciences and in fields such as epidemiology and economics. However, the connection is not always as close in other social sciences.

The first part of this conclusion will describe the current situation more specifically in sociology and in artificial intelligence, a science that in the past relied mainly on nonprobabilistic methods.

The causal theory developed in that field, which rests on causal diagrams, the notion of counterfactual causality, and structural equations, will lead us to examine in broader terms how different theories of causality fit into the social sciences.

We shall then return to the notions of individual and levels before discussing how probabilistic reasoning is incorporated into the forecasting of individual and collective behavior.

In this General Conclusion, we shall therefore need to address these topics in greater detail. Although the scope of our book precludes an exhaustive treatment, we offer some suggestions for more clearly assessing the situation in a larger number of social sciences.

Our epilogue summarizes the main findings of our study, the issues that still need to be addressed, and the pathways toward a fuller analysis of societies.

Generality of the use of probability and statistics in social science

In our detailed examination of the history of population sciences over three and a half centuries, we have seen how strongly their concepts and methods depended on the notions of probability and statistics, which emerged almost simultaneously. Although the links may have seemed looser at certain moments, population scientists, probabilists, and statisticians cooperated closely most of the time. Often, it was the same scientist who, like Laplace, designed the probabilistic methods, developed the appropriate statistics, and applied them to population issues (see Chapters 3 and 4).

In Chapters 1, 2, and 3, we saw how other social sciences, as well, relied heavily on probability and statistics for tackling certain problems. Those disciplines include, together with population sciences, economics, epidemiology, jurisprudence, education sciences, and sociology. Admittedly, we have not examined them in depth, and it is possible that they may not always need probability in their work.

For instance, we have shown (Chapters 1 and 4) that Durkheim’s sociology required the concomitant-variation method, i.e., linear regressions, to establish causal relationships (Durkheim, 1895):

We have only one means of demonstrating that a phenomenon is the cause of another: it is to compare the cases where they are present or absent simultaneously and to determine if the variations that they display in these different combinations of circumstances are evidence that one depends on the other.

In his study on suicide (Durkheim, 1897), for example, he observed that suicide rates varied with the local percentage of Protestants, and he deduced the more general conclusion that:

[s]uicide varies in inverse proportion to the degree of integration of religious society.

He showed that the same reasoning applied to domestic and political society. To explain suicide, he therefore sought a cause common to all these societies:

Now the only one that meets this condition is that these are all strongly integrated social groups. We therefore arrive at this general conclusion: suicide varies in inverse proportion to the degree of integration of the social groups to which the individual belongs.

In other words, his demonstration, while based on probability, transcends the probabilistic approach in order to identify the more general causes of a specific sociological phenomenon: suicide.
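To make the concomitant-variation method concrete in computational terms, the following sketch fits a simple linear regression, in Python, to regional figures for the share of Protestants and the suicide rate. The numbers are invented for illustration only; they are not Durkheim’s data.

```python
# Minimal sketch of the concomitant-variation method read as a simple
# linear regression. The figures below are invented for illustration;
# they are NOT Durkheim's data.
import numpy as np

# Hypothetical regional data: share of Protestants (%) and suicide rate
# (per 100,000 inhabitants).
protestant_share = np.array([5, 15, 30, 50, 70, 90], dtype=float)
suicide_rate     = np.array([80, 95, 130, 160, 200, 230], dtype=float)

# Ordinary least squares for y = a + b * x.
b, a = np.polyfit(protestant_share, suicide_rate, deg=1)

# Pearson correlation as a measure of "concomitant variation".
r = np.corrcoef(protestant_share, suicide_rate)[0, 1]

print(f"slope b = {b:.2f}, intercept a = {a:.2f}, correlation r = {r:.2f}")
# The strong positive slope is the statistical regularity; Durkheim's further
# step is to interpret it through the degree of integration of social groups.
```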

The same is likely true in other social sciences, but we can also assume that while many use probability calculus, some do not make it their prime method. We have seen this assumption confirmed in sociology; below, we shall examine whether it also applies to artificial intelligence.

Another point is that some approaches used in population sciences are common to other social sciences as well.

For instance, the event-history approach, whose probabilistic bases we have shown to be essential, is used not only in many social sciences but also in mechanics and physics, as it applies to the more general study of phenomena occurring over time. Examples for which it is perfectly suited include: measuring task performance in psychological experiments; medical and epidemiological studies on the development of diseases; studies on the durability of manufactured parts and machines; studies on the length of strikes and unemployment spells in economics; and studies on the length of traces left on a photographic plate in particle physics.
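As a minimal illustration of event-history reasoning, the following Python sketch computes a Kaplan-Meier survival curve for unemployment-spell durations, some of which are right-censored. The durations and censoring flags are hypothetical.

```python
# Minimal sketch of an event-history (survival) calculation: a Kaplan-Meier
# estimator applied to invented unemployment-spell durations (in months).
# A censored duration means the spell was still ongoing when observation stopped.
from collections import Counter

durations = [2, 3, 3, 5, 6, 6, 6, 8, 10, 12]      # hypothetical data
censored  = [False, False, True, False, False,     # True = right-censored
             True, False, False, True, False]

def kaplan_meier(durations, censored):
    """Return [(time, survival probability)] at each observed event time."""
    events = Counter(t for t, c in zip(durations, censored) if not c)
    survival, curve = 1.0, []
    for t in sorted(events):
        at_risk = sum(1 for d in durations if d >= t)   # still under observation
        survival *= 1 - events[t] / at_risk             # product-limit step
        curve.append((t, survival))
    return curve

for t, s in kaplan_meier(durations, censored):
    print(f"S({t}) = {s:.3f}")
```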

Likewise, the multilevel approach—which studies data that are ranked hierarchically or belong to different levels—is widely used in education sciences, medical sciences, organization sciences, economics, epidemiology, biology, sociology, and other fields. Here as well, scientists use characteristics measured at different aggregation levels in their search for an overall treatment of a more general problem posed by the existence of levels in all sciences. These methods, too, are based on probability, and in particular on the crucial notion of exchangeability.
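The following Python sketch illustrates, under invented assumptions, the multilevel logic just described: pupils nested in schools, with each school mean partially pooled toward the grand mean. This shrinkage is the kind of borrowing of strength across groups that exchangeability justifies. The setting, the variance values, and all figures are hypothetical; in practice the variances would themselves be estimated from the data.

```python
# Minimal sketch of multilevel reasoning: pupils nested in schools (both the
# setting and all numbers are invented). School means are shrunk toward the
# grand mean in proportion to the between- and within-school variances.
import numpy as np

rng = np.random.default_rng(0)
n_schools, n_pupils = 8, 20
tau, sigma = 2.0, 5.0                      # between- and within-school sd (assumed known here)

school_effects = rng.normal(0.0, tau, n_schools)
scores = [50 + u + rng.normal(0.0, sigma, n_pupils) for u in school_effects]

grand_mean = np.mean([s.mean() for s in scores])
# Shrinkage weight: reliability of each school mean.
weight = tau**2 / (tau**2 + sigma**2 / n_pupils)

for j, s in enumerate(scores):
    pooled = grand_mean + weight * (s.mean() - grand_mean)
    print(f"school {j}: raw mean {s.mean():5.1f} -> partially pooled {pooled:5.1f}")
```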

However—like Durkheim, who sought to generalize the results obtained with the aid of regression methods—most social sciences aim beyond the mere observation of statistical regularities, identified with the aid of probabilistic and statistical models. Hence the importance of intensifying the search for whatever tools can supplement the use of probability in the social sciences.

Shafer (1990) clearly frames the problem of the limits of the application of probability to certain social sciences:

An understanding of the intellectual content of applied probability and applied statistics must therefore include an understanding of their limits. What are the characteristics of problems in which statistical logic is not helpful? What are the alternatives that scientists, engineers, and others use? What for example are the characteristics of problems for which expert systems should use nonprobabilistic tools of inference?

He suggests that we should seek the reasons for the use of these nonprobabilistic methods in certain sciences: ‘We must, for example, understand the nonprobabilistic methods of inference for artificial intelligence […].’ Accordingly, we shall review the situation in artificial intelligence, but not in the same detail as our analysis of population sciences.

While the origins of artificial intelligence go back to Antiquity, it is once again Pascal (1645) who, with his arithmetic machine, stands out as one of the true forerunners of the science:[1]

[T]he instrument compensates the failings due to ignorance or lack of habit, and, by performing the required movements, it executes alone, without even requiring the user’s intention to do so, all the shortcuts of which nature is capable, and every time that the numbers are arranged on it.

Although he does not actually claim that the machine can think, he does note that it can perform operations without memory errors, particularly all arithmetical calculations regardless of complexity.

However, it was not until the twentieth century that ways were found to formalize arithmetical reasoning, then set theory, by means of Gödel’s incompleteness theorems (1931), Turing’s machine (1936), and Church’s Lambda calculus (1932). First, Gödel’s two incompleteness theorems showed that such axiomatized theories contain true but unprovable statements. Second, Turing’s machine, similar to a computer but with no limit on its memory space, made it possible to analyze a problem’s effective computability. Lastly, Church’s Lambda calculus provided a formal system for defining a function, applying it, and repeating it recursively.
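To give a concrete, if rudimentary, sense of what ‘effective computability’ means, here is a minimal Turing-machine simulator in Python. The example machine, which merely increments a binary number, is our own toy illustration and is not drawn from Turing’s paper.

```python
# Minimal sketch of a Turing machine simulator, to make "effective
# computability" concrete. The example machine increments a binary number;
# the head starts on the rightmost digit. (Illustrative toy only.)

def run_turing_machine(tape, state, head, rules, max_steps=1000):
    """rules: (state, symbol) -> (new_symbol, move in {-1, +1}, new_state)."""
    tape = dict(enumerate(tape))              # unbounded tape as a dict
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape.get(head, "_")          # "_" is the blank symbol
        new_symbol, move, state = rules[(state, symbol)]
        tape[head] = new_symbol
        head += move
    return "".join(tape[i] for i in sorted(tape) if tape[i] != "_")

# Binary increment: turn trailing 1s into 0s, then write a final 1.
rules = {
    ("carry", "1"): ("0", -1, "carry"),
    ("carry", "0"): ("1", -1, "halt"),
    ("carry", "_"): ("1", -1, "halt"),
}

print(run_turing_machine("1011", "carry", head=3, rules=rules))  # -> "1100"
```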

This sequence paved the way for artificial intelligence with Turing’s article (1950) envisaging the creation of machines endowed with true intelligence. In its most outspoken form, artificial intelligence refers to a machine capable not only of producing intelligent behavior, but also of experiencing true self-consciousness and of understanding its own logic. Let us now examine some stages in the development of the science and their connections to probability.

Solomonoff elaborated a general theory of inductive inference. Taking a long sequence of symbols that contained all the information to be used in an induction, he sought to design the best prior probability distribution for the next symbol (Solomonoff, 1964a, 1964b).[2] He relied especially on Turing’s work. Interestingly, probabilists largely overlooked this theory of algorithmic probability for a very long time: as we shall see later, symbolic logic was the main qualitative tool for representing intelligence before 1980.

Solomonoff’s method is based on the following principle. Let us take, for instance, the sequence of numbers 2, 4, 6, 8 and try to determine the probability distribution of the next number. It should be noted that very often—for example, in IQ tests—the respondent is asked to give the next number directly, not its distribution. Indeed, when we examine the sequence, we immediately assume that the $n$th term should be $2n$. In principle, therefore, the answer for the fifth term is 10. But in fact there are many sequences that begin with the same four terms. For example, the sequence expressed by the formula $2n + \frac{11}{3}(n-1)(n-2)(n-3)(n-4)$ also begins with the first four numbers and yields another solution to our problem: 98. Why, then, do we regard the first formula as the most likely? No doubt because we unconsciously apply the principle of Occam’s razor: ‘entities must not be multiplied beyond necessity’.[3] To solve this problem, we thus need to consider all possible solutions and give their distribution. More specifically, it is preferable to weight each of these answers by a function reflecting its complexity. That function may be Kolmogorov’s complexity[4] $K(s)$, defined as the length of the shortest description of the sequence $s$ in a universal description language, such as Church’s Lambda calculus, executed by a Turing machine. Solomonoff defines a prior algorithmic probability on the space of all possible binary sequences, equal to $M(x) = \sum_p 2^{-\ell(p)}$, where the sum applies to all descriptions $p$, of length $\ell(p)$, of infinite sequences starting with the string $x$. Of this probability’s many properties, the most interesting is that the sum of the quadratic prediction errors over a sequence is bounded by a constant term, which implies that the algorithmic probability tends toward the true probability as the number $n$ of observed terms tends to infinity, faster than $1/n$.
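The following Python sketch is a finite caricature of this reasoning, not Solomonoff’s actual (incomputable) construction: it enumerates a handful of candidate formulas for the sequence 2, 4, 6, 8, uses the length of each formula’s text as a crude stand-in for Kolmogorov’s complexity, discards the formulas contradicted by the data, and weights the survivors by $2^{-\text{length}}$ to obtain a distribution over the fifth term. The hypothesis class and the weighting are our own illustrative choices.

```python
# Toy caricature of algorithmic induction: candidate rules for 2, 4, 6, 8,
# weighted by 2**(-description length) as a crude stand-in for Kolmogorov
# complexity. The candidates and weights are illustrative assumptions.
from collections import defaultdict

candidates = [
    "2*n",
    "2*n + 11*(n-1)*(n-2)*(n-3)*(n-4)/3",   # also fits 2, 4, 6, 8; gives 98 at n = 5
    "n*n - 3*n + 4",                        # contradicted by the data
    "2**n",                                 # contradicted by the data
]
sequence = [2, 4, 6, 8]

def evaluate(expr, n):
    # Evaluate the candidate formula at position n.
    return eval(expr, {"n": n})

posterior = defaultdict(float)
for expr in candidates:
    if [evaluate(expr, n) for n in range(1, 5)] != sequence:
        continue                                          # rule contradicts the data
    posterior[evaluate(expr, 5)] += 2.0 ** (-len(expr))   # shorter rule -> heavier weight

total = sum(posterior.values())
for value, weight in sorted(posterior.items()):
    print(f"P(next term = {value}) = {weight / total:.2e}")
```

The shortest rule consistent with the data receives almost all of the probability mass, which is the formal counterpart of the appeal to Occam’s razor made above.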

Unfortunately, the method’s main drawback is that the model is generally incomputable—or rather is calculable only asymptotically—because Kolmogorov’s complexity is incomputable as well. However, there are proxy solutions that make allowance for the calculation time and, under these assumptions, offer a partial solution to the problem.

This theory is applicable to many problems in artificial intelligence, using probability distributions to represent all the information relevant to solving them. Solomonoff (1986) applies the theory to passive-learning problems, in which whether the agent’s current prediction is correct or not has no impact on the future series. But we need to go one step further and examine the general case of an agent capable of performing actions that affect its future environment. Hutter (2001) extended Solomonoff’s model to active learning by combining it with sequential decision theory. This allowed the development of a very general theory applicable to a large class of interactive environments.
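The following Python sketch is a deliberately small, computable caricature of this combination of a complexity-weighted prior with sequential decision-making: the agent holds weights over three candidate environments, chooses the action with the highest mixture-expected reward, and discards the environments contradicted by what it then observes. The environments, rewards, and ‘description lengths’ are all hypothetical.

```python
# Finite, computable caricature of combining a complexity-weighted prior over
# environments with sequential decision-making. Each candidate environment is
# a deterministic rule mapping an action to (observation, reward); all numbers
# and names are invented for illustration.

ENVIRONMENTS = {
    "always_left_pays":  (3, lambda action: ("ok", 1.0 if action == "left" else 0.0)),
    "always_right_pays": (5, lambda action: ("ok", 1.0 if action == "right" else 0.0)),
    "nothing_pays":      (7, lambda action: ("ok", 0.0)),
}

# Prior weight 2**(-description length), with the lengths standing in for complexity.
weights = {name: 2.0 ** -length for name, (length, _) in ENVIRONMENTS.items()}

def choose_action(weights):
    """Pick the action with the highest mixture-expected immediate reward."""
    def expected(action):
        total = sum(weights.values())
        return sum(w * ENVIRONMENTS[name][1](action)[1]
                   for name, w in weights.items()) / total
    return max(["left", "right"], key=expected)

def update(weights, action, observed_reward):
    """Discard environments whose prediction contradicts what happened."""
    return {name: w for name, w in weights.items()
            if ENVIRONMENTS[name][1](action)[1] == observed_reward}

true_env = ENVIRONMENTS["always_right_pays"][1]     # unknown to the agent
for step in range(3):
    action = choose_action(weights)
    _, reward = true_env(action)
    weights = update(weights, action, reward)
    print(f"step {step}: action={action}, reward={reward}, "
          f"surviving environments={sorted(weights)}")
```

The point of the toy is that the agent’s choice of action determines what it can learn about its environment, which is precisely what distinguishes active from passive learning.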

However, the forecasts based on this broader theory are limited not only by the fact that the model is usually incomputable, but also by the fact that the algorithmic probability may fail to converge in certain environments (Legg, 2008). It therefore remains an ideal but unattainable model for inductive inference in artificial intelligence.

In fact, most artificial-intelligence specialists have long viewed symbolic logic as the ideal tool for representing knowledge and solving problems. This approach relied on essentially qualitative methods. Shafer and Pearl (1990) described this period as follows:

Ray Solomonoff, for example, has long argued that AI should be based on the use of algorithmic probability to learn from experience (Solomonoff, 1986). Most of the formal work in AI before the 1980s, however, was based on symbolic logic rather than probability theory.

At the beginning of the 1980s, however, many artificial-intelligence specialists came to realize that symbolic logic would never be able to describe all human processes, such as perception, learning, planning, and pattern recognition. By the mid-1980s, researchers were developing truly probabilistic methods to address these issues (Pearl, 1985).

Pearl’s theories initially focused on Bayesian networks. He introduced the term, and the networks themselves, in an article published in 1985:

Bayesian networks are directed acyclic graphs in which the nodes represent propositions (or variables), the arcs signify the existence of direct causal dependencies between the linked propositions, and the strengths of these dependencies are quantified by conditional probabilities. A network of this sort can be used to represent the deep causal knowledge of an agent or a domain expert and turns into a computational architecture if the links are used not merely for storing factual knowledge but also for directing and activating the data flow in the computations which manipulate this knowledge.

Pearl elaborated the theory in a book (Pearl, 1988) that used the graphs to represent the dependency structures occurring in a number of multivariate probability distributions. Let us see in greater detail how this matching is achieved.

When we analyze human reasoning, we aim to identify the mechanism whereby people integrate data from different sources in order to arrive at a coherent interpretation of them. We can always plot a graph showing these data—or, rather, these propositions—and the links between them. We can then observe that the dependency graph forms a tree structure, with nodes representing the propositions and links, arrowed or not, between the propositions that we regard as directly connected. For example, Figure 1 (Shafer and Pearl, 1990) shows how a doctor:

combines evidence from a physical examination and a health history to get a judgement about how much at risk of heart disease the patient is, and then he or she combines this with the patient’s description of an apparent angina episode to get a judgement about whether the patient really has angina.

From this figure, Shafer and Pearl conclude that:

Physical examination and Health history are conditionally independent of Episode description given Risk.

Figure 1. Diagnosis of angina
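To illustrate numerically the kind of conditional-independence statement that such a network encodes, the following Python sketch builds one possible joint distribution consistent with the structure described above (Risk influencing the examination findings, the health history, and angina, with the episode description depending on angina) and checks that the physical examination and the health history are conditionally independent of the episode description given Risk. The orientation of the arcs and all probability values are our own illustrative assumptions, not those of Shafer and Pearl’s figure.

```python
# Toy numerical check of the conditional-independence statement quoted above,
# under one possible Bayesian-network encoding of the dependency structure:
# Risk -> Exam, Risk -> History, Risk -> Angina, Angina -> Episode.
# All probability values are invented for illustration.
from itertools import product

p_risk    = 0.2                       # P(high risk)
p_exam    = {1: 0.7, 0: 0.1}          # P(abnormal physical exam | risk)
p_history = {1: 0.6, 0: 0.2}          # P(worrying health history | risk)
p_angina  = {1: 0.5, 0: 0.05}         # P(angina | risk)
p_episode = {1: 0.9, 0: 0.1}          # P(episode reported | angina)

def bernoulli(p, value):
    return p if value == 1 else 1.0 - p

# Joint distribution obtained from the factorization implied by the graph.
joint = {}
for r, e, h, a, d in product([0, 1], repeat=5):
    joint[(r, e, h, a, d)] = (bernoulli(p_risk, r)
                              * bernoulli(p_exam[r], e)
                              * bernoulli(p_history[r], h)
                              * bernoulli(p_angina[r], a)
                              * bernoulli(p_episode[a], d))

def prob(**fixed):
    """Probability that the named variables take the given values."""
    total = 0.0
    for (r, e, h, a, d), p in joint.items():
        values = {"r": r, "e": e, "h": h, "a": a, "d": d}
        if all(values[k] == v for k, v in fixed.items()):
            total += p
    return total

# If {exam, history} is independent of the episode description given risk,
# then P(e, h | r, d) must equal P(e, h | r).
for r in (0, 1):
    with_episode    = prob(e=1, h=1, r=r, d=1) / prob(r=r, d=1)
    without_episode = prob(e=1, h=1, r=r) / prob(r=r)
    print(f"risk={r}: P(e,h | r,d)={with_episode:.4f}  P(e,h | r)={without_episode:.4f}")
```

The two conditional probabilities coincide exactly, as the factorization of the joint distribution along the graph guarantees: once Risk is known, the episode description adds nothing to our expectations about the examination and the history.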