CHAPTER 5 The Expectation Operator
Definition 5.1 Let X = (X1, …, Xn) be an n-D random variable with sample space SX and with pdf fX(x). Let g(X) be any function of X. The expectation operator E(*) is defined as
E[g(X)] = ∫SX g(x) fX(x) dx.     (5.1)
The integral (5.1) is an n-D integral (or sum, if X is discrete). It could be argued that (5.1) is the only equation one might need in order to understand anything and everything related to expected values. In this chapter we will attempt to convince the reader that this is the case.
The standard approach to this topic is to first define the mean of X, then the variance of X, then the absolute moments of X, then the characteristic function for X, etc. Then one proceeds to investigate expected values related to the 2-D random variable (X,Y). Then one extends the approach to expected values related to the n-D random variable X = (X1, …, Xn). Then (if the reader is still enrolled in the class ☹) expected values of functions of these random variables are addressed. Often, students become a bit overwhelmed with all the definitions and apparently new concepts. Indeed, all of these concepts are contained in (5.1). To be sure, (5.1) is a ‘dense’ concept. However, if the student can understand this one concept, then he/she will not only feel entirely comfortable with all of the above-mentioned concepts, but the student will be able to recall and expand upon it for years to come. Our approach to engendering a solid understanding of (5.1) will proceed by considering it in the 1-D setting, then the 2-D setting, then the n-D setting.
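To make (5.1) concrete before proceeding, here is a minimal Matlab sketch that evaluates a 1-D instance of (5.1) by numerical integration and checks it against a Monte Carlo average. The distribution, the function g, and the numerical values are chosen purely for illustration.
mu = 2;  sigma = 3;                                            % hypothetical parameters: X ~ N(2, 9)
fX = @(x) exp(-(x - mu).^2/(2*sigma^2))/(sigma*sqrt(2*pi));    % pdf of X
g  = @(x) x.^2;                                                % the function whose expectation we want
Eg_int = integral(@(x) g(x).*fX(x), -Inf, Inf);                % E[g(X)] via the integral in (5.1)
Eg_mc  = mean(g(mu + sigma*randn(1e6,1)));                     % Monte Carlo check
fprintf('integral: %.4f   Monte Carlo: %.4f   exact: %.4f\n', Eg_int, Eg_mc, sigma^2 + mu^2)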
5.1 Expected Values of Functions of 1-D X
Consider the following examples of the function g(*).
Example 5.1
(i) g(x) = x:  E(X) = ∫ x fX(x) dx ≜ μX. This is called the mean value for X.
(ii) g(x) = x²:  E(X²) = ∫ x² fX(x) dx. This is called the mean squared value for X.
(iii) g(x) = (x – μX)²:  E[(X – μX)²] = ∫ (x – μX)² fX(x) dx ≜ σX². This is called the variance of X.
(iv) g(x) = e^(iωx):  E(e^(iωX)) = ∫ e^(iωx) fX(x) dx ≜ ΦX(ω). This is called the characteristic function for X.
(v) g(x) = ax² + bx + c:  E(aX² + bX + c) = ∫ (ax² + bx + c) fX(x) dx.
However, here we can simplify this expression in terms of the above defined parameters:
E(aX² + bX + c) = a ∫ x² fX(x) dx + b ∫ x fX(x) dx + c ∫ fX(x) dx.
Hence, we obtain: E(aX² + bX + c) = aE(X²) + bμX + c. □
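As a quick numerical check of (v), here is a minimal Matlab sketch (the values of a, b, c and the distribution of X are hypothetical) comparing a Monte Carlo estimate of E(aX² + bX + c) against the closed form aE(X²) + bμX + c.
a = 3;  b = -4;  c = 5;  muX = 1;  sigX = 2;      % hypothetical constants and moments
x = muX + sigX*randn(1e6,1);                      % 10^6 samples of X ~ N(muX, sigX^2)
lhs = mean(a*x.^2 + b*x + c);                     % Monte Carlo estimate of E(aX^2 + bX + c)
rhs = a*(sigX^2 + muX^2) + b*muX + c;             % a*E(X^2) + b*muX + c, with E(X^2) = sigX^2 + muX^2
fprintf('Monte Carlo: %.3f   Formula: %.3f\n', lhs, rhs)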
The above examples all follow the same procedure; namely, to compute E[g(X)], you simply integrate g(x) against fX(x). To test your level of understanding, see if you can use (5.1) to prove the following result:
Result 1. σX² = E(X²) – μX²; equivalently, E(X²) = σX² + μX².
Proof: [Done in class]
Example 5.2 Suppose that we are given X, with mean μX and standard deviation σX. Let Y = X².
(a) Use the concept of equivalent events to express FY(y) in terms of FX(x):
FY(y) = Pr(Y ≤ y) = Pr(X² ≤ y) = Pr(–√y ≤ X ≤ √y) = FX(√y) – FX(–√y).     (5.2)
(b) Use the chain rule for differentiation to express fY(y) in terms of fX(x):
Write FY(y) = FX(g1(y)) – FX(g2(y)), where we defined g1(y) = √y and g2(y) = –√y.
Then fY(y) = dFY(y)/dy = fX(g1(y)) g1′(y) – fX(g2(y)) g2′(y), with g1′(y) = 1/(2√y) and g2′(y) = –1/(2√y). Specifically,
fY(y) = [fX(√y) + fX(–√y)] / (2√y)   for y ≥ 0.     (5.3)
(c) For X ~ Uniform(a,b) where 0 ≤ a < b, compute the pdf for Y = X².
Recall that fX(x) = 1/(b – a) for a ≤ x ≤ b. Define the indicator function I[a,b](x) = 1 for a ≤ x ≤ b and 0 otherwise.
Then fX(x) = I[a,b](x)/(b – a), and so (5.3) becomes
fY(y) = [I[a,b](√y) + I[a,b](–√y)] / (2√y (b – a)) = 1 / (2√y (b – a))   for a² ≤ y ≤ b².     (5.4)
The last equality in (5.4) follows from the fact that, since a ≥ 0, I[a,b](–√y) = 0 for y > 0.
Numerical Values: For specified endpoints a and b, we can verify that this expression is, indeed, a pdf:
∫a²^b² fY(y) dy = ∫a²^b² 1/(2√y (b – a)) dy = (√(b²) – √(a²))/(b – a) = 1.
This pdf is plotted below, as is a histogram-based estimate of it, based on 10^5 random numbers.
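Since the particular endpoint values are not reproduced here, the following Matlab sketch uses the hypothetical values a = 1 and b = 2 to overlay the pdf (5.4) on a histogram-based estimate built from 10^5 simulated values of Y = X².
a = 1;  b = 2;                             % hypothetical interval endpoints (with a >= 0)
x = a + (b - a)*rand(1e5,1);               % 10^5 simulated values of X ~ Uniform(a,b)
y = x.^2;                                  % the corresponding values of Y = X^2
yy = linspace(a^2, b^2, 400);
fY = 1./(2*sqrt(yy)*(b - a));              % the pdf (5.4), valid for a^2 <= y <= b^2
histogram(y, 50, 'Normalization', 'pdf'); hold on
plot(yy, fY, 'LineWidth', 2); hold off
xlabel('y'); ylabel('f_Y(y)'); legend('histogram-based estimate', 'pdf (5.4)')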
(d) Use (5.4) to directly compute μY.
μY = ∫a²^b² y fY(y) dy = ∫a²^b² √y/(2(b – a)) dy = (b³ – a³)/(3(b – a)) = (a² + ab + b²)/3.
(e) Use (ii) of Example 5.1 and Result 1 to compute μY without the pdf of Y.
μY = E(Y) = E(X²) = σX² + μX², where, from a standard table: μX = (a + b)/2 and σX² = (b – a)²/12.
Hence,
μY = (b – a)²/12 + (a + b)²/4 = (a² + ab + b²)/3, in agreement with (d).
Remark. The approach in (e) is far easier, mathematically, than the method in (c)/(d). It takes direct advantage of (5.1), without the need for the pdf of Y. □
Example 5.3 Suppose Z ~ N(0,1). Let X = aZ + b for constants a > 0 and b.
(a) Use the result (v) of Example 5.1 to show that μX = b, and that σX² = a².
μX = E(aZ + b) = aμZ + b = b, and σX² = E[(X – μX)²] = E(a²Z²) = a²E(Z²) = a²(σZ² + μZ²) = a².
Note: These results hold for any pdf; for any Z, μX = aμZ + b and σX² = a²σZ². Nowhere did we use the normality of Z.
(b) Use the concept of equivalent events to express the cdf for X in terms of the cdf for Z.
FX(x) = Pr(X ≤ x) = Pr(aZ + b ≤ x) = Pr(Z ≤ (x – b)/a) = FZ((x – b)/a).
Remark. It follows that we can use the N(0,1) cdf to compute probabilities related to X, should we desire. □
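As a small Matlab illustration (using the hypothetical values a = 2 and b = 10, and assuming the Statistics Toolbox function normcdf is available), the probability Pr(X ≤ 13) can be computed either from the N(0,1) cdf or directly:
a = 2;  b = 10;  xval = 13;                % hypothetical values: X = aZ + b ~ N(10, 4)
p1 = normcdf((xval - b)/a);                % using the N(0,1) cdf, F_Z((x - b)/a)
p2 = normcdf(xval, b, a);                  % direct cdf of X (mean b, standard deviation a)
fprintf('%.6f  %.6f\n', p1, p2)            % the two values agree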
Example A1. (added 2/13/14 for Valentine’s Day fun ☺).
Let X be a random variable with pdf fX(x). Then from (iv) above, the characteristic function for X is:
ΦX(ω) = E(e^(iωX)) = ∫ e^(iωx) fX(x) dx.
It also happens that the Fourier transform of a function of x, call it g(x), defined on (–∞, ∞), is defined as:
G(ω) = ∫ g(x) e^(–iωx) dx.
Hence, we see that ΦX(ω) is simply the Fourier transform of fX(x) [with the slight difference in that it is computed at –ω instead of ω].
Question: So, why is this important?
Answer: It is important because the Fourier transform of a function of x does not require that function to have a model form. If we have a model form for fX(x), then the form for ΦX(ω) can be found in most tables associated with that model. However, suppose that we do not have a model form for fX(x). In particular, suppose that we have only a scaled histogram-based estimate of fX(x). We can still compute the Fourier transform of this function.
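For example, here is a minimal Matlab sketch (the data x are hypothetical, generated only to stand in for measurements with no obvious model form). Averaging e^(iωx) over the samples gives an estimate of ΦX(ω) directly from data, with no model form for fX in sight. (The line computing PhiX uses implicit expansion, available in R2016b and later.)
x = randn(1e5,1).^2;                       % hypothetical data with no assumed model form
w = linspace(-5, 5, 201);                  % frequencies at which to evaluate Phi_X
PhiX = mean(exp(1i*w.*x));                 % empirical characteristic function: average of e^{i w x_k}
plot(w, real(PhiX), w, imag(PhiX))
xlabel('\omega'); legend('Re \Phi_X', 'Im \Phi_X')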
Question: Fair enough. But why, pray tell, would I want ΦX(ω)?
Answer: The answers are ‘legion’. But here are two motivational ones:
Answer 1: The structure of ΦX(ω) might be ‘better-behaved’ than that of fX(x), in the sense that it might admit a better model than that afforded by fX(x).
Answer 2: This answer is more mathematical, but also more commonplace. Suppose that X and Y are independent, and that Z = X + Y. The question here is: How do fX(x) and fY(y) relate to fZ(z)? The answer is:
fZ(z) = (fX * fY)(z) = ∫ fX(x) fY(z – x) dx.
The operation * is called convolution. Clearly, it is an operation that involves computing an integral for each specified value of z (a lot of work!). However, it turns out that convolution in the x-domain is equivalent to multiplication in the ω-domain:
ΦZ(ω) = ΦX(ω) ΦY(ω).
Hence, a simple alternative to using convolution to get is to take the Fourier transforms of and , multiply them, and then take the inverse Fourier transform of that product. The Fourier transform algorithm is common in many data analysis programs, and so the computation requires only four simple commands: (i) the FT of , (ii) the FT of , (iii) multiplication of these FT’s, and (iv) the IFT of this product.
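Here is a minimal Matlab sketch of those four steps. The two pdf’s below are hypothetical, and zero-padding is used so that the FFT-based product reproduces the full (non-circular) convolution; the result matches the direct convolution integral and has unit area.
dx = 0.01;  x = -10:dx:10;  N = numel(x);                    % common grid for both pdf's
fX = exp(-x.^2/2)/sqrt(2*pi);                                % hypothetical f_X: N(0,1)
fY = (x >= 0 & x <= 2)/2;                                    % hypothetical f_Y: Uniform(0,2)
fZ_conv = conv(fX, fY)*dx;                                   % direct convolution integral (a lot of work)
fZ_fft  = real(ifft(fft(fX, 2*N-1).*fft(fY, 2*N-1)))*dx;     % (i),(ii) FT of each; (iii) multiply; (iv) IFT
z = 2*x(1) : dx : 2*x(end);                                  % grid on which f_Z(z) is returned
fprintf('max difference: %.2e   area under f_Z: %.4f\n', ...
        max(abs(fZ_conv - fZ_fft)), sum(fZ_fft)*dx);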
Example A2.
Numerical Example coming to a theater near you SOON ☺
5.2 Expected Values of Functions of 2-D X=(X1, X2)
Consider the following examples of the function g(*):
(i) g(x) = x = (x1, x2):  E(X) = (E(X1), E(X2)) = (μX1, μX2).
In words then, what we mean by the expected value of a vector-valued random variable is the vector of the means.
(ii) g(x) = x1x2 :  E(X1X2) = ∫∫ x1x2 fX1X2(x1, x2) dx1 dx2.
Definition 5.2 Random variables X1 and X2 are said to be uncorrelated if: E(X1 X2 ) = E(X1 )E(X2 ). (5.4)
They are said to be independent if: fX1X2(x1, x2) = fX1(x1) fX2(x2) for all (x1, x2). (5.5)
Theorem 5.1 If random variables X1 and X2 are independent, then they are uncorrelated.
Proof: E(X1X2) = ∫∫ x1x2 fX1X2(x1, x2) dx1 dx2 = ∫∫ x1x2 fX1(x1) fX2(x2) dx1 dx2 = [∫ x1 fX1(x1) dx1][∫ x2 fX2(x2) dx2] = E(X1)E(X2). □
Thus, we see that if two random variables are independent, then they are uncorrelated. However, the converse is not necessarily true. Uncorrelatedness only means that they are not related in a linear way. This is important! Many engineers assume that because X and Y are uncorrelated, they have nothing to do with each other (i.e. they are independent). It may well be that they are, in fact, very related to one another, as is illustrated in the following example.
Example 5.4 Modern warfare in urban areas requires that projectiles fired into those areas be sufficiently accurate to minimize civilian casualties. Consider the act of noting where in a circular target a projectile hits. This can be defined by the 2-D random variable (R, Ф) where R is the radial distance from the center and Ф is the angle relative to the horizontal right-pointing direction. Suppose that R has a pdf that is uniform over the interval [0, ro] and that Ф has a pdf that is uniform over the interval [0, 2π). Thus, the marginal pdf’s are:
fR(r) = 1/ro for 0 ≤ r ≤ ro (and 0 otherwise), and fФ(φ) = 1/(2π) for 0 ≤ φ < 2π (and 0 otherwise).
Furthermore, suppose that we can assume that these two random variables are statistically independent. Then the joint pdf for (R, Ф) is: fRФ(r, φ) = fR(r) fФ(φ) = 1/(2π ro) for 0 ≤ r ≤ ro and 0 ≤ φ < 2π.
(a) The point of impact of the projectile may be expressed in polar form as W = Re^(iФ). Find the mean of W.
Solution: Since W = g(R, Ф), where g(r, φ) = re^(iφ), we have
E(W) = ∫0^ro ∫0^2π r e^(iφ) (1/(2π ro)) dφ dr = [∫0^ro (r/ro) dr]·[(1/(2π)) ∫0^2π e^(iφ) dφ] = (ro/2)·0 = 0.
(b) The point of impact may also be expressed in Cartesian coordinates; that is: X = Rcos(Ф) and Y = Rsin(Ф). Clearly, X and Y are not independent. In fact, X2+Y2 = R2. Show that they are, nonetheless, uncorrelated.
Solution: We need to show that E(XY) = E(X)E(Y). To this end,
E(X) = E(Rcos(Ф)) = E(R)E(cos(Ф)) = (ro/2)·(1/(2π)) ∫0^2π cos(φ) dφ = 0,
E(Y) = E(Rsin(Ф)) = E(R)E(sin(Ф)) = (ro/2)·(1/(2π)) ∫0^2π sin(φ) dφ = 0, and
E(XY) = E(R²cos(Ф)sin(Ф)) = E(R²)·(1/(2π)) ∫0^2π cos(φ)sin(φ) dφ.
To compute the value of the rightmost integral, one can use a table of integrals, a good calculator, or the trigonometric identity sin(α+β) = sin(α)cos(β) + cos(α)sin(β). We will use this identity for the case where α = β = φ. Thus, cos(φ)sin(φ) = ½ sin(2φ). From this, it can be easily shown that the rightmost integral is zero. Hence, we have shown that X and Y are, indeed, uncorrelated.
Before we leave this example, it might be helpful to simulate projectile hit points. To this end, we will (only for convenience) choose ro = 1. Then, to simulate a measurement, r, we use r = rand(1,1). Similarly, to simulate a measurement, φ, we use φ = 2π rand(1,1). Consequently, we now have simulated measurements x = rcos(φ) and y = rsin(φ). The scatter plot below shows 1000 simulations associated with (X, Y).
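In Matlab, the simulation just described amounts to the following (a sketch; the plotting commands may be adjusted to taste).
n = 1000;  ro = 1;
r   = ro*rand(n,1);                        % n simulated values of R ~ Uniform(0, ro)
phi = 2*pi*rand(n,1);                      % n simulated values of Phi ~ Uniform(0, 2*pi)
x = r.*cos(phi);  y = r.*sin(phi);         % Cartesian coordinates of the hit points
plot(x, y, '.'); axis equal                % scatter plot of (X, Y)
rho = corrcoef(x, y);  disp(rho(1,2))      % sample correlation coefficient (near zero)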
Figure 5.1 Simulations of (X, Y) = (Rcos(Ф), Rsin(Ф) ).
Notice that there is no suggestion of a linear relationship between X and Y. In fact, the sample correlation coefficient (computed via corrcoef(x,y)) is -0.036. Even so, it is worth repeating: X and Y are not independent. Specifically, since X² + Y² = R² ≤ ro², knowledge of X restricts the possible values of Y; given X = x, we must have |Y| ≤ √(ro² – x²). And so, FY|X(y | x) ≠ FY(y).
In words, this equation says that the conditional cdf for Y is not equal to the unconditional cdf for Y. Hence, X and Y are not independent. □
Application to Linear Models – We begin this topic with a method of modeling a quantity, y, as a linear function of a quantity, x. This method involves only data. No random variables or related theory is present.
Linear Prediction via Least Squares: Suppose that we have a data set {(xk, yk), k = 1, …, n}, and we desire a model to be able to predict a value, y, given a value x. Consider the linear model:
ŷ = mx + b.
The goal of the method of least squares is to find the expressions for the unknown parameters m and b that minimize the sum of squared errors:
SSE(m, b) = Σk ek²,
where ek = yk – ŷk = yk – (mxk + b). To this end, we express SSE(m, b) as:
SSE(m, b) = Σk (yk – mxk – b)².
In this expression the ‘variables’ are m and b. The elements {(xk, yk)} are known data. To find the expressions for m and b that minimize SSE(m, b), we need to set the partial derivatives ∂SSE/∂m and ∂SSE/∂b each equal to zero:
∂SSE/∂m = –2 Σk xk(yk – mxk – b) = 0   and   ∂SSE/∂b = –2 Σk (yk – mxk – b) = 0.
We can simplify the notation by dividing each equation by n. Writing x̄ = (1/n)Σk xk and ȳ = (1/n)Σk yk, we then have
m (1/n)Σk xk² + b x̄ = (1/n)Σk xkyk   and   m x̄ + b = ȳ.
From the second equation we have:
b = ȳ – m x̄.     (5.6a)
Substituting this into the first equation gives
m (1/n)Σk xk² + (ȳ – m x̄) x̄ = (1/n)Σk xkyk. Hence,
m = [ (1/n)Σk xkyk – x̄ ȳ ] / [ (1/n)Σk xk² – x̄² ].     (5.6b)
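Here is a minimal Matlab sketch of (5.6), using a small hypothetical data set; the closed-form slope and intercept agree with Matlab’s built-in polyfit.
x = [1 2 3 4 5 6 7 8]';                                % hypothetical data
y = [2.1 2.9 4.2 4.8 6.1 6.9 8.2 8.8]';
xbar = mean(x);  ybar = mean(y);
m = (mean(x.*y) - xbar*ybar) / (mean(x.^2) - xbar^2);  % (5.6b)
b = ybar - m*xbar;                                     % (5.6a)
p = polyfit(x, y, 1);                                  % built-in least squares line
fprintf('m = %.4f, b = %.4f  (polyfit: %.4f, %.4f)\n', m, b, p(1), p(2));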
The connection between least squares and random variables: The LS method that produced the expressions (5.6) is purely numerical. There are no random variables in sight. However, we can relate it to random variables. Specifically, consider the 2-D random variable (X,Y), and let {(Xk, Yk), k = 1, …, n} be data collection random variables associated with the generic (X,Y). Then the model of concern is:
Ŷ = mX + b.     (5.7)
Consistent with this setting, we have: x̄ → μ̂X = (1/n)Σk Xk, ȳ → μ̂Y = (1/n)Σk Yk, (1/n)Σk xkyk → (1/n)Σk XkYk, and (1/n)Σk xk² → (1/n)Σk Xk². We also know that E(XY) = σXY + μXμY, and E(X²) = σX² + μX². And so, in this random variable setting equations (5.6) become
m̂ = σ̂XY / σ̂X²   and   b̂ = μ̂Y – m̂ μ̂X,     (5.8)
where σ̂XY = (1/n)Σk XkYk – μ̂Xμ̂Y and σ̂X² = (1/n)Σk Xk² – μ̂X².
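In sample-moment form, (5.8) says that the slope estimator is the sample covariance of (X,Y) divided by the sample variance of X. As a quick numerical check (a sketch, continuing with the hypothetical data vectors x and y from the sketch above):
C = cov(x, y);                            % 2x2 sample covariance matrix of the data
m_hat = C(1,2)/C(1,1);                    % slope: sample covariance over sample variance of x
b_hat = mean(y) - m_hat*mean(x);          % intercept, as in (5.8)
% Note: cov uses the 1/(n-1) normalization rather than 1/n, but that factor
% cancels in the ratio, so m_hat agrees exactly with m from (5.6b).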
The value of the random variable setting is that it allows one to use tools from this course to evaluate how good the estimators of the parameters m and b are!
Example 5.5 Suppose that you have decided to carry out an analysis of the relationship between vehicle speed and cabin noise level. Specifically, you are interested in being able to predict the cabin noise level from vehicle speed. To this end, you collected 100 samples of speed/noise data. A scatter plot of the data is given below. The least squares prediction line is included.
Figure 5.2 Scatter plot of speed (mph) versus cabin noise (dBA) for n = 100 measurements. The prediction model is:
ŷ = m̂x + b̂, with m̂ and b̂ computed from (5.6).
Now to the crucial question: How good IS the prediction model?
There are two ways to respond to this question. One is to search textbooks for possible formulas related to the uncertainty of the estimators (m̂, b̂). The other is to run a large number of simulations, assuming that the estimates associated with the model are the true values, and that (X,Y) is a 2-D normal random variable. The former method is far beyond the scope of this course, and would typically require a graduate student level of understanding of probability and statistics. The simulation method is entirely appropriate within the scope of this course. For this reason, we will carry out the simulation method.
For each of 10,000 simulations, we will use the Matlab command mvnrnd to generate a sample of 100 measurements of (X,Y). This will result in 10,000 measurements of the 2-D random variable (m̂, b̂). Having these will allow us to estimate means, standard deviations, and even the correlation coefficient. The shapes of the pdf’s for m̂ and b̂, and a scatter plot describing their relation to each other, are shown below.
Figure 5.3 The shapes of the pdf’s for m̂ and b̂, and a scatter plot describing their relation to each other.
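A minimal Matlab sketch of this simulation is given below. The ‘true’ mean vector and covariance matrix are hypothetical stand-ins (the values estimated from the speed/noise data are not reproduced here), and mvnrnd requires the Statistics Toolbox.
nsim = 10000;  n = 100;
mu    = [60 80];                           % hypothetical true means of (X,Y)
Sigma = [100 35; 35 25];                   % hypothetical true covariance of (X,Y)
mhat = zeros(nsim,1);  bhat = zeros(nsim,1);
for k = 1:nsim
    xy = mvnrnd(mu, Sigma, n);             % one simulated data set of n measurements
    C  = cov(xy);
    mhat(k) = C(1,2)/C(1,1);               % slope estimator, as in (5.8)
    bhat(k) = mean(xy(:,2)) - mhat(k)*mean(xy(:,1));
end
subplot(1,3,1); histogram(mhat)            % shape of the pdf of the slope estimator
subplot(1,3,2); histogram(bhat)            % shape of the pdf of the intercept estimator
subplot(1,3,3); plot(mhat, bhat, '.')      % scatter plot of the estimator pairs
r = corrcoef(mhat, bhat);  disp(r(1,2))    % strongly negative sample correlation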
It should be no surprise that the estimated correlation coefficient between m̂ and b̂ is so large in magnitude (-0.99). After all, one is derived almost directly from the other. The model (5.7) presumes perfect knowledge of (m,b); whereas, in fact, we use estimators in the model:
Ŷ = m̂X + b̂.
Since b̂ = μ̂Y – m̂ μ̂X, we can express this model as: Ŷ = μ̂Y + m̂(X – μ̂X).
Now, suppose we take a measurement, x. We then have: Ŷ = μ̂Y + m̂(x – μ̂X).
It follows that: E(Ŷ) = E(μ̂Y) + E[m̂(x – μ̂X)] = μY + x E(m̂) – E(m̂ μ̂X).
From analysis of the scatter plot of m̂ versus μ̂X we can deduce that they are uncorrelated. Hence, E(m̂ μ̂X) = E(m̂)E(μ̂X) = m μX. And so the conditional mean value is:
E(Ŷ | x) = μY + x m – m μX = μY + m(x – μX).
The simulations gave average values for these estimators that are essentially the true values that were used to run the simulations. What was new here was the discovery that: