Sequential Monte Carlo Signal Processing
8.5.1 Sequential Importance Sampling
Importance sampling is one of the most well-known and
elementary Monte Carlo techniques. Suppose that we want to make an inference
about some random quantity X ~ p (X) using the
Monte Carlo method. Sometimes drawing samples directly from p(X) is difficult,
but it may be easier (or otherwise advantageous) to draw samples from a trial density, say, q(X). Note that
the desired inference can be written as
Equation 8.70
Equation 8.71
where
Equation 8.72
is called the importance weight.
In importance sampling, we draw samples X(1), X(2), ..., X(n)
according to the trial distribution q(·). We then
approximate the inference (8.71) by
Equation 8.73
This technique is widely used, for example, for reducing the
sample-size requirements in BER estimation.
However, it is usually difficult to design a good trial density
function in high-dimensional problems. One of the most useful strategies in
these problems is to build up the trial density sequentially. Suppose that we
can decompose X as X = (x1, ..., xd), where each of the xj may
be either a scalar or a vector. Then our trial density can be constructed as
Equation 8.74
by which we hope to obtain some guidance from the target
density while building up the trial density. Corresponding to the decomposition
of X, we can rewrite the target density
as
Equation 8.75
and the importance weight as
Equation 8.76
Equation (8.76) suggests
a recursive way of computing and monitoring the importance weight. That is, by
denoting Xt = x1, ..., (xt)
(thus, Xd
X), we have
Equation 8.77
Then wd (Xd) is
equal to w(X) in (8.76). Potential advantages of this recursion and (8.75) are that we can stop generating
further components of X if the partial
weight derived from the sequentially generated partial sample is too small, and
that we can take advantage of pt(xt|Xt-1)
in designing qt(xt|Xt-1).
In other words, the marginal distribution p(Xt) can
be used to guide the generation of X.
Although the idea above sounds interesting, the trouble is that
the decomposition of p(X) as in (8.75) and that of w(X) as in (8.76) are not practical at all! The reason is that in
order to get (8.75), one needs to have
the marginal distribution
Equation 8.78
whose computation involves integrating out components xt+1,
... xd in p(X) and is as difficult as—or even more difficult
than—the original problem.
To carry out the sequential sampling idea, we need to introduce
another layer of complexity. Suppose that we can find a sequence of auxiliary distributions,
so that pt(Xt) is
a reasonable approximation to the marginal distribution pt(Xt) for
t = 1,..., d – 1
and pd(X) = p(X). We emphasize
that {pt(Xt)}
are required to be known only up to a normalizing constant, and they serve only
as "guides" to our construction of the entire sample X = (x1,...,xd).
The sequential importance sampling (SIS) method can then be defined as noted in the following
recursive procedure.
Algorithm 8.7: [Sequential
importance sampling (SIS)] For t = 2,...,d:
Equation 8.79
and let wt = wt-1 ut.
In the SIS step, we call ut an incremental
weight. It is easy to show that Xt is
properly weighted by wt with respect
to pt provided that Xt-1 is
properly weighted by wt-1 with respect to pt-1. Thus,
the entire sample X obtained in this
sequential fashion is properly weighted by the final importance weight, wd, with respect to the target density p(X). One reason for
the sequential buildup of the trial density is that it breaks a difficult task
into manageable pieces. The SIS framework is particularly attractive, as it can
use the auxiliary distributions p1, p2,..., pd to help
construct more efficient trial distributions:
Equation 8.80
Then the incremental weight becomes
Equation 8.81
-
When we observe that wt is getting too small, we can choose to
reject the sample halfway and restart. In this way we avoid wasting time on
generating samples that are doomed to have little effect in the final
estimation. However, as an outright rejection incurs bias, the rejection control
techniques can be used to correct such bias [279].
-
Another problem with SIS is that the resulting importance
weights are still very skewed, especially when d
is large. An important recent advance in sequential Monte Carlo to address this
problem is the resampling technique [160, 276].
8.5.2 SMC for Dynamical Systems
Consider the following dynamical system modeled in state-space
form as
Equation 8.82
where zt, yt,
ut, and vt are,
respectively, the state variable, the observation, the state noise, and the
observation noise at time t. These quantities can
be either scalars or vectors.
Let Zt = (z0, z1,..., zt) and let Yt =
(y0, y1,..., yt).
Suppose an online inference of Zt is of interest; that is, at current time t we wish to make a timely estimate of a function of
the state variable Zt, say h(Zt),
based on the currently available observation, Yt.
From Bayes' formula, we realize that the optimal solution to this problem is
Equation 8.83
In most cases an exact evaluation of this expectation is
analytically intractable because of the complexity of such a dynamical system.
Monte Carlo methods provide us with a viable alternative to the required
computation. Specifically, if we can draw m
random samples
from the distribution p(Zt|Yt), we
can approximate E{h(Zt)|Yt}
by
Equation 8.84
Very often, direct simulation from p(Zt|Yt) is
not feasible, but drawing samples from some trial distribution is easy. In this
case we can use the idea of importance sampling discussed above. Suppose that a
set of random samples
is generated from the trial
distribution q(Zt|Yt). By
associating the weight
Equation 8.85
with the sample
, we can approximate the quantity of
interest, E{h(Zt)|Yt},
as
Equation 8.86
with
Equation 8.87
The pair (
), j =
1, ..., m, is called a properly weighted sample with respect to distribution
p(Zt/Yt). A
trivial but important observation is that the
(one of
the components of
is also properly weighted by the
with respect to the marginal distribution p(zt/Yt).
Another possible estimate of E{h(Zt)|Yt}
is
Equation 8.88
The main reasons for preferring the ratio estimate (8.86) to the unbiased estimate (8.88) in an importance sampling framework
are that the estimate (8.86) usually has
a smaller mean-square error than that of (8.88), and that the normalizing constants of both the
trial and target distributions are not required in using (8.86) (where these constants are canceled out); in such
cases the weights {
} are evaluated only up to a
multiplicative constant. For example, the target distribution p(Zt|Yt) in
a typical dynamical system (and many Bayesian models) can be evaluated easily up
to a normalizing constant (e.g., the likelihood multiplied by a prior
distribution), whereas sampling from the distribution directly and evaluating
the normalizing constant analytically are impossible.
To implement Monte Carlo techniques for a dynamical system, a
set of random samples properly weighted with respect to p(Zt|Yt) is
needed for any time t. Because the state equation
in system (8.82) possesses a Markovian
structure, we can implement a recursive importance sampling strategy, which is
the basis of all sequential Monte Carlo techniques [277]. Suppose that a set of properly
weighted samples
with respect to p(Zt–1|Yt–1)
is given at time (t–1).
A sequential Monte Carlo filter
generates from the set a new one,
, which is properly weighted at time
t with respect to p(Zt|Yt).
The algorithm is described as follows.
Algorithm 8.8: [Sequential
Monte Carlo filter for dynamical systems] For j =
1,... , m:
Equation 8.89
The algorithm is initialized by drawing a set of i.i.d. samples
from p(z0/y0). When y0 represents the "null" information,
p(z|y0) corresponds to the prior
distribution of z0. The samples
and weights drawn by the algorithm above are characterized by the following
result, the proof of which is given in the Appendix (Section 8.7.3).
Proposition 8.1: The weighted samples generated by Algorithm 8.8 satisfy
Equation 8.90
Equation 8.91
The result above, together with the law of large numbers,
implies that
Equation 8.92
There are a few important issues regarding the design and
implementation of a sequential Monte Carlo filter, such as the choice of the
trial distribution q(·) and the use of resampling
(see Section 8.5.3). Specifically, a
useful choice of the trial distribution
, for the state-space model (8.82) is of the form
Equation 8.93
where in (8.93) we used
the facts that
Equation 8.94
Equation 8.95
both of which follow directly from the state-space model (8.82). For this trial distribution, the
importance weight is updated according to
Equation 8.96
Equation 8.97
where (8.96) follows
from the fact that
Equation 8.98
and the last equality is due to the conditional independence
property of the statespace model (8.82).
See [277] for the
general SMC framework and a detailed discussion on various implementation
issues. SMC techniques have been employed to tackle a number of problems in
wireless communications and networks, including blind detection in fading
channels [76], blind
detection in OFDM systems [593], and joint mobility tracking and
hand-off detection in cellular networks [595].
8.5.3 Resampling Procedures
The importance sampling weight
measures
the quality of the corresponding imputed sequence
. A
relatively small weight implies that the sample is drawn far from the main body
of the posterior distribution and has a small contribution in the final
estimation. Such a sample is said to be ineffective. If there are too many ineffective samples,
the Monte Carlo procedure becomes inefficient. This can be detected by observing
a large coefficient of variation in the
importance weight. Suppose that
is a sequence of importance weights.
Then the coefficient of variation, nt, is
defined as
Equation 8.99
with
Equation 8.100
Note that if the samples are drawn exactly from the target
distribution, all the weights are equal, implying that vt = 0. A
large coefficient of variation in the importance weights indicates ineffective
samples. It is shown in [232] that the importance weights
resulting from an SMC filter form a martingale sequence. As more and more data
are processed, the coefficient of variation of the weights increases—that is,
the number of ineffective samples increases—rapidly.
A useful method for reducing ineffective samples and enhancing
effective ones is resampling, which was suggested
in [160, 276] under the SMC
setting. Roughly speaking, resampling allows those "bad" samples (with small
importance weights) to be discarded and those "good" ones (with large importance
weights) to replicate so as to accommodate the dynamic change of the system.
Specifically, let
be the original properly weighted
samples at time t. A residual resampling strategy forms a new set of
weighted samples
according to the following
algorithm (assume that
).
Algorithm 8.9: [Residual
resampling]
The correctness of the residual resampling procedure above is
stated by the following result, whose proof is given in the Appendix (Section
8.7.4).
Proposition 8.2: The samples drawn by the residual resampling procedure (Algorithm 8.9) are properly weighted with
respect to p(Zt|Yt)
for m

Alternatively, we can use the following simple resampling
procedure, which also produces properly weighted samples with respect to p(Zt|Yt):
-
For j = 1,..., m, draw i.i.d. random numbers lj from the
set {1,2,..., m}, with probability
distribution
Equation 8.101
Equation 8.102
In practice, when small to modest m is used (we used m =
50 in our simulations), the resampling procedure can be seen as trading off
between bias and variance. That is, the new samples with their weights resulting
from the resampling procedure are only approximately proper, which introduces
small bias in Monte Carlo estimation. On the other hand, however, resampling
greatly reduces Monte Carlo variance for the future samples.
Resampling can be done at any time. However, resampling too
often adds computational burden and decreases "diversities" of the Monte Carlo
filter (i.e., it decreases the number of distinctive filters and loses
information). On the other hand, resampling too rarely may result in a loss of
efficiency. It is thus desirable to give guidance on when to do resampling. A
measure of the efficiency of an importance sampling scheme is the effective sample size
, defined
as
Equation 8.103
Heuristically,
reflects the equivalent size of a
set of i.i.d. samples for the set of m weighted
ones. It is suggested in [277] that resampling should be
performed when the effective sample size becomes small (e.g.,
).
Alternatively, one can conduct resampling at every fixed-length time interval
(say, every five steps).
8.5.4 Mixture Kalman Filter
Many dynamical system models belong to the class of conditional
dynamical linear models (CDLMs) of the form
Equation 8.104
where ut ~ Nc(0,I),
vt ~ Nc(0,I), and
lt is a
random indicator variable. The matrices Flt, Glt, Hlt, and Klt are known, given lt. In this model,
the state variable zt corresponds to (xt,lt).
We observe that for a given trajectory of the indicator lt in a CDLM, the
system is both linear and Gaussian, for which the Kalman filter provides the
complete statistical characterization of the system dynamics. Recently, a novel
SMC method, the mixture Kalman filter (MKF), was developed in [75] for online filtering and
prediction of CDLMs; it exploits the conditional Gaussian property and uses a
marginalization operation to improve the algorithmic efficiency. Instead of
dealing with both xt and lt, the MKF draws Monte Carlo samples only in the
indicator space and uses a mixture of Gaussian distributions to approximate the
target distribution. Compared with the generic SMC method, the MKF is
substantially more efficient (e.g., it gives more accurate results with the same
computing resources). However, the MKF often needs more computational power for
its proper implementation, as the formulas required are more complicated.
Additionally, the MKF requires the CDLM structure, which is not applicable to
all problems of interest.
Let Yt = (y0, y1,..., yt) and
let Lt = (l0,
l1,..., lt). By recursively generating a set of properly
weighted random samples
to represent p(Lt|Yt),
the MKF approximates the target distribution p(xt/Yt) by
a random mixture of Gaussian distributions
Equation 8.105
where
is obtained by implementing a
Kalman filter for the given indicator trajectory
. Thus, a
key step in the MKF is the production at time t
of the weighted samples of indicators,
, based on the set of samples,
, at the previous time (t - 1) according
to the following algorithm.
Algorithm 8.10: [Mixture
Kalman filter for conditional dynamical linear models] For j = 1,...m:
Equation 8.106
The MKF can be extended to handle the partial CDLM, where the state variable has a linear
component and a nonlinear component. See [75] for a detailed treatment of the
MKF and the extended MKF.