
yields values closely approximating their corresponding targets. In other words, the aim is to develop a function $f$ that can accurately map a $D$-dimensional input vector $\mathbf{x}_n \in \mathbb{R}^D$ to a target output scalar $y_n \in \mathbb{R}$, i.e. a predictor $f: \mathbf{x} \rightarrow y$. The fit on $y$ can be thought of as an interpolation given by $f$.

In a statistical sense, the predictor is a function learnt from the data, albeit conditioned on certain assumptions. For example, a linear interpolant may be assumed to fit the data, which will likely result in a poor fit if the true underlying trend is strongly nonlinear. An improved fit can be achieved by increasing the order of the function; that is, a quadratic function will likely provide a better fit than the linear trend, a cubic will be better than the quadratic, and so on. At each step, the fit improves while the function becomes increasingly complex, requiring additional parameters to account for the higher-order terms in the polynomial. Continuing to add terms, however, can reach the point where the function is said to have become too complex, and the model begins to (over)fit noise in the data. Somewhere in this process of adding terms to the predictor, an optimal level of complexity is attained, whereby the complexity of the data is matched.

Whether the learning problem is one of regression or classification, finding an optimal predictor is typically approached by minimising a measure of discrepancy (or loss) between the true target values and the corresponding outputs of the predictor. To illustrate, let $z = (\mathbf{x}, y)$ denote an input-output pair in the training set, and let $Q(z, \theta)$, $\theta \in \Theta$, be the set of loss functions. Given a set of $n$ i.i.d. samples $z_1, \ldots, z_n$, the goal of predictive learning is to find a function $Q(z, \theta^{*})$ that minimises the expected risk,
$$
R(\theta) = \int Q(z, \theta)\, p(z)\, \mathrm{d}z \qquad (1)
$$
where $Q(z, \theta) = L(y, f(\mathbf{x}, \theta))$ is some loss function, and the integral is evaluated with respect to some (unknown) joint probability distribution $p(z)$. One should note that the set $\Theta$ to which $\theta$ belongs can be a set of scalar quantities, vectors, or abstract elements [5]; that is, the function space is not limited to parametric models. The current framework assumes that the joint distribution is true but unknown, and that the only information made available is samples drawn from $p(z)$.

In order to minimise the risk functional $R(\theta)$, the following inductive principle is considered,
$$
R_{\mathrm{emp}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} Q(z_i, \theta) \qquad (2)
$$
where $R_{\mathrm{emp}}(\theta)$ is referred to as the empirical risk, which is evaluated with a finite sample of size $n$. Given that the empirical risk does not require knowledge of the underlying generative distribution, the idea is to seek an estimate providing the minimum empirical risk (2), in the hope that such an estimate will also minimise the true risk (1). This approach is called the Empirical Risk Minimisation inductive principle (ERM principle) [4].

While minimising the empirical risk (2) promotes learning, a compromise must still be maintained: the risk should be made small without producing an overly complex model. This trade-off is crucial to ensuring good generalisation. A possible solution to this issue is to estimate the expected risk as a function of the empirical risk, penalised by some measure of model complexity [6]; that is,
$$
R(\theta) \cong r\!\left(\frac{h}{n}\right) R_{\mathrm{emp}} \qquad (3)
$$
where the empirical risk is adjusted by a monotonically-increasing function $r$, defined by the ratio of some capacity measure $h$ over the sample size $n$ [7].
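As a rough numerical sketch of the ideas above (this illustration is not part of the original text), the snippet below performs the ERM step (2) with polynomial predictors of increasing order under a squared-error loss. The synthetic data, the sinusoidal generative model, and the use of a held-out sample as a surrogate for the expected risk (1) are assumptions made purely for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Illustrative 1-D data: a nonlinear trend observed with additive noise.
x_train = np.linspace(-1.0, 1.0, 20)
y_train = np.sin(2.5 * x_train) + 0.2 * rng.standard_normal(x_train.shape)

# A large held-out sample acts as a stand-in for the (unknown) expected risk R(theta).
x_test = np.linspace(-1.0, 1.0, 500)
y_test = np.sin(2.5 * x_test) + 0.2 * rng.standard_normal(x_test.shape)

def empirical_risk(y_true, y_pred):
    # Empirical risk (2) under a squared-error loss Q(z, theta) = (y - f(x, theta))^2.
    return float(np.mean((y_true - y_pred) ** 2))

# ERM with polynomials of increasing order: the empirical risk keeps falling as
# terms are added, while the held-out risk stops improving and eventually worsens
# once the model starts fitting the noise.
for order in (1, 3, 9, 15):
    model = Polynomial.fit(x_train, y_train, deg=order)   # least-squares fit = the ERM step
    r_emp = empirical_risk(y_train, model(x_train))
    r_out = empirical_risk(y_test, model(x_test))
    print(f"order {order:2d}: empirical risk {r_emp:.4f}, held-out risk {r_out:.4f}")
```

The progression of the two printed columns mirrors the trade-off discussed above: a level of complexity matched to the data minimises the held-out estimate of the risk, not the empirical risk itself.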
It turns out that the criterion for defining the penalisation function arises naturally in SLT. The foundation of this criterion derives from demonstrating, in a rigorous mathematical framework, that minimising the empirical risk (2) can in fact yield a small value of the actual risk (1). In short, if one can show that the empirical risk converges uniformly to the true expected risk, then one can be (almost) certain that the ERM principle leads to models that generalise. This condition is presented in more detail in the following definition:

Definition 1 (Key Theorem of Learning Theory [4]). For a finite set of bounded loss functions, the ERM inductive principle is consistent if, and only if, the empirical risk converges uniformly to the true risk,
$$
\lim_{n \to \infty} P\left\{ \sup_{\theta \in \Theta} \left| R(\theta) - R_{\mathrm{emp}}(\theta) \right| > \epsilon \right\} = 0, \quad \forall \epsilon > 0 \qquad (4)
$$

In the light of this theorem, a bound can be determined on the expected risk, whereby the penalised function is defined. In particular, for regression problems, the implementation of the ERM inductive principle defines the generalisation ability of a learning machine, as demonstrated by the following result:
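Before that result, the uniform-convergence condition (4) can be made concrete with a second small sketch (again a hypothetical illustration, not from the original text): for a finite set of candidate parameters and a bounded (clipped) squared-error loss, the supremum of $|R(\theta) - R_{\mathrm{emp}}(\theta)|$ over the candidates shrinks as the sample size grows. The linear generative model and the Monte Carlo approximation of the true risk are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)

# A finite set of candidate parameters (hence a finite set of loss functions, as in
# Definition 1): predictors f(x, theta) = theta * x with a clipped squared-error loss.
thetas = np.linspace(-2.0, 2.0, 21)

def losses(x, y):
    # Q(z, theta) for every candidate theta; clipping keeps the loss bounded.
    residuals = y - np.outer(thetas, x)            # shape (len(thetas), len(x))
    return np.minimum(residuals ** 2, 10.0)

def sample(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = 0.7 * x + 0.2 * rng.standard_normal(n)
    return x, y

# Approximate the true risk R(theta) with a very large reference sample.
x_ref, y_ref = sample(200_000)
true_risk = losses(x_ref, y_ref).mean(axis=1)

# sup_theta |R(theta) - R_emp(theta)| shrinks as n grows, mirroring condition (4).
for n in (10, 100, 1_000, 10_000):
    x_n, y_n = sample(n)
    emp_risk = losses(x_n, y_n).mean(axis=1)
    print(f"n = {n:6d}: sup deviation = {np.abs(true_risk - emp_risk).max():.4f}")
```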
