Recent posts / Archive

Categories

Semivariograms

Originally posted by on 16:12 Mon 7 May 2007, last modified 02:26 Thu 17 May 2007.

File under: curve fitting geostatistics math maximum liklihood phd programming python spatial analysis

The Semivariogram is a cornerstone of geostatistics, and to understand what it means, I'll briefly introduce some terms that litter the literature of the field.

A bit of history...

If we have samples of some phenomenon, then we want to use them to generate a model that describes, or rather estimates, the phenomenon completely. There are a couple of ways this can be done, one method is genetic models, however for spatial data even simple processes can require many parameters and it is not a viable method. In the late 1960, when computers arrived on the scene, a technique called Trend Surface Analysis became possible. The assumptions underlying this technique is that the data can be represented by a polynomial (or actually any deterministic function), plus a random error component. By random error, we mean that the error is spatially uncorrelated, and does not depend on the function. The problem with this approach is that most geological or geographical phenomenon can display large short scale variation and large scale trends, and therefore uncorrelated errors, mean that the final model is overfit, because it has to model the short scale variation accurately. This leads us to consider that it might be better to allow for correlation between values at different distances apart. Hence the emergence of geostatistics.

Geostatistics

Georges Matheron, considered by many as the founder of geostatistic, coined the phrase regionalized variable, to highlight the contradictory features of these variables, namely a random aspect (for local irregularities) and a structural aspect (for large scale trends). Trend surfaces puts all the randomness into the error term, and the structure into a deterministic trem. Matheron proposed to introduce randomness as variations about a fixed surface, or drift. These variations are not errors, but rather a feature of the phenomenon.

By stationary, we mean that the variable is translation invariant. Therefore for any increment h the distribution of Z(x1), Z(x2),... Z(xk) is the same as that for Z(x1+h), Z(x2+h), ... Z(xk+h).

The Semivariogram

The semivariogram of an random function, which is akin to a stocastic process or random field in other disiplines is given as: γ(h) = 0.5 Var[Z(x+h) - Z(x)]. For stationary variables, the mean of Z(x+h) - Z(x) is zero, therefore: γ(h) = 0.5 E[Z(x+h) - Z(x)]2; here x is a position vector, for the point x in an N-dimentional space, and h is a separation vector. A typical semivariogram is shown below.

A typical semivariogram

Note the following features:

  • Theoretically it always starts at zero, as for zero separation, the two points are equal.
  • It generally increases with h, which makes sense because we expect point that are far apart to have little correlation with each other, meaning high variance.
  • It flattens out, at a value called the sill after a distance called the range

The above example is a Spherical model, which is probably the most common. Mathematically it is described by:

Spherical Model

Some other models are:

Exponential Model
Exponential

Gaussian Model
Gaussian

Cubic Model
Cubic

in all of these a and C are the range and sill respectively.

Implementation

I'll get round to posting about why we might want to determine these models in a future post, for now, I'm going to discuss how we can get them. First of all we need code that implements each model. The simplist way of doing this is to subclass a Model class, and override the function method. In Python this is very easy. We can also use Scipy's vectorize function, to turn a function that operates on scalers to operate on vectors.

from scipy import arange, vectorize

class Model(object):

    def __init__(self, type=None, sill=None, range=None, nugget=None):
        self.type = type
        self.sill = float(sill)
        self.range = float(range)
        self.nugget = float(nugget)
        self.func = vectorize(self.f)
        self.variance = 0
    
    def f(self, h):
        return 0

    def __cmp__(self, other):
        return cmp(self.variance, other.variance)
        
    def residual(self, p, xi, yi):
        self.nugget, self.sill, self.range = p
        if self.nugget < 0:
            self.nugget = 0
        err = yi - self.func(xi)
        return err
        
    def getCorrectedSill(self):
        return self.nugget + self.sill
        
    corrected_sill = property(getCorrectedSill)

            
class Spherical(Model):

    def __init__(self, sill=None, range=None, nugget=None):
        super(Spherical, self).__init__("Spherical", sill, range, nugget)
        
    def f(self, h):
        if self.range <= h:     
            if self.nugget > 0:
                return self.nugget + self.sill
            else:
                return self.sill
        else:
            h_over_range = float(h)/self.range
            if self.nugget > 0:
                return (self.nugget + self.sill*( (1.5*h_over_range) - (0.5*(h_over_range**3))))
            else:
                return (self.sill*( (1.5*h_over_range) - (0.5*(h_over_range**3))))

class Exponential(Model):

    def __init__(self, sill=None, range=None, nugget=None):
        super(Exponential, self).__init__("Exponential", sill, range, nugget)
        
    def f(self, h):
    
        arg = -(3*h)/self.range
        if self.nugget > 0:
            return self.nugget + self.sill*(1 - exp(arg))
        else:
            return self.sill*(1 - exp(arg))

class Gaussian(Model):

    def __init__(self, sill=None, range=None, nugget=None):
        super(Gaussian, self).__init__("Gaussian", sill, range, nugget)
        
    def f(self, h):
    
        arg = -3*((h/self.range)**2)
        if self.nugget > 0:
            return self.nugget + self.sill*(1 - exp(arg) )
        else:
            return self.sill*(1 - exp(arg) )        


class Pentaspherical(Model):

    def __init__(self, sill=None, range=None, nugget=None):
        super(Pentaspherical, self).__init__("Pentaspherical", sill, range, nugget)
        
    def f(self, h):
        if self.range <= h:
            if self.nugget > 0: return self.nugget + self.sill
            else: return self.sill      
        else:
            h_ovr_rng = float(h)/self.range
            if self.nugget > 0:
                return self.nugget + self.sill*( ((15.0/8.0)*h_ovr_rng) - ((5.0/4.0)*h_ovr_rng**3) + ((3.0/8.0)*h_ovr_rng**5)   )
            else:   
                return self.sill*( ((15.0/8.0)*h_ovr_rng) - ((5.0/4.0)*h_ovr_rng**3) + ((3.0/8.0)*h_ovr_rng**5)   )

You might also note that the base Model class has a residual function, which when given a tuple of parameters, and vectors of independent and dependent variables, will return a vector of errors. This is used in the fitting proceedure. I haven't actually included the Cubic, mainly because it's very similar to the Gaussian; I've also implemented the Pentaspherical model although it's rarely used. The plot below shows all the models together:

4 different variogram models

But what about some real data?

Well, if you have some observations, distributed in a space, then you've got some data. Rather than generate some, I'm going to use some real weather data from France, in particular, air temperature on the 12th April 1992. This data can be viewed using Google Maps on this page, note that it takes a little while to load. We then need to determine the separation vector for each pair of points, and we only want to consider the same two points once.

As we are concerned about anisotropy, which means how does the spatial correlation change with direction, we need sample the pairs depending on the argument of their separation vector. Generally we consider the angles 0, 45, 90 and 135, with a tollerance of 22.5 degrees. We also account for the reverse 180 degrees. We can also bin the data by the separation magnitude (or lag), and such a semivariogram is called an experimental semivariogram. This was common practice when computers we unable to fit models to hundrends of datum points, however now it's perfectly feasable to fit a model to the semivariogram cloud. That being said, binning by the lag does reduce the number of points, and is easier for a human to draw some conclusions from the graph.

A surface plot of air temperature over france

The first thing we need to do is remove the drift term, which we can do by fitting a polynomial surface to the data, which I described in this post, and then subtracting the value of this surface from the observed values. This results in a distribution of deviations, shown as the green points in the above plot. We can now compute the semivariogram... The basic algorithm is as follows:

cloud = []
for i in data
    for j in data[i:]  // this slice means we don't repeat data points
        k = i + j
        d, a = (data[i] - data[j]) // the distance and angle between two points
        cloud.add( (i, k, d, degrees(a)) )

Then depending on the angle, we can bin those datum pairs into appropriate data sets, and calculate their semivariance. These are then used to fit variogram models using maximum liklihood. The code to achieve this is show below:

def fitSermivariogramModel(self, data, model):
    """
        This will find the optimal values for the sill and range, to fit the desired model to the data.     
        in:
            data: The semivariogram data, a list of dictionaries.
            model: A function, that will return the value of the model for specific h, sill and range values.           
        out:
            aModel: the function will return the model, with optimal sill and range values set.
    """

        
    # We now need to form the arrays, for xi and yi from the data
    xi = array([d[0] for d in data])    
    yi = array([d[1] for d in data])
    
    # these are our inital guesses for the sill and range
    params = (model.nugget, model.sill, model.range)
    
    # perform the least squares
    lstsqResult = optimize.leastsq(model.residual, params, args=(xi,yi), full_output=1)
    p = lstsqResult[0]
    model.nugget, model.sill, model.range = p
        
    # work out the square deviation too
    squareDeviates = model.residual(p, xi, yi)**2
    # we divide by number of points minus the degrees of freedom
    denom = 1 / float(len(yi)-len(p))   
    model.variance = denom*sum(squareDeviates)                  
    return model
        
        
def fitModelsToData(self, models, data):
    # iterate through the models
    for m in models:
        # fit a variogram model...
        m = self.fitSermivariogramModel(data, m)                    
    models.sort()
    return models

We calculate the variance of the model, so that it can be used to compare different models, so that the "best" model can be used in future computation. Note, that determining the best model is rather a qualitative statement in this context, fitting Semivariogram models is not really an exact science, and often it takes trained eye to pick out the most appropriate model for the given data. The following graphs, show semivariograms for the same dataset, firstly to give an example of fitting models to both the semivariogram cloud, and the experimental semivariogram, for the same angle, and the best model for each angle plotted on the same graph for both the cloud and experimental. It's fairly easy to see that the parameters of the models do not differ much between the cloud and the experimental data, yet the experimental semivariogram is much easier to interpret.

Semivariogram Cloud
Semivariogram Cloud

Experimental Semivariogram
Experimental Semivariogram

The best models for all angles from cloud data
The best models for all angles from cloud data

The best models for all angles from experimental data
The best models for all angles from experimental data

What do the semivariograms tell us about the data?

If we compare the models derived from the experimental semivariogram, and the semivariogram cloud, then it's clear that they both observe the same general spatial correlations, the only significant difference is that the sill of the 135º direction is some 0.6º2 greater in the cloud data. The cloud is probably more accurate, the experimental semivariogram will be more susceptable to outliers as it's values are means.

The striking feature of these models, is that this data exhibits both zonal and geometric anisotropy. Geometric anisotropy occurs when the range but not the sill of the semivariance changes with direction; such as between angles 0º and 45º. Zonal anisotropy exists when the sill changes with direction, such as between 45º and 90º.

I'll finish this post off here, but I'll be writing more about anisotropy and what we can use these semivariogram models for in the near future.

comment

Comment on article