Model Calibration & Verification
7.1 Introduction
7.1.1 Scope
This section describes model calibration. Topics covered include:
- Concepts of model calibration (definitions, calibration process)
- Calibration targets, calibration parameters
- Parameter estimation techniques
- Evaluation of calibration results (goodness of fit criteria)
- Model verification (testing of calibrated model).
7.1.2 Definitions
Model calibration is defined as the process of refining the numerical model's representation of the hydrogeological framework, hydraulic properties, and boundary conditions to achieve a desired degree of correspondence between the model simulation and observations of the groundwater flow system (ASTM, 2008).
The following are two definitions of a calibrated model:
- A calibrated model is a model that has achieved a desired degree of correspondence between the model simulations and observations of the physical hydrogeological system (ASTM, 2008).
- A calibrated model adequately represents the system conditions such that an answer to the question posed by the modeller or the regulator is possible (Woessner and Anderson, 1996).
Observed values are observations or sample data describing the state of the groundwater system as measured in the field, or the properties of the system. Examples of observed values include the elevation of the water in a piezometer, the flow rate at a spring, or the concentration of contaminants in a water quality sample. Field parameter values usually refer to hydraulic conductivity, transmissivity, and storage properties determined from analysis of hydraulic tests.
Calculated values are the output from the numerical model. Examples include hydraulic pressure distributions, flow rates, and contaminant concentrations. In a predictive model, calculated values are the "predicted values".
Calibration parameters are those model parameters (hydraulic properties or boundary conditions) whose values are adjusted during the calibration process.
Calibration targets are observed values which are matched to corresponding calculated values during the model calibration process. For a transient model this is sometimes called "history matching".
7.1.3 Calibration Process
The calibration process involves refining the hydrogeological conceptual model and the numerical model parameters to achieve the desired degree of correspondence between the model simulation results and the observations of the groundwater flow system. The degree of difficulty of model calibration depends on the amount and the quality of measured data, the complexity of the processes being simulated, and the complexity of the conceptual model. Modeller experience and computing power also have a significant effect.
There is always some initial conceptual model of the groundwater flow system, and usually some parts of this model are adjusted during this process. For a given conceptual model configuration, the calibration process consists of adjustments to hydraulic parameters and recharge fluxes within a reasonable range of values. If the adjustment of hydraulic properties fails to provide an adequate calibration result, the conceptual model is modified again (e.g. by changing a boundary condition).
The choice of evaluating alternative conceptual models and calibrating each model depends on project objectives and budget. There is always more than one conceptual interpretation (feasible conceptual model) of a groundwater system, and it is important to evaluate many lesser known aspects of the conceptual model (Poeter et al, 2008). However, most modelling projects present only one conceptual model along with only one final calibration result. It is acceptable to present only one calibrated model based on one selected conceptual model configuration, but the assumptions and limitations of alternative conceptual models should at least be discussed. Where it is not clear which conceptual model is the most valid, and there are high risks and consequences to model results, it may be advisable to evaluate two or more conceptual models. This would entail constructing alternative mathematical models, calibrating each separately, and presenting alternative predictive models.
For the purpose of this section it is assumed that the modeller has a robust and representative conceptual model before proceeding to implement the calibration guidelines of this section. If multiple conceptual models are considered reasonable, each should be carried through calibration for use during uncertainty analysis (see Section 8). Case study 2 provides a good example of the use of multiple conceptual models (refer to Section 7.7.1).
The following aspects of model calibration should be documented to facilitate review of model calibration:
- The details of calibration statistics are presented, including a complete description of calibration residual distribution.
- Major assumptions about data interpretations are listed in a prominent place in the model report. Model calibration limitations are discussed or listed (e.g. explanation of large calibration residuals).
- An initial sensitivity analysis to calibration parameters and ranking of most important parameters (see Section 8 for more details on model uncertainty and rigorous sensitivity analysis process).
- There is a good discussion on the alternative conceptual models as relating to model calibration results.
7.1.4 Calibration Parameters
The following is a list of calibration parameters commonly used for model calibration:
- Hydraulic properties are selected by identifying zones of similar aquifer hydraulic properties based on geology and aquifer testing.
- Recharge flux is estimated based on regional or local analysis of precipitation and water balance, ground cover type, elevation, soil properties, etc.
- Discharge flux of groundwater to surface drainage network.
- Other groups of inputs that can be parameterized for the project objectives.
For each calibration parameter, the range of possible realistic values that parameter may have in the physical hydrogeological system should be identified prior to model calibration.
7.1.5 Calibration Targets
The most common observed values used as calibration targets in groundwater flow models are:
- Hydraulic heads (water table elevation in unconfined aquifer, potentiometric surface or pressure distribution in confined units) at one or more points in space, and may include multiple observed heads in many hydrogeological units.
- Groundwater flux as observed discharging to surface, creeks, lakes or mine workings (pits, underground workings). Other fluxes sometimes used for model calibration include net infiltration to water table (recharge), observed seepage losses (from ditches, streams, lakes), and observed volumes pumped from wells or injected into an aquifer.
- Water density or salinity in density-dependent flow models.
- Concentrations of contaminants in contaminant transport models (see Section 9).
Temperature as a calibration target in coupled groundwater flow and heat flow models is not covered in these guidelines. However, water temperature is a useful natural tracer that is used in transient analyses to estimate groundwater recharge sources and lag times.
7.1.6 Calibration Data Requirements
The observed data used as calibration targets must have sufficient spatial distribution for all models, and sufficient temporal distribution for transient models. A large number of uniformly distributed calibration targets, each having small associated error, will increase the likelihood of obtaining a unique calibration, as will the use of groundwater fluxes.
The following is a list of common problems with calibration target data distributions:
- Hydraulic head data points are clustered and not widely distributed across the site in the model domain. Typical examples are exploration boreholes that were also tested and monitored in ore deposit areas, or clusters of engineering boreholes in proposed waste rock dump or other engineering study areas.
- Monitoring wells may also be concentrated in easy to access areas such as valley bottoms, especially near existing roads and streams. Steep slopes and mountain tops are typically not monitored due to difficulties in monitoring well installation, but the most important data for calibration of a recharge-driven model are on the steep slopes and mountain tops.
- Having many points along rivers or lakes does not help the calibration because the surface waters are usually represented with boundary conditions and the model head is set (not calibrated) in those areas.
- Groundwater discharge fluxes (base flows in streams) may be monitored in adjacent catchments, or too far downstream from the model site and affected by runoff from other catchments.
- Carefully planned field programs may not anticipate site complexities which are evaluated later in the modelling process (e.g. faults in fractured rocks).
The modeller (and model reviewer) should be conscious of these limitations with potential calibration target data and take these limitations into consideration when selecting calibration targets and evaluating model calibration.
7.1.7 Calibration Data Quality
Difficulties in calibrating the numerical model to observed field data may indicate a problem with the quality of the monitoring data. In principle, all monitoring data should have been checked prior to model calibration. However, some data errors will not be apparent until they show significant discrepancies with the simulated response during model calibration.
The error bounds and calibration targets should be set before the calibration process. Sources of error in each calibration point should be assessed and quantified to assign relative weights to data points before starting calibration.
Data quality issues that should be considered include:
- Representativeness of water level
- Transient variation of hydraulic head
- Positional survey error
- Monitoring piezometer design uncertainty
- Water level measurement error
- Representativeness of discharge measurements (locations of measurements)
- Streamflow measurement method error (base flow measurement may be difficult if the streams receive continuous runoff from precipitation or ice melt)
- Assumptions for hydrograph analysis (streamflow data)
- Dewatering system statistics (e.g. from underground mine) or open pit water level changes (mine inflow); groundwater extraction statistics (for pumping systems)
- Representativeness of recharge measurements (locations of measurements)
- Assumptions in infiltration modelling (soil types, materials used as covers, geology, scale, precipitation inputs, climate, etc.)
- Temporal variation in water density and salinity (e.g. during active salt water intrusion in coastal aquifers, tides)
- Misinterpretations of geophysical survey results.
Appendix F provides more details for each data error type.
7.1.8 Steady-state vs Transient Calibration
Steady-state simulations are used to model equilibrium conditions representing the "average" hydrological balance, or conditions where aquifer storage changes are not significant.
Transient simulations are used to model time-dependent problems, and/or where significant volumes of water are released from or taken into aquifer storage (for example a pumping test or a highly seasonal flow field).
Data requirements for transient models depend on the modelling objectives, but in general there should be transient data on the same time scale, and with the same temporal resolution as the modeled stress and duration, such as:
- The data set used for transient calibration should include pumping test data, and/or sufficient duration of regular monitoring data that shows the natural seasonal variations and responses to artificial stresses applied during natural resource extraction projects.
- The transient data should be available for several spatially distributed representative locations throughout the model domain.
- Different hydrostratigraphic units should be tested, with several tests at different locations, to give more confidence in the uniqueness of the model calibration.
Case study 3 provides a good example of the use of steady-state and transient calibration (Section 7.7.2).
7.2 Calibration Techniques
Calibration of a numerical model may be done by manual trial-and-error or by automatic parameter estimation methods, or a combination of the two.
7.2.1 Manual Trial-and-Error Calibration
In trial-and-error calibration, the modeller changes the model input parameters manually in order to improve the correlation between model output parameters and field parameter values.
Manual trial-and-error calibration may proceed by changing one parameter at a time (similar to a sensitivity analysis, see Section 8) or by trying different combinations of parameters. Manual trial-and-error calibration is labour-intensive and time consuming but is the most common method of calibration.
Manual trial-and-error calibration gives the modeller significant insight into the factors controlling the system and should always be part of model calibration, in particular during the early stages when the conceptual model has not been finalized.
7.2.2 Automated Parameter Estimation
Automated parameter estimation involves the use of one or more computer codes specifically developed to undertake model calibration. The following inverse codes and interfaces are commonly used with MODFLOW:
- UCODE_2005 - MODFLOW-2000 Observation, Sensitivity, and Parameter-Estimation Processes (Poeter et al, 2005).
- PEST - Parameter Estimation (Doherty, 1994, 2005).
- Various user interfaces to MODFLOW such as Visual MODFLOW (SWS, 2011), Groundwater Vistas (Scientific Software Group, 2011) and GMS (Aquaveo, 2011) also have PEST capabilities (see Section 5).
Extensive guidelines for automated and effective model calibration have been developed by Hill and Tiedeman (2007). Difficulties in automated calibration may be resolved as follows (Poeter et al, 2005):
- Reconsider the conceptual model.
- Confirm the accuracy of field data.
- Modify the input parameters.
- Use a different numerical code.
Automated parameter estimation should be used only after completion of at least some initial manual calibration to: (i) confirm that the conceptual model is reasonable; and (ii) bracket the range of model parameters to be varied in automated parameter estimation.
The use of automated parameter estimation techniques requires significant specialized experience by the modeller. If not used properly, this method may yield incorrect results (very good calibration statistics but incorrect parameter distributions) or no result at all (non-convergence).
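The idea behind automated parameter estimation can be illustrated with a minimal sketch. It is not a representation of how PEST or UCODE_2005 work internally. It assumes a hypothetical one-parameter forward model (steady one-dimensional Darcy flow with a known flux), an objective function equal to the sum of squared weighted residuals, and a golden-section search on log10(K) within bounds that would have been bracketed during manual calibration. All names and values are illustrative assumptions.

```python
import math

def simulate_heads(k, xs, h0=100.0, q=0.02):
    """Hypothetical forward model: 1-D steady Darcy flow with known flux q."""
    return [h0 - (q / k) * x for x in xs]

def objective(log_k, xs, observed, weights):
    """Sum of squared weighted residuals for a trial value of log10(K)."""
    simulated = simulate_heads(10.0 ** log_k, xs)
    return sum((w * (hc - hm)) ** 2
               for w, hc, hm in zip(weights, simulated, observed))

def golden_section(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2.0

xs = [100.0, 200.0, 400.0]          # distances from the fixed-head boundary (m)
observed = [99.02, 97.97, 96.03]    # hypothetical measured heads (m)
weights = [1.0, 1.0, 1.0]           # equal confidence in all targets

# Assumed bounds from manual calibration: K between 0.1 and 100 m/d.
best_log_k = golden_section(
    lambda v: objective(v, xs, observed, weights), -1.0, 2.0)
best_k = 10.0 ** best_log_k
print(f"estimated K = {best_k:.3f} m/d")
```

The search only succeeds here because the flux is treated as known and the bounds are realistic. With both K and the flux free, the same objective function would have no unique minimum, which is the non-uniqueness problem discussed in the next subsection.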
7.2.3 Non-Uniqueness
Non-uniqueness during model calibration arises because many different sets of model input parameters can produce nearly identical model outputs (Brown, 1996). Any combination of groundwater flow rates and hydraulic conductivities input to the model that has the same ratio as the actual flow rates and hydraulic conductivities in the aquifer will produce nearly identical hydraulic head distributions as output. Hence, a good matching of measured and modeled hydraulic heads during calibration does not guarantee that the hydraulic properties used in the model are close to those actually found on site.
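A minimal sketch of this ratio effect follows. It assumes steady one-dimensional Darcy flow away from a fixed-head boundary, for which the head at distance x is h(x) = h0 - (q / K) x. The function name and all values are illustrative assumptions, not data from these guidelines.

```python
def head_profile(h0, q, k, xs):
    """Heads for 1-D steady Darcy flow: h(x) = h0 - (q / k) * x.

    h0: fixed head at x = 0 (m); q: specific discharge (m/d);
    k: hydraulic conductivity (m/d); xs: distances from the boundary (m).
    """
    return [h0 - (q / k) * x for x in xs]

xs = [0.0, 100.0, 200.0, 400.0]

# Two hypothetical parameter sets with the same q/k ratio.
heads_low = head_profile(100.0, q=0.01, k=1.0, xs=xs)
heads_high = head_profile(100.0, q=0.10, k=10.0, xs=xs)

# A third set with a different ratio, for contrast.
heads_other = head_profile(100.0, q=0.01, k=2.0, xs=xs)

for x, a, b, c in zip(xs, heads_low, heads_high, heads_other):
    print(f"x={x:6.1f} m  set1={a:7.2f}  set2={b:7.2f}  set3={c:7.2f}")
```

Sets 1 and 2 cannot be distinguished from head data alone, although their fluxes differ by a factor of ten. A measured flow rate would separate them, which is why flux targets reduce non-uniqueness.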
Non-uniqueness cannot be eliminated but it may be reduced. Methods to address the non-uniqueness problem include:
- Restrict the range of input parameters to values that are consistent with field values.
- Calibrate the model to a range of distinct hydrological conditions (e.g. seasonal climate variation and extreme conditions, and ranges of induced stresses).
- Use measured groundwater flow rates (e.g. stream base flow) as calibration targets (in addition to hydraulic heads).
- Use data that has sufficient spatial and temporal distribution.
Figure 7-1 illustrates the concept of non-uniqueness, as well as the value of calibrating to multiple datasets. The different lines associated with each dataset represent possible hydraulic conductivity and flow combinations that would provide the same calibration result.
The area within the red circle represents a more constrained range of flow and K value combinations that would fit multiple datasets. While multiple combinations are still possible, the overall calibration is improved. Model confidence improves as more datasets are used to calibrate the model. Therefore, the variety and distribution of available data are key model attributes to be considered by the modeller and reviewer.

Figure 7-1: Addressing the non-uniqueness problem (after Ritchey and Rumbaugh, 1996).
7.3 Evaluation of Calibration Results (Goodness of Fit)
This section describes methods for evaluating the calibration results. This includes a discussion of calibration acceptance criteria and descriptions of various qualitative and quantitative methods for comparing field measurements to the same parameter as calculated with the model.
7.3.1 Calibration Acceptance Criteria
The calibration acceptance criteria refer to model goodness of fit using quantitative (statistical) thresholds and qualitative goodness of fit requirements. These criteria are project-specific, depend on the modelling objectives and data availability, and should be defined before model calibration. Subsequent changes in calibration criteria should be justified.
The regulator reviewing a model should examine calibration criteria or targets to determine that the proponent develops a valid, robust, and rigorous model, based on an appropriate conceptual model and calibration procedures. These guidelines support the use of a-priori specified quantitative acceptance criteria used in combination with qualitative calibration performance measures.
These guidelines do not set prescriptive calibration criteria. It is recognized that the advantages of prescriptive calibration criteria include:
- Unambiguous performance measure to judge model calibration.
- Desirable for regulating agencies, as it sets out the required performance specifications.
Prescriptive criteria may, however, involve the following disadvantages that make them less desirable than project-specific criteria:
- A potential overemphasis on calibration, or even erroneous calibration. This happens if a modeller adjusts aquifer properties to ensure a better match of simulated heads with field observations, when in fact the field data are wrong.
- Achievement is contingent on model complexity, which in turn depends on geological knowledge, data availability and quality, deadline, and budget.
7.3.2 Qualitative Calibration Evaluation
Examples of spatial distributions used in qualitative evaluation of model calibration include:
- Patterns of groundwater flow based on modeled contour plans of aquifer heads.
- Patterns of aquifer response to variations in hydrological stresses (hydraulic head hydrographs).
- Distributions of model aquifer properties adopted to achieve calibration.
A qualitative evaluation of model calibration can also take into account the specific location of calibration targets, and potential differences in their relative contribution to model calibration. For example, residuals for head values near specified head boundaries are constrained by the nearby boundary and cannot vary easily in the model simulation, while heads near hydraulic boundaries can vary greatly as a result of variation in hydraulic properties.
7.3.3 Statistical Calibration Evaluation
There are many methods for quantitatively evaluating the goodness of fit between measured and modeled parameters. The proponent is encouraged to select statistical methods applicable to a specific groundwater modelling project. The following are considered the minimum statistical evaluations that should be reported.
7.3.3.1 Residuals
Mathematically, a residual is simply the difference between a measured and a calculated value (or between a calculated and a measured value). In groundwater modelling, residuals may be calculated by comparing measured versus calculated heads, flow rates, constituent concentrations, or any other reasonably comparable parameter. Generally in groundwater flow modelling, residuals are calculated for head, and thus in this subsection reference is made only to hydraulic head.
The hydraulic head residual ($r_i$) is the difference between the calculated (modeled) head value ($h_c$) and the measured head value ($h_m$) at point $i$. It may be expressed by either of these equations; whichever sign convention is chosen should be stated and applied consistently:
$r_i = h_c - h_m$ or $r_i = h_m - h_c$
Weighting coefficients ($W_i$) can be applied to account for confidence in the data quality. $W_i$ can vary from 0 to 1. If all points are weighted equally, $W_i$ is equal to 1 for each point $i$. Poor quality measurements may be excluded or, in the case of clusters of points, the most representative and best quality measurement may be used and the others in the cluster excluded.
Methods for establishing acceptable residuals include:
- Judgment
- Kriging (variance estimate at each observation point)
- Trend analysis (heterogeneous aquifers defined by sub-regions only).
In general, the residual should be a small fraction of the difference between the highest and lowest heads across the site (ASTM, 2008). Acceptable residuals may differ for different hydraulic head calibration targets.
7.3.3.2 Normalization
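In order to standardize average measures with different units or scales, a non-dimensional normalized measure is used, in which the measure is divided by the range of measured values:

$$\text{Normalized measure} = \frac{\text{Measure}}{h_{max} - h_{min}}$$

where $h_{max}$ and $h_{min}$ are the highest and lowest measured heads.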
7.3.3.3 Average Residual Error Measures
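The Root Mean Squared Error (RMSE) and the Normalized RMSE (NRMSE) may be used:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( W_i \, r_i \right)^2}, \qquad \text{NRMSE} = \frac{\text{RMSE}}{h_{max} - h_{min}}$$

where $n$ is the number of calibration targets, $r_i$ is the residual and $W_i$ the weighting coefficient at point $i$, and $h_{max}$ and $h_{min}$ are the highest and lowest measured heads.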
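The Mean Absolute Error (MAE), or the Normalized Mean Absolute Error (NMAE), is also commonly used:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| W_i \, r_i \right|, \qquad \text{NMAE} = \frac{\text{MAE}}{h_{max} - h_{min}}$$

where $n$ is the number of calibration targets, $r_i$ is the residual and $W_i$ the weighting coefficient at point $i$, and $h_{max}$ and $h_{min}$ are the highest and lowest measured heads.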
In mountainous regions, normalized errors may be more appropriate to account for significant variation in water level elevations. Alternatively, different measurements may be used to remove effects related to elevation. For example, calibration may be based on depth to water instead of water level elevation. The same residual error measures listed above can be used for either type of water level measurement.
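The average residual error measures can be computed with a short script. The sketch below assumes the sign convention residual = calculated minus measured, and assumes normalization by the range of the measured values. The well data are hypothetical and chosen only to show how the same residuals give different normalized errors for elevation heads and for depth to water.

```python
import math

def error_measures(measured, calculated, weights=None):
    """Return RMSE, MAE and their range-normalized forms.

    Residuals use r_i = h_c - h_m; normalization divides by the range
    of the measured values (max - min).
    """
    n = len(measured)
    if weights is None:
        weights = [1.0] * n  # equal weighting
    weighted = [w * (hc - hm)
                for w, hc, hm in zip(weights, calculated, measured)]
    rmse = math.sqrt(sum(r * r for r in weighted) / n)
    mae = sum(abs(r) for r in weighted) / n
    data_range = max(measured) - min(measured)
    return {"rmse": rmse, "mae": mae,
            "nrmse": rmse / data_range, "nmae": mae / data_range}

# Hypothetical wells on a high-relief site (all values in metres).
ground_elev = [1500.0, 1200.0, 900.0, 600.0]
head_measured = [1480.0, 1190.0, 885.0, 598.0]
head_calculated = [1485.0, 1184.0, 890.0, 594.0]

by_elevation = error_measures(head_measured, head_calculated)

# Same wells expressed as depth to water (ground elevation minus head).
depth_measured = [g - h for g, h in zip(ground_elev, head_measured)]
depth_calculated = [g - h for g, h in zip(ground_elev, head_calculated)]
by_depth = error_measures(depth_measured, depth_calculated)

print(f"RMSE (both cases):       {by_elevation['rmse']:.2f} m")
print(f"NRMSE on elevation:      {by_elevation['nrmse']:.1%}")
print(f"NRMSE on depth to water: {by_depth['nrmse']:.1%}")
```

The RMSE is identical in both cases because the residuals differ only in sign. The normalized value changes because the range of depth to water is far smaller than the range of head elevations, so a low normalized error on a high-relief site should be interpreted with care.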
7.3.3.4 Correlation Coefficient
The correlation coefficient (R) is a measure of the correlation of two data sets. R² is the coefficient of determination. R calculation requires the mean and standard deviation of the observed and calculated values and evaluates the residuals relative to the mean. R varies from -1 to 1 and R² varies from 0 to 1 (an R of 1 meaning a perfect positive linear relationship).
In hydrogeological modelling, a model is often considered calibrated when the correlation coefficient is at least 0.95. However, R is very sensitive to outliers (very large positive or negative residuals), and high R values may often result if the data have strong auto-correlation (e.g. hydraulic heads correlated to topography on a high topographic relief site). Also, an R value is only meaningful if there is a randomly distributed scatter of residuals and a sufficiently large number of points (e.g. a high R statistic based on two calibration points indicates a perfect calibration to an insufficient data set).
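The two pitfalls noted above can be shown with a minimal sketch. It assumes R is the Pearson correlation coefficient between measured and calculated heads. All head values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Case 1: only two calibration points. Any two points with a positive
# trend give R = 1, even though both residuals here are 30 m.
r_two_points = pearson_r([100.0, 200.0], [130.0, 170.0])

# Case 2: high-relief site where heads follow topography.
measured = [600.0, 900.0, 1200.0, 1500.0]
calculated = [640.0, 930.0, 1160.0, 1545.0]
r_relief = pearson_r(measured, calculated)
largest_misfit = max(abs(c - m) for c, m in zip(calculated, measured))

print(f"two points:  R = {r_two_points:.3f}")
print(f"high relief: R = {r_relief:.3f}, "
      f"largest residual = {largest_misfit:.0f} m")
```

In the second case R exceeds 0.95 although individual residuals reach tens of metres, because the large range of head elevations dominates the statistic.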
7.3.4 Graphical Calibration Evaluation
These guidelines recommend the use of graphics to enhance the qualitative evaluation of model calibration. The following subsections provide examples of graphical means to evaluate (and illustrate) model calibration.
7.3.4.1 Scatterplots
To show that there is no systematic error in the spatial distribution of differences between modeled and measured heads, the modeller should compile a scattergram (scatterplot). Scatterplots of residuals are graphical representations of goodness of fit of individual calibration targets (usually head values) associated with average error statistics. These plots show measured hydraulic heads on the horizontal axis, and modeled hydraulic heads on the vertical axis, with one point plotted for each pair of data at observation points. All the points should occur with a minimum degree of scatter about the line of perfect fit. Scatterplots are useful in detecting:
- Outliers
- Clustering
- Trends.
The scatterplots should be clearly presented and the graph scales clearly visible. Some model calibration plots appear to show good fit and the residuals are small in appearance on the graph, but the apparent magnitude of residuals depends on the graph scales. Confidence intervals should also be plotted.
Many other types of plots may be used to demonstrate the spatial distribution of error; examples are given in modelling text books (e.g. Hill and Tiedeman, 2007; Anderson and Woessner, 1992; and Spitz and Moreno, 1996).
Figures 7-2a to 7-2d illustrate types of scatter plots and histograms used at two sites with high topographic relief. In Figure 7-2a, results suggest a good fit with normalized residuals of 4.9% and R² of 99%. Figure 7-2b shows a scatter plot of residuals for site 2. Similarly to Figure 7-2a, residuals based on calculated vs. observed head elevations appear acceptable. The residual histogram for this site (Figure 7-2c) shows the range of residuals more clearly. Histograms are described in the next section. Figure 7-2d presents the residuals as depth to water, which removes the influence of elevation. In contrast to the nRMSE presented for data based on elevation in Figure 7-2b, the nRMSE for depth to water is significantly higher (23% for depth to water vs. 3.5% for elevation).
It is a good practice to categorize calibration points based on relative importance. For example, calibration points near boundary conditions (which are likely to be influenced more by the boundary than aquifer parameters) or which are clustered (and therefore are redundant), can be identified. Figure 7-3 provides examples of scatterplots illustrating this concept.
7.3.4.2 Histograms and Cumulative Frequency Plots
The residuals plotted in a histogram should be approximately normally distributed with a mean close to zero. Figure 7-2c is an example of such a histogram.
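A quick numerical check can accompany the histogram. The sketch below is one possible approach, not a prescribed method: it reports the mean, standard deviation and skewness of the residuals and counts them in fixed-width bins. The residual values and the 1 m bin width are hypothetical.

```python
import math

def residual_summary(residuals, bin_width):
    """Mean, standard deviation, skewness and bin counts of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in residuals) / n)
    skew = sum(((r - mean) / std) ** 3 for r in residuals) / n
    counts = {}
    for r in residuals:
        # Lower edge of the bin that contains this residual.
        lower_edge = math.floor(r / bin_width) * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return mean, std, skew, dict(sorted(counts.items()))

# Hypothetical head residuals (m), calculated minus measured.
residuals = [-1.8, -1.1, -0.7, -0.4, -0.2, 0.1, 0.3, 0.4, 0.8, 1.2, 1.6]
mean, std, skew, counts = residual_summary(residuals, bin_width=1.0)

print(f"mean = {mean:+.2f} m, std = {std:.2f} m, skewness = {skew:+.2f}")
for lower_edge, count in counts.items():
    print(f"[{lower_edge:+.1f}, {lower_edge + 1.0:+.1f}) {'#' * count}")
```

A mean far from zero points to a systematic bias (heads generally too high or too low), while strong skewness suggests that a few large residuals of one sign dominate.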
Figure 7-2: Examples of residual presentation plots presented for two sites with high topographic relief: (a) poorly presented graph (b) clearly presented graph (c) histogram of residuals (d) plot of residuals for depth to water.
Figure 7-3: Example of categorized scatterplot: (a) all points, (b) categories of data points by importance.
7.3.4.3 Spatial distributions of residuals and head contours
There are many ways of displaying the spatial distribution of residuals from calibration. Different symbols and colors can be used to enhance the perception of magnitudes and signs of residuals, and their location in relation to other features and boundary conditions.
Figures 7-4a and 7-4b provide examples of map-type graphical distributions of errors. In Figure 7-4a, a calibration bias can be seen in the upstream portion of the model domain. The calibration targets at wells P24, P23, P26 and P11 all appear in red, indicating that the simulated heads at these locations are above the observed heads by more than 200% of the calibration target tolerance (in this case +/- 0.5 m). Figure 7-4b illustrates the use of scaled "bubble plots" to show the spatial distribution of residuals. Both of these examples illustrate how showing the spatial distribution of residuals provides insight into which areas of the model have better or worse fit, as well as where the model has fewer or more calibration points.
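Symbol classes for a residual map can be assigned from the ratio of each residual to the calibration target tolerance. The sketch below assumes three classes with breaks at 100% and 200% of the tolerance. The class labels, well names and residual values are hypothetical; only the +/- 0.5 m tolerance is taken from the example above.

```python
def classify_residual(residual, tolerance):
    """Assign a map symbol class from the residual-to-tolerance ratio."""
    ratio = abs(residual) / tolerance
    if ratio <= 1.0:
        return "within target"
    if ratio <= 2.0:
        return "within 200% of target"
    return "beyond 200% of target"

tolerance = 0.5  # calibration target tolerance (m)

# Hypothetical wells and residuals (calculated minus measured, m).
residuals = {"W1": 1.4, "W2": 1.2, "W3": -0.3, "W4": 0.8, "W5": -1.6}

classes = {well: classify_residual(r, tolerance)
           for well, r in residuals.items()}
for well, label in classes.items():
    print(f"{well}: {residuals[well]:+.1f} m  {label}")
```

Keeping the sign of the residual alongside the class (for example with different symbol shapes) helps reveal spatial bias such as a cluster of over-predicted heads.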
Model cross-sections are also useful to present modeled and observed water table, locations of wells, hydrostratigraphic unit thickness and geometry, vertical flow paths and contours. Cross-sections are particularly useful in steep slopes near boundary conditions and features of interest (e.g. pits, tailings dams, waste rock piles). Figure 7-4c is an example of a cross-section type calibration graphic. In this example, calibration bias in higher elevation areas can be observed (higher elevation areas have different residual error than other areas).
Figure 7-4a: Map of residuals and head contours.
Figure 7-4b: Bubble plots scaled to residuals overlain on map (Scibek and Allen, 2005).
Figure 7-4c: Cross-section plot showing graphical water table representation and calibration residuals.
Figure 7-4: Examples illustrating spatial distribution of residuals.
7.3.4.4 Spatial distribution of fluxes to and from sources and sinks and boundaries
The distribution of flux magnitudes and directions should be graphed on a map to identify which boundary conditions are taking water or producing water. Figure 7-5 is an example of this type of graphic for a model in mountainous terrain. Dry or inactive drains are identified and the magnitudes of groundwater fluxes toward receiving drains are easily compared. The modeled quantities should be summed for streams and rivers for which base flow data are available, and compared to the measured quantities. The model should be calibrated to generate the same base flow as observed (within error bounds). The use of fluxes for calibration (in combination with heads) provides an additional constraint on the model and reduces the non-uniqueness of the solution.
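The comparison of summed fluxes to observed base flow can be sketched as follows. The stream names, flux values and the +/- 20% error bound are hypothetical assumptions chosen for illustration; the appropriate error bound depends on the streamflow measurement method.

```python
def compare_base_flow(drain_fluxes, observed, error_fraction):
    """Sum modeled drain fluxes per stream and test each total against
    observed base flow within a +/- error_fraction band."""
    results = {}
    for stream, fluxes in drain_fluxes.items():
        modeled = sum(fluxes)
        target = observed[stream]
        lower = target * (1.0 - error_fraction)
        upper = target * (1.0 + error_fraction)
        results[stream] = (modeled, target, lower <= modeled <= upper)
    return results

# Hypothetical modeled discharge to drain cells along two streams (m3/d).
drain_fluxes = {
    "Creek A": [120.0, 95.0, 0.0, 210.0],  # 0.0 marks a dry (inactive) drain
    "Creek B": [40.0, 55.0, 30.0],
}
observed_base_flow = {"Creek A": 450.0, "Creek B": 200.0}  # hypothetical

results = compare_base_flow(drain_fluxes, observed_base_flow,
                            error_fraction=0.2)
for stream, (modeled, target, ok) in results.items():
    print(f"{stream}: modeled {modeled:.0f}, observed {target:.0f}, "
          f"within bounds: {ok}")
```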
Figure 7-5: Example of graphical water budget representation.
7.3.4.5 Transient Time Series Plots
Transient calibration can provide improved confidence in model results, particularly in cases where seasonal effects may be important, but transient calibration can be subject to different issues than steady-state calibration.
There are cases when a simulated hydraulic head hydrograph might agree very well with a measured head hydrograph in pattern and amplitude, such that the transient datasets are parallel, but differ in absolute magnitude. Figure 7-6 presents an example of hydraulic heads near a river responding to river stage variation. The average residual error statistics for calibration point B suggest a poor calibration, when in fact the modeled transient response might be very good. At this point, the time series regression and correlation show a high coefficient of determination and a good model fit, despite the observed absolute difference in hydraulic head elevation.
Another technique is the standard correlation function (r) between two time series (Zheng and Bennett, 1995). A more advanced definition of correlation with lag might show whether a model is responding too fast or too slowly. Note that time series regression compares the residuals between observed and regression model (linear usually), so as long as the two time series of water levels vary in time similarly (despite systematic shifts or amplitude changes), the correlation and coefficient of determination will be high. This is a different concept than a simple error measure such as nRMSE. The nRMSE is useful for evaluating spatially distributed residuals, while time series correlation statistics are useful for comparing time series variation with time.
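The difference between an average error measure and a time series correlation can be illustrated with a minimal sketch. It assumes the Pearson correlation as the correlation function and a simple integer-day lag search, which may differ from the formulation in the cited reference. The synthetic series are hypothetical: the simulated heads are shifted up by 1.5 m and delayed by 3 days relative to the observed heads.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def lagged_correlation(observed, simulated, max_lag):
    """Correlation of observed[t] with simulated[t + lag] for each lag.

    A best lag > 0 means the simulated response arrives late (too slow);
    a best lag < 0 means it arrives early (too fast).
    """
    n = len(observed)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(observed[t], simulated[t + lag])
                 for t in range(n) if 0 <= t + lag < n]
        result[lag] = pearson_r([p[0] for p in pairs],
                                [p[1] for p in pairs])
    return result

days = range(60)
observed = [50.0 + 0.8 * math.sin(2.0 * math.pi * d / 30.0) for d in days]
simulated = [51.5 + 0.8 * math.sin(2.0 * math.pi * (d - 3) / 30.0)
             for d in days]

rmse = math.sqrt(sum((s - o) ** 2 for s, o in zip(simulated, observed))
                 / len(observed))
correlations = lagged_correlation(observed, simulated, max_lag=6)
best_lag = max(correlations, key=correlations.get)

print(f"RMSE = {rmse:.2f} m")
print(f"r at lag 0 = {correlations[0]:.3f}")
print(f"best lag = {best_lag} days, r = {correlations[best_lag]:.3f}")
```

The RMSE is dominated by the datum shift, while the lagged correlation isolates the timing of the response, so the two statistics answer different questions about the same hydrograph.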

Figure 7-6: Examples of transient calibration presentation for a site located near a small river, showing river-aquifer interaction and transient hydraulic head calibration time series: (a) good calibration except to high outliers, (b) good calibration to variation in time but with shifted datum.
Aquifer heterogeneity and model initial conditions can also affect transient calibration. The transient model fit may vary greatly from point to point due to aquifer heterogeneity, as shown in the example from the Abbotsford-Sumas Aquifer in Figure 7-7a (Scibek and Allen, 2005). In this example there is also some variation in recharge due to differences in soil types, but the water table in a high permeability homogeneous aquifer is very smooth despite small scale spatial recharge variation. Large steps in the water table result from aquifer heterogeneity (or from structural control in fractured rock).
The transient model may also start with wrong initial conditions (Figure 7-7b) but the overall fit can be reasonably good at later times. What is important is the match in amplitude of variation, no lag in response, good initial conditions, and responsiveness to smaller frequency "events". The vertical datum is of lesser importance in the transient model if the regional gradients are sufficiently calibrated.
The average residual error statistics can be applied to drawdowns (or depth to water) rather than elevation heads, where the drawdown is normalized to the initial head or some other datum. Measured and simulated drawdowns would each have their own reference datum. An example of drawdown calibration in four pumping wells is shown in Figure 7-8. The flow model was calibrated to 11 pumping wells by changing aquifer properties in sub-zones. A model does not have to fit all observation points, but a good model will fit most of them, especially the most important points (as judged by the modeller).
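A minimal sketch of this idea is given below, using hypothetical head values at a single observation well (the numbers and function names are illustrative assumptions only). Each series is converted to drawdown relative to its own initial (static) head before the residuals are computed, so that a constant datum offset between the model and the field data drops out.

```python
import math

def drawdowns(heads, reference=None):
    """Convert heads to drawdown relative to a reference (default: first value)."""
    ref = heads[0] if reference is None else reference
    return [ref - h for h in heads]

def rmse(observed, simulated):
    """Root mean squared difference between two equal-length series."""
    n = len(observed)
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(observed, simulated)) / n)

# Hypothetical heads (m) during a pumping test. The simulated heads sit about
# 1.5 m above the observed heads, but the response to pumping is similar.
observed_heads = [100.0, 99.2, 98.7, 98.4, 98.3]
simulated_heads = [101.5, 100.6, 100.1, 99.9, 99.8]

# Each series uses its own initial head as the reference datum.
observed_dd = drawdowns(observed_heads)
simulated_dd = drawdowns(simulated_heads)

rmse_head = rmse(observed_heads, simulated_heads)
rmse_drawdown = rmse(observed_dd, simulated_dd)

print("RMSE on absolute heads (m):", round(rmse_head, 3))
print("RMSE on drawdowns (m):     ", round(rmse_drawdown, 3))
```

The drawdown statistic isolates how well the model reproduces the response to pumping, while the statistic on absolute heads is dominated by the datum offset. Both are worth reporting, since the offset may itself point to a problem with the regional calibration.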
Figure 7-7: Example of transient calibration affected by heterogeneity and initial conditions (data from Abbotsford-Sumas Aquifer, Scibek and Allen, 2005): (a - left panel) poor fit due to aquifer heterogeneity, (b - right panel) reasonable overall fit but with misfit at early times due to wrong initial conditions.
Figure 7-8: Example of transient model calibration using normalized drawdown. Graphs illustrate residuals for transient calibration to pumping tests, expressed as drawdown relative to the initial (static) water level rather than as absolute elevation.
7.3.5 Adequacy of Calibration
Subjective judgment of acceptability is based on confirming observations from components of the modelling process, not only the results of model calibration (Woessner and Anderson, 1996). The evaluation of the adequacy of the calibration of a model should be based more on the insight of the modeller and the appropriateness of the conceptual model than on the exact value of the various measures of goodness of fit. The reviewers should keep in mind that:
- The fact that a model has been constructed and calibrated does not ensure that it is an accurate representation of the system.
- The appropriateness of the boundaries and the system conceptualization is frequently more important than achieving the smallest differences between simulated and observed heads and flows (USGS, 2004).
In hydrogeological modelling practice, a model is commonly considered calibrated when the correlation coefficient is high and the NRMSE is low. There is no single prescriptive numerical criterion for NRMSE because each model is different and there are many considerations other than an average residual error measure. Generally, an NRMSE under 10% is good in many models, and under 5% is very good in terms of average residual fit. However, the following should be taken into consideration:
- A very low NRMSE is not desirable if the model requires unreasonable assumptions and complexity to achieve that result.
- The average fit statistic does not take into account the potential bias, the outliers, and the spatial distribution of residuals.
- At sites with high topographic relief, a 10% error bound allows the modeller 20-50 m error bars on any head calibration point.
- Presentation of absolute errors (in metres) is very useful as a reality check.
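The cautions above can be illustrated with a short calculation. The sketch below is an illustration only: the head values are hypothetical, the function name residual_summary is not from any modelling package, the residual is assumed to be simulated minus observed, and the NRMSE is assumed to be the RMSE divided by the range of observed heads.

```python
import math

def residual_summary(observed, simulated):
    """Summary statistics for calibration residuals (simulated minus observed)."""
    residuals = [s - o for o, s in zip(observed, simulated)]
    n = len(residuals)
    head_range = max(observed) - min(observed)
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    return {
        "mean_error": sum(residuals) / n,                      # indicates bias
        "mean_abs_error": sum(abs(r) for r in residuals) / n,
        "rmse": rmse,
        "nrmse_percent": 100.0 * rmse / head_range,
        "max_abs_error": max(abs(r) for r in residuals),       # worst outlier
        "head_range": head_range,
    }

# Hypothetical head targets (m) at a site with 500 m of head variation.
observed = [250.0, 400.0, 520.0, 610.0, 750.0]
simulated = [262.0, 385.0, 540.0, 600.0, 790.0]

summary = residual_summary(observed, simulated)

# A 10% NRMSE criterion expressed in absolute terms for this head range.
ten_percent_bound = 0.10 * summary["head_range"]

for name, value in summary.items():
    print(name, "=", round(value, 2))
print("10% of head range (m) =", ten_percent_bound)
```

In this hypothetical data set the NRMSE is under 5%, yet one target is missed by 40 m and the mean error shows a positive bias. Reporting the absolute errors alongside the normalized statistic makes such issues visible to the reviewer.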
Over-calibration occurs when the model parameters are artificially fine-tuned (to minimize calibration residuals) to a higher degree of precision than is warranted by the knowledge or measurability of the physical hydrogeological system. Without performing model verification (see Section 7.4), the artificially low residuals might otherwise be used to overstate the precision of the model's predictions. During model review, the following may indicate over-calibration:
- The presence of many more zones of equal hydraulic properties than is supported by the available geologic and test data. In any calibration method, the choice to set up recharge zonation is made by the modeller based on physical processes and available data.
- Calibrated values are very precise, to within centimetres at a site with large topographic relief.
- The residuals are very small and there are no outliers (the whole model domain is very well calibrated).
- Very low NRMSE is reported without supporting information about spatial and temporal distribution of residuals or distribution of final calibrated parameters.
In theory it would be possible to specify every cell in a model that has an observation associated with it as a specified head cell in the model. This would produce a perfect match between simulated and observed heads. It is conceptually unreasonable to simulate random cells as specified heads that could serve as sources and sinks of water. Thus, although the measures of calibration might make it appear to be a well-calibrated model, in effect the violation of a reasonable conceptual model makes it a poor model.
A model with a very small number of calibration targets may suffer from calibration that is too coarse if the modelling objectives require more detailed model predictions at a spatial scale not supported by the data.
Important aspects of the model that influence its appropriateness for addressing the modelling objectives, such as the conceptualization of the flow system, are often not considered during calibration by many investigators; instead, their focus is on the quantitative measures of goodness of fit. The appropriateness of the conceptualization of the groundwater system and processes should be evaluated during calibration. Thus, the method of calibration, the closeness of fit between the simulated and observed conditions, and the extent to which important aspects of the simulation were considered during the calibration process are all important in evaluating the appropriateness of the model to address the problem objectives.
7.4 Model Verification (Model Testing)
Once the model is calibrated, additional testing of the calibrated model is advised. This process is called model verification (Woessner and Anderson, 1996).
7.4.1 Verification Process
The process of verification involves running the calibrated model in predictive mode to check whether the prediction reasonably matches the observations of a reserved data set, deliberately excluded from consideration during calibration.
The resulting degree of correspondence can be taken as an indicator or heuristic measure of the uncertainty inherent in the model's predictions.
If adjustments to parameters or boundary conditions are required to achieve verification, then the calibration simulation needs to be re-run, and re-assessed. This process may need to be repeated until a set of parameters and boundary conditions is identified that produces a good match to both the calibration and verification data sets.
When only one data set is available, it is not advisable to artificially split it into separate "calibration" and "verification" data sets; it is usually more important to calibrate to data spanning as much of the modelled domain as possible. A data set refers to the entire site data (e.g. heads, fluxes, unit geometries) that are sufficient for good model calibration. A second data set might consist of a second set of measurements (heads, fluxes) taken during some stress test (e.g. a large-scale pumping test, a new excavation, or flooding of an existing excavation).
The confidence in the model's performance as a predictive tool would be enhanced if the verification data set was also from a distinct hydrological period (compared to the calibration data set), consistent with recommendations to address the non-uniqueness issue. Verification of a transient model may also be performed against a set of reserved groundwater level hydrographs from the same calibration period, which were not part of the original calibration. In general, the greater the change and the applied stress, the stronger the model verification.
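The comparison at the heart of the verification process can be sketched as follows. This is a minimal illustration under stated assumptions: the (observed, simulated) head pairs are hypothetical, the reserved data set is assumed to come from an independent stress period that was never used during calibration, and the RMSE is used only as one convenient measure of correspondence.

```python
import math

def rmse(pairs):
    """RMSE for a list of (observed, simulated) head pairs."""
    return math.sqrt(sum((sim - obs) ** 2 for obs, sim in pairs) / len(pairs))

# Hypothetical (observed, simulated) heads in metres.
# Calibration targets: parameters were adjusted to match these.
calibration_pairs = [(101.2, 101.0), (98.7, 98.9), (95.4, 95.1), (92.0, 92.2)]

# Reserved targets: measured during a later stress period and deliberately
# excluded from calibration. The calibrated model is run without changes.
verification_pairs = [(99.5, 100.3), (94.1, 93.2), (90.8, 91.9)]

rmse_calibration = rmse(calibration_pairs)
rmse_verification = rmse(verification_pairs)

print("Calibration RMSE (m): ", round(rmse_calibration, 3))
print("Verification RMSE (m):", round(rmse_verification, 3))
print("Ratio:", round(rmse_verification / rmse_calibration, 1))
```

A verification error that is much larger than the calibration error, as in this hypothetical case, suggests that the calibration statistics overstate the predictive precision of the model. The verification error is the more realistic heuristic indicator of prediction uncertainty.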
7.4.2 Model Verification Benefits
The benefits of model verification include:
- Verification improves confidence in the calibrated model by using an independent data set.
- Verification estimates the range of uncertainty associated with model predictions, before the predictions are made.
- Verification, with good calibration, provides a level of predictive accuracy that is consistent with the degree of confidence required to answer the modelling objectives.
- Verification protects against over-calibration.
7.4.3 Examples
Examples of model verification include:
- Calibrate using a pumping test and predict seasonal behaviour.
- Two-way pumping test (calibrate to one pumping test and model the second test).
- Observe the predicted response in new monitoring wells (drilled after model calibration).
- Compare the predicted and observed response over time (e.g. pit dewatering, operation of a tailings storage facility (TSF), etc.).
A calibrated but unverified model may still be used as a predictive tool, provided a sensitivity analysis is undertaken on the calibration and prediction simulations (see Section 8).
7.5 Project Life Calibration
Projects in the natural resource industry have a significant project life, often extending over many years to decades. These projects provide many opportunities to verify and, if required, recalibrate a groundwater model (Figure 7-9). For example, recalibration may be undertaken during the early stages of operations, when the groundwater system is responding to large new hydraulic stresses such as open pit development and/or start-up of a tailings impoundment. The regulatory decisions and requirements for additional reporting or permitting after model recalibration will be discussed at that time. This process may continue through the operation of the mine and into closure and post-closure.
Figure 7-9: Potential for calibration and verification during life of project.
7.6 Case Studies
7.6.1 Case Study 2: Underground Mine
Case Study 2 illustrates the use of multiple calibration models to address uncertainty in the conceptual model. An overview of the project and modelling objectives can be found in Section 3 of the guidelines.
The calibration was carried out in two distinct phases of parallel modelling. Due to uncertainty in the conceptual model, two simulations (Model A and Model B) were carried through the steady state head and flow calibration phases. Both models are based on the hydrogeological units presented in Figure 7-10a. These simulations did not initially include the major fault zone inferred to run through the Site.
During model setup, uncertainty in the distribution of hydraulic conductivity was recognized as having a potential influence on predictions. To address this uncertainty, multiple conceptual models were calibrated in parallel to assess potential effects on model results. Calibration was completed in an initial phase for two plausible hydraulic conductivity distributions based on the same hydrogeological units. During a second calibration phase, hydrogeological units were modified. The hydrogeological unit distribution and calibration results from the different conceptual models are presented in Figure 7-10a and Table 7-1, respectively.
Calibration results for Model A and Model B show nRMSE of less than 10%, which the modeller judged to be acceptable given the (limited) size of the dataset.
Table 7-1: Calibration results for initial phase.
Hydraulic Conductivity Values (m/s)
Figure 7-10a: Hydrogeologic units for first phase of calibration.
Figure 7-10b: Hydrogeological units for second phase of calibration (change is the relatively higher hydraulic conductivity and size in the "granodiorite" unit, and constraints on hydraulic conductivity for other units).
Figure 7-10: Distribution of hydrogeologic units for Case Study 2 calibration phases.
In addition to Model A and Model B of the first calibration phase, a second calibration phase involved the inclusion of a new hydrogeological unit in the mine area to represent a zone of (relatively) higher hydraulic conductivity. The distribution of hydrogeological units of this third model is presented in Figure 7-10b. Furthermore, the rate of sub-glacial recharge was increased in order to assess model sensitivity to this parameter. Again, the calibration targets were achieved for this conceptual model (see Figure 7-11).
Table 7-2 presents the different model results for the various calibrated models. It is seen that the second conceptual model (which includes higher K zones) predicts higher inflows to the underground workings. This example illustrates the non-uniqueness of model calibration. Three different conceptual models could be calibrated with equally good calibration statistics.
Table 7-2: Comparison of model results for Case Study 2 conceptual models and calibrations.
Figure 7-11: Scatterplot of calibration results for Second Calibration model, Case Study 2.
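The non-uniqueness illustrated by this case study can be reproduced with a simple analytical calculation. The sketch below is not the Case Study 2 model; it assumes an idealized one-dimensional aquifer of constant transmissivity T with uniform recharge R, a no-flow divide at x = 0, and a fixed head at x = L, for which the steady-state head is h(x) = h_L + R (L^2 - x^2) / (2T). All parameter values and function names are hypothetical.

```python
def head(x, recharge, transmissivity, length, h_boundary):
    """Steady-state head (m) for uniform recharge, no-flow at x=0, fixed head at x=length."""
    return h_boundary + recharge / (2.0 * transmissivity) * (length ** 2 - x ** 2)

def boundary_discharge(recharge, length):
    """Discharge to the fixed-head boundary per unit aquifer width (m2/s)."""
    return recharge * length

LENGTH = 1000.0      # m, distance from divide to fixed-head boundary
H_BOUNDARY = 50.0    # m, head at the boundary
wells = [0.0, 250.0, 500.0, 750.0]   # observation locations (m)

# Two hypothetical parameter sets with the same ratio R/T.
model_a = {"recharge": 1.0e-8, "transmissivity": 1.0e-4}   # m/s, m2/s
model_b = {"recharge": 3.0e-8, "transmissivity": 3.0e-4}

heads_a = [head(x, model_a["recharge"], model_a["transmissivity"], LENGTH, H_BOUNDARY) for x in wells]
heads_b = [head(x, model_b["recharge"], model_b["transmissivity"], LENGTH, H_BOUNDARY) for x in wells]

q_a = boundary_discharge(model_a["recharge"], LENGTH)
q_b = boundary_discharge(model_b["recharge"], LENGTH)

print("Heads, model A (m):", [round(h, 2) for h in heads_a])
print("Heads, model B (m):", [round(h, 2) for h in heads_b])
print("Discharge ratio B/A:", round(q_b / q_a, 2))
```

Both parameter sets reproduce the same heads because the head solution depends only on the ratio R/T, yet the predicted discharge differs by a factor of three. A calibration to heads alone cannot distinguish between them, which is why flux observations are valuable as calibration targets.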
7.6.2 Case Study 3: Groundwater Extraction Project
Case Study 3 illustrates the use of both steady-state and transient calibration. For groundwater extraction projects, pumping tests are generally expected. Models developed to assess these projects benefit from calibration not only to aquifer-wide hydraulic heads (ideally) but also to the transient aquifer response to the pumping test. In this case study, these calibration phases allowed calibration of both hydraulic conductivity and storage parameters.
In Case Study 3, an initial, steady-state baseline model was calibrated to observed head values from across the model domain. Calibration residuals were presented in scatterplot form (Figure 7-12).
Following this, the model was calibrated transiently to a pumping test conducted on one of the pumping wells. Transient calibration results were presented as time series and are illustrated in Figure 7-13. This calibration allowed better constraint of specific storage and specific yield, in the area affected by the pumping test.
Additional, aquifer-wide transient calibration was then completed by applying the storage parameters from the pumping test calibration to the entire aquifer and applying seasonal recharge rates for different time periods.
Results from this last transient calibration period were presented graphically, as maps showing both observed and modeled water levels for two discrete periods (October and January). Figure 7-14 shows the water table calibration map for the October period.
By using steady-state and transient calibrations, confidence in model predictions is improved. For groundwater extraction studies, in which the water balance for the entire aquifer is of importance, transient calibration to both pumping tests and seasonal water level fluctuations is important. Calibration only to the pumping test data would not necessarily provide confidence that other important factors, such as seasonal recharge, were appropriately evaluated.

Figure 7-12: Scatterplot of steady-state calibration residuals for Case Study 3.
Figure 7-13: Results of transient calibration to pumping test data for Case Study 3.
Note: Blue lines = observed; red lines = simulated
Figure 7-14: Water table calibration map for Case Study 3 seasonal transient model.
Summary Points for Model Calibration & Verification
- Model calibration is an iterative process of refining the numerical model's representation of the hydrogeological system to achieve a desired degree of correspondence between the calculated values (model simulation) and observations of the groundwater flow system.
- Alternative conceptual models should be discussed and/or calibrated if there is large uncertainty in the choice of one.
- The observed data used as calibration targets must have sufficient spatial distribution for all models, and sufficient temporal distribution for transient models. A large number of uniformly distributed calibration targets will increase the likelihood of obtaining a unique calibration.
- The error bounds and calibration targets should be set before the calibration process. Sources of error in each calibration point should be assessed and quantified to assign relative weights to data points before starting calibration.
- Calibration of a numerical model may be done by manual trial and error or by automatic parameter estimation methods, or a combination of the two. Manual trial-and-error calibration gives the modeller significant insight into the factors controlling the system and should always be part of model calibration.
- The use of automated parameter estimation techniques requires significant specialized experience by the modeller. If not used properly, this method may yield incorrect results or no result at all.
- Non-uniqueness arises because many different sets of model input parameters can produce nearly identical model outputs. Non-uniqueness cannot be eliminated, but it may be reduced by restricting parameter ranges to observed ranges and by using groundwater flux observations or ranges to constrain the solution.
- Calibration residuals are the basic quantitative measures of goodness of fit, and their distribution in space and in relation to model features and boundaries is very important. In general, the residual should be a small fraction of the difference between the highest and lowest heads across the site, but acceptable residuals depend on site-specific model setup.
- Graphical model error plots such as scatterplots of residuals are very important in describing the model goodness of fit of individual calibration targets (usually hydraulic heads). It is very important to display the spatial distribution of residuals from calibration.
- Transient calibration can provide improved confidence in model results, particularly in cases where seasonal effects may be important.
- Subjective judgment of acceptability is based on confirming observations from components of the modelling process, not only the results of model calibration. The insight of the modeller and the appropriateness of the conceptual model are more important than the exact value of the various measures of goodness of fit. The fact that a model has been constructed and calibrated does not ensure that it is an accurate representation of the system.
- Verification reduces uncertainty of model predictions. Projects in the natural resource industry have a significant project life often extending over many years to decades. Those projects provide many opportunities to verify, and if required, recalibrate a groundwater model.
Review Questions
1. Which of the following statements about the calibration process is FALSE?
   A. Model calibration is an iterative process of refining the numerical model's representation of the hydrogeological framework.
   B. Calibration parameters are adjusted to match the selected calibration targets.
   C. Alternative conceptual models could be calibrated.
   D. Observed data used as calibration targets must have sufficient spatial distribution for all models.
   E. A small residual measure indicates a unique solution.
   F. Difficulties in calibrating the numerical model to field data may indicate a problem with the quality of the monitoring data.
   G. The insight of the modeller and the appropriateness of the conceptual model are more important than the exact value of the measure of goodness of fit.
   H. The fact that a model has been constructed and calibrated does not ensure that it is an accurate representation of the system.
2. How can non-uniqueness be reduced in model calibration?
   A. By using more data for calibration targets.
   B. Restrict the calibration parameter ranges to observed parameter ranges.
   C. Use the observed range of groundwater flux to constrain the solution.
   D. Use more strict calibration criteria.
   E. Include detailed hydraulic parameter and recharge distribution zones and as detailed as possible boundary conditions.
   F. Over-calibrate the model to ensure good fit.
   G. A, B, and C.
   H. All of the above.
3. Graphical representation of calibration residuals is usually shown with:
   A. Scatterplots and histograms of residuals.
   B. Maps of head residuals.
   C. Time series of head values at observation points.
   D. Flux magnitudes and directions at boundaries.
   E. All of the above.
4. A model calibration result is acceptable if the following criteria are satisfied:
   A. nRMSE is below 5%.
   B. Head residuals are small in most important areas of the model and are evenly distributed in space, and/or flux is within observed bounds.
   C. Modeller is convinced that the model is properly calibrated.
   D. Model results are judged to be reasonable relative to the conceptual model.
   E. B and D.
   F. All of the above.
5. The main purpose of model verification is:
   A. Validation of the predictive model to prove the model accuracy.
   B. Checking of model calibration with a different data set.
   C. To evaluate the uncertainty of model predictions.
   D. Re-calibration with new data sets over the project life.
Proceed to Section 8: Model Prediction & Uncertainty