health_dev: Subnational reproductive, maternal, newborn, child and adolescent health and development atlas for India, version 1.1
2022-07-26
Table 1. Files and their descriptions within the health_dev GitHub repository for the paper Subnational reproductive, maternal, newborn, child, and adolescent health and development atlas for India.
Name | Type | Description |
---|---|---|
out | Folder | Folder to contain the prediction and uncertainty gridded datasets (raster files) produced by the prediction R script and the out-of-sample cross-validation summary statistics (csv files) produced by the validation R script. |
rda | Folder | Folder to contain INLA objects and the model summary statistics (saved as rda files) produced from the modelling R script. Files within this folder will be required to run the prediction and validation R scripts. |
shp | Folder | This folder contains the shapefiles required to run all R scripts in this repository. These should be the administrative boundaries of the study area as polygons and the locations of the clusters in the study area as points (lat/lon). These shapefiles can be obtained from the DHS Program at www.dhsprogram.com. |
tif | Folder | This folder contains the raster files for all geospatial covariates. Files within this folder are required to run the prediction R script. Examples of geospatial covariate datasets can be found at www.hub.worldpop.org/project/categories?id=14. |
covariates | csv | This file contains a demo of the format of the data extracted from the geospatial covariates considered when modelling the health and development indicators. This file is required to run all R scripts in this repository. Examples of geospatial covariate datasets can be found at https://hub.worldpop.org/project/categories?id=14. |
indicators | csv | This file contains a demo of the health and development indicators to model. This file is required to run all R scripts in this repository. The indicators were extracted from the India NFHS-4 (National Family Health Survey 4) 2015-16 DHS (Demographic and Health Surveys) database (1-3), which is publicly available after registration on the Measure DHS website (www.dhsprogram.com). |
modelling | R | R script for modelling the health and development indicators. The files required to run this script are the covariates and indicators csv files and the files in the shp folder. This script outputs an INLA object and the model summary statistics (both saved as rda files). Further description of the methodology is given in the sections below. |
prediction | R | R script for predicting the health and development indicators. The files required to run this script are the covariates and indicators csv files, the files in the shp folder, the files in the tif folder, and the files in the rda folder. This script outputs a prediction gridded dataset (tif file) and an uncertainty gridded dataset (tif file) for the target indicator; both are saved to the out folder. |
validation | R | R script for out-of-sample (k-fold) validation of the models of the health and development indicators. The files required to run this script are the covariates and indicators csv files, the files in the shp folder, and the files in the rda folder. This script outputs k-fold summary statistics as csv files. Further description of the methodology is given in the sections below. |
The geospatial covariate selection has two stages. In the first stage, we check for multicollinearity amongst the geospatial covariates. In the second stage, we apply backward stepwise model selection.
To check for multicollinearity, a Pearson correlation matrix of the geospatial covariates is created and any pair whose Pearson correlation coefficient exceeds the chosen threshold is flagged. Each covariate in a flagged pair is then fitted on its own in a non-Bayesian binomial generalised linear model (GLM), and the Bayesian information criterion (BIC) of each model is calculated. The covariate in the model with the lower BIC is retained, while the covariate in the model with the greater BIC is omitted for the target indicator. To further ensure that multicollinearity is not a problem among the remaining geospatial covariates, variance inflation factors (VIFs) are calculated. If any covariate returns a VIF > 4, it is omitted.
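The screening step above can be sketched as follows. This is a minimal Python illustration, not the repository's R code: the covariate names and values are made up, the 0.8 threshold is a placeholder rather than the value used in the study, and comparing the absolute value of the correlation is an assumption. The VIFs are taken from the diagonal of the inverse correlation matrix, which is equivalent to computing 1 / (1 - R²) for each covariate regressed on the others. The BIC comparison between flagged covariates is not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flag_correlated_pairs(covariates, threshold):
    """Return pairs of covariate names whose |r| exceeds the threshold."""
    names = sorted(covariates)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(covariates[a], covariates[b])) > threshold:
                flagged.append((a, b))
    return flagged

def vif(covariates):
    """VIF per covariate: the diagonal of the inverse correlation matrix."""
    names = sorted(covariates)
    k = len(names)
    # Augmented matrix [R | I], reduced by Gauss-Jordan elimination.
    aug = [[pearson(covariates[a], covariates[b]) for b in names]
           + [1.0 if i == j else 0.0 for j in range(k)]
           for i, a in enumerate(names)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return {name: aug[i][k + i] for i, name in enumerate(names)}

# Hypothetical covariate values at five cluster locations.
covs = {
    "elevation": [1, 2, 3, 4, 5],
    "temperature": [2, 4, 6, 8, 11],
    "rainfall": [5, 1, 4, 2, 3],
}
print(flag_correlated_pairs(covs, threshold=0.8))  # placeholder threshold
print({name: round(v, 2) for name, v in vif(covs).items()})
```

In this toy data, elevation and temperature are almost perfectly correlated, so the pair is flagged and both have VIFs far above 4.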
After checking for multicollinearity, a backward model selection algorithm is used to select the best (sub)set of geospatial covariates for the target indicator. The algorithm is as follows. The remaining geospatial covariates are fitted in a non-Bayesian binomial GLM and the BIC is calculated. A covariate is removed from the model and the BIC is recalculated. If the recalculated BIC is less than the previously calculated BIC, the reduced subset of covariates is preferred. These steps are repeated until the recalculated BIC is no longer less than the BIC from the previous iteration. At this point, the best (sub)set of geospatial covariates has been obtained, and it is used when constructing the Bayesian point-referenced spatial binomial GLM in INLA.
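The loop above can be sketched as follows, again in Python rather than the repository's R. The function `bic_of` is a hypothetical callable standing in for "fit a binomial GLM with these covariates and return its BIC", and the toy scoring function and covariate names are invented for illustration. The text does not say which covariate is removed at each step; this sketch assumes the removal that gives the lowest BIC is tried.

```python
def backward_select(covariates, bic_of):
    """Backward elimination driven by BIC.

    covariates: list of covariate names to start from.
    bic_of: callable returning the BIC of a model fitted with a given
            tuple of covariates (hypothetical; supplied by the caller).
    """
    current = tuple(covariates)
    current_bic = bic_of(current)
    while len(current) > 1:
        # Try dropping each covariate in turn and keep the best candidate.
        candidates = [tuple(c for c in current if c != drop) for drop in current]
        best = min(candidates, key=bic_of)
        best_bic = bic_of(best)
        if best_bic >= current_bic:
            break  # no removal lowers the BIC, so stop
        current, current_bic = best, best_bic
    return list(current), current_bic

# Toy stand-in for a fitted-model BIC: informative covariates lower the
# score, and every covariate adds a fixed complexity penalty.
USEFUL = {"ndvi": 30, "travel_time": 20}

def toy_bic(subset):
    return 100 - sum(USEFUL.get(c, 0) for c in subset) + 5 * len(subset)

selected, score = backward_select(
    ["ndvi", "travel_time", "elevation", "rainfall"], toy_bic
)
print(selected, score)
```

With the toy score, the two uninformative covariates are dropped one at a time and the loop stops once removing either remaining covariate would raise the BIC.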
The constructed Bayesian point-referenced spatial binomial GLM is given as follows.
The number of events of the target indicator, $Y(s_i)$, within cluster location $s_i$ for $i = 1, \dots, n$ follows a Binomial distribution with the total number of surveys conducted within the cluster location, $N(s_i)$, and the proportion of events happening in the cluster, $p(s_i)$:

$$Y(s_i) \sim \text{Binomial}\big(N(s_i),\, p(s_i)\big).$$

With a logit link, $p(s_i)$ is calculated from a linear combination of the fixed effects $x(s_i)^\top \beta$, spatial random effects $\omega(s_i)$ and independent and identically distributed (iid) random effects $\epsilon(s_i)$:

$$\text{logit}\big(p(s_i)\big) = x(s_i)^\top \beta + \omega(s_i) + \epsilon(s_i).$$

The fixed effects are given by the geospatial covariates $x(s_i)$ selected by the backward model selection algorithm mentioned above, and $\beta$ is a vector of regression coefficients to be estimated. The spatial random effects follow a multivariate normal distribution with zero mean and covariance matrix $\Sigma$:

$$\omega = \big(\omega(s_1), \dots, \omega(s_n)\big)^\top \sim \text{MVN}(0, \Sigma).$$

In this study, the elements of the covariance matrix are calculated with the exponential covariance function, which depends on the spatial variance $\sigma^2$, the spatial decay parameter $\phi$ and the Euclidean distance matrix $D$ between the cluster locations:

$$\Sigma = \sigma^2 \exp(-\phi D),$$

where the exponential is applied element-wise. The parameters $\sigma^2$ and $\phi$ are unknown and are to be estimated in INLA. The iid random effects follow a normal distribution with a mean of zero and an unknown variance $\tau^2$, $\epsilon(s_i) \sim N(0, \tau^2)$, which will be estimated along with the other parameters mentioned above.
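As a small worked illustration of the two building blocks described above, the following Python sketch builds an exponential covariance matrix from cluster coordinates and applies the inverse logit to a linear predictor. It assumes the parameterisation variance × exp(-decay × distance) and planar coordinates; the coordinates and parameter values are arbitrary, and none of this is taken from the repository's R scripts, where INLA estimates these quantities.

```python
import math

def exponential_covariance(coords, sigma2, phi):
    """Covariance matrix with entries sigma2 * exp(-phi * d_ij),
    where d_ij is the Euclidean distance between locations i and j."""
    return [[sigma2 * math.exp(-phi * math.dist(a, b)) for b in coords]
            for a in coords]

def inverse_logit(eta):
    """Map a linear predictor value to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def proportion(x, beta, omega, epsilon):
    """Proportion at one location: fixed effects plus the two random effects."""
    eta = sum(xj * bj for xj, bj in zip(x, beta)) + omega + epsilon
    return inverse_logit(eta)

# Arbitrary example: three locations, variance 2.0, decay 0.1.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
cov = exponential_covariance(coords, sigma2=2.0, phi=0.1)
for row in cov:
    print([round(v, 4) for v in row])

# Intercept plus one covariate, with small random effects.
print(round(proportion([1.0, 0.5], [-1.0, 2.0], omega=0.2, epsilon=-0.2), 4))
```

The diagonal of the matrix equals the spatial variance, and the covariance decays towards zero as the distance between locations grows.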
Additional components must be constructed before fitting the model in INLA. First, a mesh of the study domain is constructed from the shapefile and the coordinates within the target indicator file. Using this mesh object, a stochastic partial differential equation (SPDE) object is defined with functions in INLA, where the priors of the spatial decay parameter and the spatial variance parameter are defined. With the mesh object, the INLA "A" matrices are created and stacked with the INLA stack functions. Finally, these components, along with the model formula, are passed to the INLA function to fit the model.
The prediction R script loads the INLA object (saved from the modelling R script) and generates posterior samples from it. Then it reads in the raster files corresponding to the geospatial covariates of the model for the target indicator and compiles them into a prediction data frame. Finally, the predicted values are computed from the prediction data frame, the INLA mesh objects and the INLA posterior sample objects, and are assigned to the cells of the raster file, producing the high-resolution (5x5km) prediction and uncertainty gridded datasets (surfaces) as tif files.
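To make the last step concrete, the sketch below summarises posterior samples for a single grid cell in Python. It assumes each sample is a value of the linear predictor on the logit scale, and it uses the posterior standard deviation as the uncertainty measure; the text does not state which measure the repository's R script uses, so that choice is an assumption.

```python
import math
import statistics

def inverse_logit(eta):
    """Map a linear predictor value to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def summarise_cell(linear_predictor_samples):
    """Posterior mean and standard deviation of the proportion in one cell."""
    p = [inverse_logit(eta) for eta in linear_predictor_samples]
    return statistics.fmean(p), statistics.stdev(p)

# Made-up posterior samples of the linear predictor for one cell.
samples = [-0.4, -0.1, 0.0, 0.2, 0.3]
mean_p, sd_p = summarise_cell(samples)
print(round(mean_p, 3), round(sd_p, 3))
```

Repeating this for every cell gives one value per cell for the prediction surface and one for the uncertainty surface.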
The validation R script assesses the performance of the model constructed for the target indicator in the modelling R script using k-fold cross-validation and computes evaluation metrics. The k-fold cross-validation first partitions the dataset into k parts, then trains the model on k-1 parts and tests the trained model on the remaining part, repeating this so that each part is used once for testing. The model is the Bayesian point-referenced spatial generalized linear model constructed in the modelling R script (i.e., with the same (sub)set of geospatial covariates) for the target indicator. For each fold, the following evaluation metrics are calculated:
the Pearson correlation coefficient, the root mean squared error, the mean absolute error, the percentage bias, and the coverage rate. In these evaluation metrics, $y$ denotes the observed values, i.e., the proportions of the target indicator partitioned for testing, and $\hat{y}$ denotes the predicted mean values from the Bayesian point-referenced spatial binomial generalized linear model.

The notation $r$ denotes the Pearson correlation coefficient, which is calculated from the covariance of the observed and predicted values and the standard deviations of the observed and predicted values:

$$r = \frac{\text{cov}(y, \hat{y})}{\sigma_{y}\, \sigma_{\hat{y}}}.$$

Here, note that the vectors are $y = (y_1, \dots, y_m)$ and $\hat{y} = (\hat{y}_1, \dots, \hat{y}_m)$, where $m$ is the number of observations partitioned for testing. Better predictive performance is reflected by a greater Pearson correlation coefficient. The root mean squared error (RMSE), mean absolute error (MAE) and percentage bias have straightforward calculations that do not require additional explanation. Better predictive performance is reflected by smaller RMSE, MAE and percentage bias values. The coverage rate ranges from 0 to 100 and is given by

$$\text{coverage rate} = \frac{100}{m} \sum_{i=1}^{m} I_i.$$

First, $I_i$ in the equation is defined as follows:

$$I_i = \begin{cases} 1, & \text{if } l_i \le y_i \le u_i, \\ 0, & \text{otherwise,} \end{cases}$$

where $l_i$ and $u_i$ represent the 0.025 quantile and the 0.975 quantile of the ith predicted value. To put it simply, $I_i$ is either 1 or 0, for $i = 1, \dots, m$: if the observed value lies between the 0.025 quantile and the 0.975 quantile of the predicted value, $I_i = 1$, otherwise $I_i = 0$. Better predictive performance is reflected by a higher coverage rate.
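The five metrics can be sketched in Python as below; this is an illustration, not the repository's R code. The RMSE, MAE, Pearson correlation and coverage rate follow their standard definitions. Percentage bias has more than one definition in common use; the form 100 × sum(predicted - observed) / sum(observed) is assumed here and may differ from the one used in the validation script. All input values are invented.

```python
import math

def pearson_r(obs, pred):
    """Covariance of obs and pred divided by the product of their SDs."""
    m = len(obs)
    mo, mp = sum(obs) / m, sum(pred) / m
    sxy = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((p - mp) ** 2 for p in pred)
    return sxy / math.sqrt(sxx * syy)

def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)

def percentage_bias(obs, pred):
    """Assumed definition: 100 * sum(pred - obs) / sum(obs)."""
    return 100.0 * sum(p - o for o, p in zip(obs, pred)) / sum(obs)

def coverage_rate(obs, lower, upper):
    """Percentage of observations inside their 0.025 to 0.975 quantile interval."""
    inside = [1 if lo <= o <= hi else 0 for o, lo, hi in zip(obs, lower, upper)]
    return 100.0 * sum(inside) / len(obs)

# Invented test-fold values: observed and predicted proportions, plus the
# lower and upper quantiles of each prediction.
obs = [0.2, 0.4, 0.6, 0.8]
pred = [0.25, 0.35, 0.65, 0.75]
lower = [0.1, 0.3, 0.7, 0.6]
upper = [0.3, 0.5, 0.9, 0.9]

print(round(pearson_r(obs, pred), 4))
print(round(rmse(obs, pred), 4), round(mae(obs, pred), 4))
print(round(percentage_bias(obs, pred), 4))
print(coverage_rate(obs, lower, upper))
```

In this example three of the four observations fall inside their intervals, so the coverage rate is 75.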
The validation R script returns csv files with the evaluation metrics calculated for each fold for the model of the target indicator being validated.
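The fold construction behind these per-fold results can be sketched as follows. This Python illustration assumes a simple random partition of the cluster indices; the repository's R script may partition the data differently (for example, with spatially structured folds), and the seed and fold count are arbitrary.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k shuffled test folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_test_splits(n, k, seed=0):
    """Yield (train, test) index lists: train on k-1 folds, test on the other."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Ten clusters split into five folds.
for fold_number, (train, test) in enumerate(train_test_splits(10, 5, seed=1), 1):
    print(fold_number, sorted(test), len(train))
```

Each cluster appears in exactly one test fold, so every observation is predicted out of sample once, and one row of metrics can be written per fold.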
The work is funded by the Children's Investment Fund Foundation (CIFF) (R-2009-05106). The authors acknowledge the support of the PMO Team at WorldPop and would like to thank the EME and India Programme Team at CIFF for their inputs and continuous support, and all staff at CIFF who provided feedback at each stage of this work. Moreover, the authors would like to thank the DHS Program staff for their input on the construction of some of the indicators. This work was approved by the ethics and research governance committee at the University of Southampton (ERGO 64920).
Chan, H.M.T., Dreoni, I., Tejedor-Garavito, N., Kerr, D., Bonnie, A., Tatem, A.J. and Pezzulo, C. 2022. health_dev: Subnational reproductive, maternal, newborn, child and adolescent health and development atlas for India, version 1.1. WorldPop, University of Southampton.
- International Institute for Population Sciences - IIPS/India and ICF [Producers]. 2017. National Family Health Survey NFHS-4, 2015-16: India [Datasets IABR74DT.dta; IACR74DT.dta; IAHR74DT.dta; IAIR74DT.dta; IAKR74DT.dta; IAMR74DT.dta; IAPR74DT.dta; IAGE71FL.shp]. Mumbai: IIPS. ICF [Distributor], 2017. (www.dhsprogram.com)
- International Institute for Population Sciences (IIPS) and ICF. 2017. National Family Health Survey NFHS-4, 2015-16: India. Mumbai, India: IIPS and ICF. Available at http://dhsprogram.com/pubs/pdf/FR339/FR339.pdf.
- The DHS Program Code Share Project, Code Library, DHS Program. 2022. DHS Program GitHub site. https://github.com/DHSProgram.