Multiple imputation for missing dating data¶
Network imputation for missing dating data in EDH¶
Contents¶
- Imputation and Missing data
  - Missing data problems
  - MICE algorithm
- Temporal uncertainty and relative dating
  - time-spans of existence and missing dates (within-phase uncertainty)
- Roman inscriptions in EDH database
  - variable attributes similarities
- Statistical inference on missing dates
  - multivariate and univariate distribution of missing data (joint vs conditional modelling)
  - MCMC
- FCS and multiple imputation for EDH province data set
  - mice implementation
    - Predictive mean matching
    - Random forest
- Deterministic methods on EDH data subsets
  - MNAR dates with supervised (restricted) imputation
Imputation and Missing data¶
In statistics, imputation is the process of replacing missing data with plausible estimates, and multiple imputation is the method of choice for complex incomplete data problems.
In the joint modeling approach, imputing multivariate data involves specifying a multivariate distribution for the missing data and then drawing imputations from its conditional distributions by Markov chain Monte Carlo (MCMC) techniques.
Fully conditional specification (FCS), by contrast, imputes the data variable by variable, iterating over the conditional density of each incomplete variable.
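As a sketch of the difference between the two approaches, in notation that is an assumption here rather than part of the original text (\(Y_j\) is the \(j\)-th incomplete variable, \(Y_{-j}\) collects all other variables, and \(\theta\) denotes model parameters), joint modeling draws all missing values from one multivariate distribution,

\[ Y_{mis}^{*} \sim P(Y_{mis} \mid Y_{obs}, \theta), \]

whereas fully conditional specification cycles through the variables \(j = 1, \dots, p\) and draws each one from its own conditional density,

\[ Y_{j}^{*} \sim P(Y_{j}^{mis} \mid Y_{j}^{obs}, Y_{-j}, \theta_{j}). \]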
Missing dating data¶
Missing values in the timespan of existence of historical artefacts are an instance of the temporal uncertainty problem. Temporal uncertainty concerns missing information in the limits of the timespan, which bound the existence of an artefact by a terminus ante quem and a terminus post quem, abbreviated as TAQ and TPQ.
As a case study, the artefacts are inscriptions (epigraphic material) recorded in the EDH dataset whose dating is unknown either in both limits of the timespan, in which case there is no timespan at all, or in just one of TAQ or TPQ.
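The cases described above can be made concrete with a small data structure. The following is a minimal sketch, not the EDH data model: the `Timespan` class, its field names and the example records are hypothetical, and a missing limit is represented by `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Timespan:
    """Dating limits of an artefact; None marks a missing limit."""
    tpq: Optional[int]  # terminus post quem (earliest possible year)
    taq: Optional[int]  # terminus ante quem (latest possible year)

    def missingness(self) -> str:
        """Classify which limits of the timespan are unknown."""
        if self.tpq is None and self.taq is None:
            return "no timespan"
        if self.tpq is None:
            return "TPQ missing"
        if self.taq is None:
            return "TAQ missing"
        return "complete"


# Hypothetical records, not taken from EDH (negative years are BC).
records = {
    "A": Timespan(tpq=71, taq=130),
    "B": Timespan(tpq=None, taq=200),
    "C": Timespan(tpq=-27, taq=None),
    "D": Timespan(tpq=None, taq=None),
}
patterns = {key: span.missingness() for key, span in records.items()}
```

Keeping the two limits as separate fields lets an imputation step treat TAQ and TPQ as two incomplete variables, while records with no timespan at all can be routed to a different procedure.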
Missing data problems¶
Every data point has some likelihood of being missing. Rubin (1976) classified missing data problems into three categories: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).
Let \(Y_{obs}\) be the observed sample data, \(Y_{mis}\) the unobserved sample data, \(\psi\) the parameters of the missing data model, and \(x\) the response indicator, with \(x=0\) marking a missing value. The probability of being missing then depends,
MCAR: only on some parameters \(\psi\)
\(P(x=0 \mid Y_{obs}, Y_{mis}, \psi) = P(x=0 \mid \psi)\)
MAR: on observed information, including any design factors
\(P(x=0 \mid Y_{obs}, Y_{mis}, \psi) = P(x=0 \mid Y_{obs}, \psi)\)
MNAR: also on unobserved information, including \(Y_{mis}\) itself, so the probability does not simplify
\(P(x=0 \mid Y_{obs}, Y_{mis}, \psi)\)
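The three mechanisms can be illustrated with a small simulation. This is a sketch on synthetic data only: the variables `y_obs` and `y`, the missingness probabilities (0.3, 0.5, 0.1) and the sample size are all assumptions chosen for illustration, and `x = 0` marks a missing value as in the formulas above.

```python
import random

random.seed(0)

n = 10_000
# Hypothetical data: y_obs is always observed, y is subject to missingness.
y_obs = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * a + random.gauss(0, 1) for a in y_obs]


def missing_mask(prob_missing):
    """Draw x = 0 (missing) with the given per-record probability, else x = 1."""
    return [0 if random.random() < p else 1 for p in prob_missing]


# MCAR: the same probability of being missing for every record.
x_mcar = missing_mask([0.3] * n)
# MAR: the probability depends only on the fully observed variable.
x_mar = missing_mask([0.5 if a > 0 else 0.1 for a in y_obs])
# MNAR: the probability depends on the very value that goes missing.
x_mnar = missing_mask([0.5 if b > 0 else 0.1 for b in y])


def observed_mean(values, mask):
    """Mean of the values that remain observed (x = 1)."""
    kept = [v for v, x in zip(values, mask) if x == 1]
    return sum(kept) / len(kept)


true_mean = sum(y) / n
means = {
    "MCAR": observed_mean(y, x_mcar),
    "MAR": observed_mean(y, x_mar),
    "MNAR": observed_mean(y, x_mnar),
}
```

Comparing `means` with `true_mean` shows why the mechanism matters: under MCAR the observed cases remain representative, whereas under MNAR large values of `y` are removed more often and the mean of the remaining cases is pulled downwards.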
MICE algorithm¶
mice is an R package that implements Multivariate Imputation by Chained Equations using Fully Conditional Specification.
The MICE algorithm is a Markov chain Monte Carlo (MCMC) method in which the state space is the collection of all imputed values.
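The idea of a chain whose state is the set of imputed values can be sketched in a few lines. The code below is not the `mice` package and does not reproduce its defaults: it is a simplified two-variable illustration in Python that uses stochastic linear regression with point estimates of the parameters, whereas a proper multiple imputation would also draw the regression parameters from their posterior. The helper names `fit_line` and `chained_imputation` and the toy data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual sd)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (n - 2)) ** 0.5


def chained_imputation(data, n_iter=10, rng=random):
    """Return one completed copy of a two-column data set (None = missing).

    Simplification: regression parameters are point estimates, not posterior draws.
    """
    cols = [list(col) for col in data]
    miss = [[v is None for v in col] for col in cols]
    # Initial state: fill each hole with a random draw from the observed values.
    for j, col in enumerate(cols):
        observed = [v for v in col if v is not None]
        cols[j] = [rng.choice(observed) if v is None else v for v in col]
    for _ in range(n_iter):
        for j in (0, 1):  # visit each incomplete variable in turn
            target, predictor = cols[j], cols[1 - j]
            rows = [i for i, m in enumerate(miss[j]) if not m]
            a, b, sd = fit_line([predictor[i] for i in rows],
                                [target[i] for i in rows])
            for i, m in enumerate(miss[j]):
                if m:  # only the originally missing cells are redrawn
                    target[i] = a + b * predictor[i] + rng.gauss(0, sd)
    return cols


rng = random.Random(1)
# Hypothetical toy data: two correlated columns with about 20% holes each.
u = [rng.gauss(0, 1) for _ in range(200)]
v = [2 * a + rng.gauss(0, 0.5) for a in u]
u_mis = [None if rng.random() < 0.2 else a for a in u]
v_mis = [None if rng.random() < 0.2 else b for b in v]
# m = 5 completed data sets, each produced by its own chain.
imputations = [chained_imputation([u_mis, v_mis], rng=rng) for _ in range(5)]
```

Each sweep over the variables replaces the current imputed values with new draws, so the imputed cells play the role of the state of the chain, and the five completed data sets differ only in those cells.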
…
Missing data patterns:
- Monotone: the variables can be arranged in increasing order of the number of missing data
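Whether a data set has a monotone pattern can be checked mechanically. The sketch below assumes that missing values are coded as `None` and uses a hypothetical helper `is_monotone`; it orders the columns by their number of missing values and verifies that, within each record, no observed value follows a missing one.

```python
def is_monotone(rows):
    """True if the columns can be ordered so that missing values (None) nest."""
    n_cols = len(rows[0])
    n_missing = [sum(row[j] is None for row in rows) for j in range(n_cols)]
    # Order columns from the fewest to the most missing values.
    order = sorted(range(n_cols), key=lambda j: n_missing[j])
    for row in rows:
        seen_missing = False
        for j in order:
            if row[j] is None:
                seen_missing = True
            elif seen_missing:
                # An observed value after a missing one breaks the nesting.
                return False
    return True


# Hypothetical toy patterns.
monotone_rows = [(1, 2, 3), (1, 2, None), (1, None, None)]
general_rows = [(1, None, 3), (None, 2, 3)]
monotone_result = is_monotone(monotone_rows)
general_result = is_monotone(general_rows)
```

Sorting by the number of missing values is enough here because nested sets of missing records necessarily have non-decreasing sizes, so if this ordering fails no other ordering can succeed.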
Restricted imputation on dates¶
One strategy for dealing with the temporal uncertainty of inscriptions that lack both limits, TAQ and TPQ, is to classify each inscription into the chronological period to which it most probably belongs.
The classification uses the available characteristics of other inscriptions already assigned to a chronological phase as clues for estimating these probabilities for the records with temporal uncertainty.
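One possible way to turn such characteristics into period probabilities is a naive Bayes classifier. This is an illustrative choice and not necessarily the method used for the EDH data; the attribute names, the phase labels, the records and the smoothing constant `n_values` (an assumed number of distinct values per attribute) are all hypothetical.

```python
from collections import Counter, defaultdict


def train(records):
    """Count attribute values per period from dated records."""
    period_counts = Counter()
    value_counts = defaultdict(Counter)  # (period, attribute) -> value counts
    for attrs, period in records:
        period_counts[period] += 1
        for name, value in attrs.items():
            value_counts[(period, name)][value] += 1
    return period_counts, value_counts


def period_probabilities(attrs, period_counts, value_counts, n_values=10):
    """Naive Bayes posterior over periods with add-one smoothing."""
    total = sum(period_counts.values())
    scores = {}
    for period, count in period_counts.items():
        score = count / total  # prior share of the period
        for name, value in attrs.items():
            seen = value_counts[(period, name)]
            score *= (seen[value] + 1) / (count + n_values)
        scores[period] = score
    norm = sum(scores.values())
    return {p: s / norm for p, s in scores.items()}


# Hypothetical dated inscriptions: (attributes, assigned phase).
dated = [
    ({"type": "epitaph", "material": "limestone"}, "phase 1"),
    ({"type": "epitaph", "material": "limestone"}, "phase 1"),
    ({"type": "votive", "material": "limestone"}, "phase 1"),
    ({"type": "votive", "material": "marble"}, "phase 2"),
    ({"type": "honorific", "material": "marble"}, "phase 2"),
    ({"type": "honorific", "material": "marble"}, "phase 2"),
]
# An inscription with neither TAQ nor TPQ.
undated = {"type": "epitaph", "material": "limestone"}
probs = period_probabilities(undated, *train(dated))
best = max(probs, key=probs.get)
```

The record is assigned to `best`, the phase with the highest probability, while the full dictionary `probs` keeps the uncertainty of that assignment available for later steps.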