Title: | Multivariate Multinomial Distribution Approximation and Imputation for Incomplete Categorical Data |
---|---|
Description: | A method for imputing missing values in categorical data. For details, see the paper <doi:10.4310/SII.2020.v13.n1.a2>. |
Authors: | Chaojie Wang |
Maintainer: | Chaojie Wang <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.0.0 |
Built: | 2024-10-31 21:07:10 UTC |
Source: | https://github.com/cran/MMDai |
This function generates random datasets following a mixture of product multinomial distributions.
GenerateData(n, p, d, k = 3, theta = rdirichlet(1, rep(10, k)), psi = InitialPsi(p, d, k))
Argument | Description |
---|---|
n | number of samples |
p | number of variables |
d | a vector giving the number of categories for each variable; the counts may differ across variables |
k | number of latent classes |
theta | probability of each latent class |
psi | probability of each category within each latent class |
data - generated random dataset, a matrix with n rows and p columns.
    # dimension parameters
    n <- 200; p <- 5; d <- rep(2, p)
    # generate complete data
    Complete <- GenerateData(n, p, d, k = 3)
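To make the sampling scheme concrete, here is a minimal base-R sketch of drawing from a mixture of product multinomials. It is not the package's implementation: `simulate_mixture` is a hypothetical helper, categories are assumed to be coded `1..d[j]`, and `psi` is assumed to be a list of `p` matrices with `k` rows and `d[j]` columns whose rows sum to 1.

```r
# Hypothetical helper, not part of MMDai.
# Assumes psi[[j]] is a k x d[j] matrix whose rows are category probabilities.
simulate_mixture <- function(n, d, theta, psi) {
  p <- length(d)
  k <- length(theta)
  # draw a latent class for every sample according to theta
  z <- sample.int(k, size = n, replace = TRUE, prob = theta)
  out <- matrix(NA_integer_, nrow = n, ncol = p)
  for (i in seq_len(n)) {
    for (j in seq_len(p)) {
      # given the class, variables are drawn independently
      out[i, j] <- sample.int(d[j], size = 1, prob = psi[[j]][z[i], ])
    }
  }
  out
}

set.seed(1)
d <- c(2, 3, 2)
theta <- c(0.5, 0.3, 0.2)
psi <- lapply(d, function(dj) matrix(1 / dj, nrow = length(theta), ncol = dj))
sim <- simulate_mixture(n = 50, d = d, theta = theta, psi = psi)
dim(sim)
```

The double loop favours readability over speed; for large `n` one would likely vectorise over samples within each class.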
This function is used to perform multiple imputation for missing data given the joint distribution.
Imputation(data, theta, psi)
Argument | Description |
---|---|
data | incomplete dataset |
theta | vector of probabilities for each component |
psi | category probabilities for each variable in each component |
ImputedData - the dataset with its missing values imputed.
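One plausible way to impute from a known joint distribution is sketched below for a single row: weight each component by how well it explains the observed entries, draw a component, then draw the missing entries from that component. This is an illustration under assumptions, not the code behind `Imputation`: `impute_row` is a hypothetical helper, categories are assumed to be coded `1..d[j]`, and `psi[[j]]` is assumed to be a `k x d[j]` matrix.

```r
# Hypothetical helper, not MMDai's Imputation(): fills the NAs in one row x.
impute_row <- function(x, theta, psi) {
  k <- length(theta)
  # posterior weight of each component given the observed entries
  w <- theta
  for (j in which(!is.na(x))) {
    w <- w * psi[[j]][, x[j]]
  }
  w <- w / sum(w)
  # draw one component, then draw each missing entry from it
  z <- sample.int(k, size = 1, prob = w)
  for (j in which(is.na(x))) {
    x[j] <- sample.int(ncol(psi[[j]]), size = 1, prob = psi[[j]][z, ])
  }
  x
}

set.seed(2)
theta <- c(0.6, 0.4)
psi <- list(
  matrix(c(0.9, 0.1, 0.2, 0.8), nrow = 2, byrow = TRUE),
  matrix(c(0.7, 0.3, 0.1, 0.9), nrow = 2, byrow = TRUE)
)
x <- c(1L, NA)
filled <- impute_row(x, theta, psi)
```

Repeating the draw several times for the same row yields multiple imputations rather than a single filled-in value.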
This function creates a psi list in which each component has equal weight.
InitialPsi(p, d, k)
Argument | Description |
---|---|
p | number of variables |
d | a vector giving the number of categories for each variable; the counts may differ across variables |
k | number of components |
psi - a list in which each component has equal weight
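As an illustration of what such an equal-weight list might look like, the sketch below builds one matrix per variable and gives every category probability `1 / d[j]` in every component. The layout (a list of `p` matrices with `k` rows and `d[j]` columns) and the helper name `equal_psi` are assumptions for this example, not a statement about the package's internals.

```r
# Hypothetical helper, not MMDai's InitialPsi().
# Returns a list of p matrices; psi[[j]] has k rows and d[j] columns,
# with every entry equal to 1 / d[j].
equal_psi <- function(p, d, k) {
  lapply(seq_len(p), function(j) matrix(1 / d[j], nrow = k, ncol = d[j]))
}

psi0 <- equal_psi(p = 3, d = c(2, 3, 4), k = 2)
sapply(psi0, dim)    # one column of (rows, cols) per variable
rowSums(psi0[[2]])   # each row is a probability vector
```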
This function is used to find a suitable number of components k.
kIdentifier(data, d, TT = 1000, alpha = 0.25)
Argument | Description |
---|---|
data | data in matrix format with n rows and p columns |
d | number of categories for each variable |
TT | number of iterations in the Gibbs sampler; the default is 1000. TT should be an even number for 'burn-in'. |
alpha | hyperparameter that can be regarded as the pseudo-count of the number of samples in the new component |
k_est - posterior estimate of k
k_track - trace of k over the iterations
    # dimension parameters
    n <- 200; p <- 5; d <- rep(2, p)
    # generate complete data
    Complete <- GenerateData(n, p, d, k = 3)
    # mask 20% of the cells completely at random (MCAR)
    Incomplete <- Complete
    # 1:(n * p) indexes every cell; 1:n*p would give only multiples of p
    Incomplete[sample(1:(n * p), 0.2 * n * p, replace = FALSE)] <- NA
    # identify k
    K <- kIdentifier(data = Incomplete, d, TT = 10)
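The masking step depends on index arithmetic where operator precedence matters: in R, `:` binds more tightly than `*`, so `1:n*p` means `(1:n) * p`. The base-R sketch below contrasts the two index sets and masks a placeholder matrix; the matrix `X` and the variable names are illustrative only.

```r
n <- 200; p <- 5

idx_multiples <- 1:n * p     # (1:n) * p: p, 2p, ..., n*p (length n)
idx_all_cells <- 1:(n * p)   # every cell of an n x p matrix (length n * p)
length(idx_multiples)
length(idx_all_cells)

# mask 20% of the cells completely at random
set.seed(3)
X <- matrix(1L, nrow = n, ncol = p)
n_mask <- round(0.2 * n * p)
X[sample(idx_all_cells, n_mask, replace = FALSE)] <- NA
mean(is.na(X))               # fraction of masked cells
```

Sampling from `idx_multiples` with the same mask size would select every multiple of `p`, so the masked cells would be fixed rather than random.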
This is a real application dataset. The original data come from the ratings dataset described in Harper and Konstan (2016) <DOI:10.1145/2827872>. The dataset is used to evaluate the performance of the package in real applications.
Chaojie Wang
This function is used to estimate theta and psi in the multinomial mixture model, given the number of components k.
ParEst(data, d, k, TT = 1000)
Argument | Description |
---|---|
data | data in matrix format with n rows and p columns |
d | number of categories for each variable |
k | number of components |
TT | number of iterations in the Gibbs sampler; the default is 1000. TT should be an even number for 'burn-in'. |
theta - vector of probability for each component
psi - specific probability for each variable in each component
    # dimension parameters
    n <- 200; p <- 5; d <- rep(2, p)
    # generate complete data
    Complete <- GenerateData(n, p, d, k = 3)
    # mask 20% of the cells completely at random (MCAR)
    Incomplete <- Complete
    # 1:(n * p) indexes every cell; 1:n*p would give only multiples of p
    Incomplete[sample(1:(n * p), 0.2 * n * p, replace = FALSE)] <- NA
    # identify k
    K <- kIdentifier(data = Incomplete, d, TT = 10)
    # estimate theta and psi for the identified k
    Par <- ParEst(data = Incomplete, d, k = K$k_est, TT = 10)
This function generates random samples from a Dirichlet distribution.
rdirichlet(n = 1, alpha = c(1, 1))
Argument | Description |
---|---|
n | sample size |
alpha | parameters of the Dirichlet distribution |
out - generated data
    # draw 10 samples from a Dirichlet distribution with alpha = (1, 1, 1)
    rdirichlet(n = 10, alpha = c(1, 1, 1))
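For readers curious how Dirichlet draws can be produced with base R alone, a standard construction is to draw independent Gamma variates with shapes `alpha` and normalise each row to sum to 1. The sketch below uses that construction; `rdirichlet_sketch` is a hypothetical name, and nothing here claims to match the package's own `rdirichlet` line for line.

```r
# Hypothetical helper using the Gamma-normalisation construction.
rdirichlet_sketch <- function(n = 1, alpha = c(1, 1)) {
  m <- length(alpha)
  # column j holds n Gamma(shape = alpha[j], rate = 1) draws
  g <- matrix(rgamma(n * m, shape = rep(alpha, each = n)), nrow = n, ncol = m)
  # divide every row by its sum so each row lies on the simplex
  g / rowSums(g)
}

set.seed(4)
s <- rdirichlet_sketch(n = 10, alpha = c(1, 1, 1))
dim(s)
range(rowSums(s))
```

Each row of the result is one draw, so `n = 10` with three `alpha` values gives a 10 x 3 matrix of probability vectors.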