Department of Mathematics, Bartin University, Bartin, 74100, TR
Abstract
In this paper, a new regression model for a count response variable is proposed via a re-parametrization of the Poisson quasi-Lindley distribution. Maximum likelihood and method of moments estimation are considered for the unknown parameters of the re-parametrized Poisson quasi-Lindley distribution. A simulation study is conducted to evaluate the efficiency of the estimation methods. A real data set is analyzed to demonstrate the usefulness of the proposed model against well-known regression models for count data, such as the Poisson and negative-binomial regression models. Empirical results show that when the response variable is over-dispersed, the proposed model provides better results than the competing models.
Introduction
Interest in count data modeling has increased greatly in the last decade. The most widely used distribution for modeling count data is the Poisson distribution. A well-known property of the Poisson distribution is that its mean and variance are equal. Therefore, the Poisson distribution does not work in the case of over-dispersion or under-dispersion. In spite of this weakness, the Poisson distribution is widely used in many research fields, such as actuarial, environmental and economic sciences, because of its simple form, easy implementation and software support. To remove this drawback, researchers have shown great interest in introducing mixed-Poisson distributions for modeling over-dispersed or under-dispersed count data; see Bhati et al. [1], Imoto et al. [7], Mahmoudi and Zakerzadeh [9], Gencturk and Yigiter [5], Wongrin and Bodhisuwan [15], Déniz [3], Cheng et al. [2], Lord and Geedipally [8], Zamani et al. [16], Sáez-Castillo and Conde-Sánchez [12], Rodríguez-Avi et al. [10], Shmueli et al. [11] and Shoukri et al. [13].
As mentioned above, the Poisson distribution is insufficient for modeling over-dispersed count data. The main motivation of this study is to introduce an alternative regression model for over-dispersed count data. To this end, a re-parametrization of the Poisson quasi-Lindley (PQL) distribution, proposed by Grine and Zeghdoudi [6], is introduced, and its statistical properties, such as the mean, the variance and the estimation of the model parameters, are studied. The maximum likelihood (ML) and method of moments (MM) estimation methods are considered to estimate the unknown parameters of the re-parametrized PQL distribution. The efficiencies of the estimation methods are compared in an extensive simulation study. Using the re-parametrized PQL distribution, a new regression model for over-dispersed count data is introduced. To demonstrate the effectiveness of the proposed regression model, a real data set on the days of absence of high school students is analyzed with the Poisson, negative-binomial and PQL regression models.
The rest of the paper is organized as follows. In the “Re-parametrization of Poisson quasi-Lindley distribution” section, the statistical properties of the re-parametrized Poisson quasi-Lindley distribution are obtained. In the “Estimation” section, the ML and MM estimation methods are considered for the unknown model parameters. In the “Simulation” section, the finite-sample performance of the estimation methods is compared via a Monte Carlo simulation study. In the “Poisson quasi-Lindley regression model” section, a new regression model is introduced. In the “Empirical study” section, a real data set is analyzed to demonstrate the usefulness of the proposed model against the Poisson and negative-binomial regression models. The “Conclusion” section contains the concluding remarks.
Re-parametrization of Poisson quasi-Lindley distribution
Let the random variable $X$ follow a Poisson distribution. Its probability mass function (pmf) is
$$P(x;\lambda)=\frac{\exp(-\lambda)\lambda^{x}}{x!},\quad x=0,1,2,\ldots,$$
where $\lambda>0$. The mean and variance of the Poisson distribution are $E(X)=\lambda$ and $\mathrm{Var}(X)=\lambda$, respectively. So, the dispersion index (DI) of the Poisson distribution is $DI=\mathrm{Var}(X)/E(X)=\lambda/\lambda=1$. Because the dispersion index is always one, over-dispersed or under-dispersed data sets cannot be modeled by the Poisson distribution. Over-dispersion occurs when the variance is greater than the mean; under-dispersion occurs when the variance is smaller than the mean. Grine and Zeghdoudi [6] introduced a new mixed-Poisson distribution, called Poisson quasi-Lindley (PQL), by compounding the Poisson distribution with the quasi-Lindley distribution introduced by Shanker and Mishra [14]. The pmf of the PQL distribution is given by
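$$P(Y=y)=\frac{\theta}{\alpha+1}\,\frac{\alpha(\theta+1)+\theta(y+1)}{(\theta+1)^{y+2}},\quad y=0,1,2,\ldots, \tag{1}$$
where $\theta>0$ and $\alpha>-1$. Hereafter, the random variable $Y$ will be denoted as $PQL(\theta,\alpha)$. The cumulative distribution function (cdf) corresponding to (1) is
$$F(y)=P(Y\le y)=1-\frac{\alpha+2\theta+\alpha\theta+\theta y+1}{(1+\alpha)(1+\theta)^{y+2}}.$$
The mean and variance of the PQL distribution are given by, respectively,
$$E(Y)=\frac{2+\alpha}{(1+\alpha)\theta},\qquad \mathrm{Var}(Y)=\frac{2+4\alpha+\alpha^{2}+\theta(\alpha+2)(\alpha+1)}{(\alpha+1)^{2}\theta^{2}}.$$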
Here, the re-parametrization of PQL distribution is considered. The motivation of re-parametrization for PQL distribution comes from the generalized linear model approach.
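Proposition 1
Let $\theta=(2+\alpha)/[(1+\alpha)\mu]$. Then the pmf of the PQL distribution is
$$P(Y=y)=\frac{2+\alpha}{(\alpha+1)^{2}\mu}\,\frac{\alpha+(2+\alpha)(\mu+\alpha\mu)^{-1}(y+\alpha+1)}{\left[1+(2+\alpha)(\mu+\alpha\mu)^{-1}\right]^{y+2}},\quad y=0,1,2,\ldots, \tag{5}$$
where $\alpha>0$ and $\mu>0$.
The mean and variance of (5) are given by, respectively,
$$E(Y)=\mu,\qquad \mathrm{Var}(Y)=\underbrace{\mu}_{\mathrm{I}}+\underbrace{\frac{\mu^{2}}{(2+\alpha)^{2}}\left(2+4\alpha+\alpha^{2}\right)}_{\mathrm{II}}. \tag{6}$$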
Note that the parameter $\alpha$ should be greater than zero to ensure a positive variance. Other statistical properties of the PQL distribution under the above re-parametrization, such as the probability and moment generating functions, the mode and the cdf, can be obtained following the results in Grine and Zeghdoudi [6]. As seen from (6), since the second part of the variance is greater than zero for all values of the parameters $\alpha$ and $\mu$, the variance of the PQL distribution is always greater than its mean. Therefore, the PQL distribution can be a good choice for modeling over-dispersed data sets.
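The moment results above can be checked numerically. The following is a minimal Python sketch (not the authors' code; the function names `pql_pmf` and `pql_moments` are hypothetical) that evaluates the re-parametrized pmf and compares the mean and variance obtained by direct summation with $\mu$ and the closed-form variance. The truncation rule for the infinite sum is an assumption made for illustration.

```python
import math

def pql_pmf(y, alpha, mu):
    """pmf of the re-parametrized PQL distribution at an integer y >= 0."""
    theta = (2.0 + alpha) / ((1.0 + alpha) * mu)  # substitution from Proposition 1
    num = alpha + theta * (y + alpha + 1.0)
    return theta / (alpha + 1.0) * num / (1.0 + theta) ** (y + 2)

def pql_moments(alpha, mu, tol=1e-14, max_terms=100000):
    """Total mass, mean and variance by summing the pmf until terms are negligible."""
    total = m1 = m2 = 0.0
    for y in range(max_terms):
        p = pql_pmf(y, alpha, mu)
        total += p
        m1 += y * p
        m2 += y * y * p
        if y > mu and y * y * p < tol:  # assumed stopping rule: the tail decays geometrically
            break
    return total, m1, m2 - m1 * m1

alpha, mu = 0.5, 1.5
total, mean, var = pql_moments(alpha, mu)
# Closed-form variance: part I (mu) plus part II (the over-dispersion term)
var_formula = mu + mu**2 * (2 + 4 * alpha + alpha**2) / (2 + alpha) ** 2
print(total, mean, var, var_formula)
```

Under these settings the summed variance should agree with the closed form and exceed the mean, which is the over-dispersion property discussed above.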
Figure 1 displays the dispersion index and possible pmf shapes of the PQL distribution. When the parameters $\alpha$ and $\mu$ increase, the dispersion of the PQL distribution increases. Note that the effect of the parameter $\mu$ on the dispersion is greater than that of the parameter $\alpha$. As seen from the left side of Fig. 1, the PQL distribution can be a good choice for modeling extremely right-skewed data sets.
Fig. 1
The dispersion index (right) and the pmf shapes of the PQL distribution (left) for some values of $\alpha$ and $\mu$
Generating random variables from PQL distribution
Here, a general algorithm and the corresponding R code are given for generating random variables from the PQL distribution. The same approach can be used for any discrete distribution, such as the Poisson, Poisson–Lindley and negative-binomial distributions.
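The R listing is not reproduced in this text. As a stand-in, the following Python sketch uses inverse-transform sampling, one generic way to draw from any discrete distribution with a computable cdf; whether it coincides with the authors' algorithm is an assumption. It relies on the closed-form cdf given earlier, with $\theta$ replaced by $(2+\alpha)/[(1+\alpha)\mu]$. The names `pql_sf` and `rpql` are hypothetical.

```python
import math
import random

def pql_sf(y, alpha, mu):
    """P(Y > y) = 1 - F(y) for the re-parametrized PQL distribution, integer y >= 0."""
    theta = (2.0 + alpha) / ((1.0 + alpha) * mu)
    num = alpha + 2.0 * theta + alpha * theta + theta * y + 1.0
    # exp/log1p form avoids overflow of (1 + theta)**(y + 2) for large y
    return num / (1.0 + alpha) * math.exp(-(y + 2) * math.log1p(theta))

def rpql(n, alpha, mu, rng=None):
    """Draw n PQL variates by inversion: the smallest y with F(y) >= u."""
    rng = rng or random.Random()
    draws = []
    for _ in range(n):
        u = rng.random()
        y = 0
        # F(y) >= u is equivalent to P(Y > y) <= 1 - u
        while pql_sf(y, alpha, mu) > 1.0 - u:
            y += 1
        draws.append(y)
    return draws

sample = rpql(10, alpha=0.5, mu=1.5, rng=random.Random(123))
print(sample)
```

Only the survival function needs to be swapped to reuse the same loop for another discrete distribution.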
Estimation
In this section, ML and MM estimation methods are considered to estimate the unknown parameters of PQL distribution.
Maximum likelihood estimation
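Let $Y_1,Y_2,\ldots,Y_n$ be independent and identically distributed PQL random variables with observed values $y_1,y_2,\ldots,y_n$. The log-likelihood function is
$$\ell(\alpha,\mu)=n\ln\left[\frac{2+\alpha}{(\alpha+1)^{2}\mu}\right]+\sum_{i=1}^{n}\ln\left[\alpha+(2+\alpha)(\mu+\alpha\mu)^{-1}(y_i+\alpha+1)\right]-\ln\left[1+(2+\alpha)(\mu+\alpha\mu)^{-1}\right]\sum_{i=1}^{n}(y_i+2). \tag{7}$$
Taking partial derivatives of (7) with respect to $\alpha$ and $\mu$, we have
$$\frac{\partial\ell}{\partial\alpha}=n\left(\frac{1}{\alpha+2}-\frac{2}{\alpha+1}\right)-\left[\frac{\alpha+2}{\alpha\mu+\mu}+1\right]^{-1}\left[\frac{1}{\alpha\mu+\mu}-\frac{(\alpha+2)\mu}{(\alpha\mu+\mu)^{2}}\right]\sum_{i=1}^{n}(y_i+2)+\sum_{i=1}^{n}\frac{\dfrac{\alpha+2}{\alpha\mu+\mu}-\dfrac{(\alpha+2)\mu(y_i+\alpha+1)}{(\alpha\mu+\mu)^{2}}+\dfrac{y_i+\alpha+1}{\alpha\mu+\mu}+1}{\alpha+\dfrac{(\alpha+2)(y_i+\alpha+1)}{\alpha\mu+\mu}}, \tag{8}$$
$$\frac{\partial\ell}{\partial\mu}=\sum_{i=1}^{n}\frac{(\alpha+1)(\alpha+2)(y_i+2)}{(\alpha\mu+\mu)^{2}\left[\dfrac{\alpha+2}{\alpha\mu+\mu}+1\right]}-\sum_{i=1}^{n}\frac{(\alpha+1)(\alpha+2)(y_i+\alpha+1)}{(\alpha\mu+\mu)^{2}\left[\dfrac{(\alpha+2)(y_i+\alpha+1)}{\alpha\mu+\mu}+\alpha\right]}-\frac{n}{\mu}. \tag{9}$$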
The ML estimates of $(\alpha,\mu)$ can be obtained by solving (8) and (9) simultaneously. Explicit forms of the ML estimates of the PQL distribution are not available since the likelihood equations contain nonlinear functions. For this reason, nonlinear minimization tools are needed to solve these equations. The nonlinear minimization function (nlm) of R is used for this purpose. The corresponding interval estimates of the parameters are obtained by means of the observed information matrix, which is given by
$$I_F(\tau)=-\begin{pmatrix}I_{\alpha\alpha}&I_{\alpha\mu}\\ I_{\mu\alpha}&I_{\mu\mu}\end{pmatrix},$$
where $\tau=(\alpha,\mu)^{T}$. The elements of the observed information matrix are available upon request from the authors. It is well known that, under the regularity conditions that are fulfilled for the parameters, the asymptotic joint distribution of $(\hat\alpha,\hat\mu)$ as $n\to\infty$ is a bivariate normal distribution with mean $(\alpha,\mu)$ and variance–covariance matrix $I_F^{-1}(\tau)$. Using the asymptotic normality, the asymptotic $100(1-p)\%$ confidence intervals for the parameters $\alpha$ and $\mu$, respectively, are given by
$$\hat\alpha\pm z_{p/2}\sqrt{\widehat{\mathrm{Var}}(\hat\alpha)},\qquad \hat\mu\pm z_{p/2}\sqrt{\widehat{\mathrm{Var}}(\hat\mu)},$$
where $z_{p/2}$ is the upper $p/2$ quantile of the standard normal distribution.
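To make the numerical maximization concrete, the sketch below minimizes the negative log-likelihood in Python. It does not use R's nlm: as a dependency-free substitute it applies a simple compass (pattern) search on $(\ln\alpha,\ln\mu)$, which keeps both parameters positive. The function names, starting values, bounds and iteration cap are all assumptions chosen for illustration, and the small data vector is made up.

```python
import math

def pql_negloglik(alpha, mu, ys):
    """Negative log-likelihood of the re-parametrized PQL model for counts ys."""
    theta = (2.0 + alpha) / ((1.0 + alpha) * mu)
    ll = len(ys) * math.log(theta / (alpha + 1.0))
    ll += sum(math.log(alpha + theta * (y + alpha + 1.0)) for y in ys)
    ll -= math.log1p(theta) * sum(y + 2 for y in ys)
    return -ll

def fit_pql(ys, alpha0=1.0, step=0.5, tol=1e-8, max_polls=5000):
    """Compass search over (log alpha, log mu); returns (alpha_hat, mu_hat, min nll)."""
    x = [math.log(alpha0), math.log(sum(ys) / len(ys))]  # start mu at the sample mean

    def f(p):
        return pql_negloglik(math.exp(p[0]), math.exp(p[1]), ys)

    best = f(x)
    for _ in range(max_polls):
        if step <= tol:
            break
        improved = False
        for i in (0, 1):
            for d in (step, -step):
                cand = list(x)
                cand[i] += d
                if abs(cand[i]) > 20.0:  # assumed bound: keeps exp() well within range
                    continue
                val = f(cand)
                if val < best:
                    x, best, improved = cand, val, True
        if not improved:
            step /= 2.0  # shrink the poll step when no neighbor improves
    return math.exp(x[0]), math.exp(x[1]), best

# Hypothetical over-dispersed counts, for illustration only
ys = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 7, 9, 12]
alpha_hat, mu_hat, nll = fit_pql(ys)
print(alpha_hat, mu_hat, nll)
```

A derivative-based optimizer would normally converge faster; the compass search is used here only because it is short and easy to verify.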
Method of moments
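The MM estimators of the parameters $\alpha$ and $\mu$ can be obtained by equating the mean and variance of the PQL distribution to the sample mean and variance:
$$\bar{y}=\mu, \tag{10}$$
$$s^{2}=\mu+\frac{\mu^{2}}{(2+\alpha)^{2}}\left(2+4\alpha+\alpha^{2}\right), \tag{11}$$
where $\bar{y}$ and $s^{2}$ are the sample mean and variance, respectively. Solving (10) and (11) simultaneously, we have
$$\hat\mu_{MM}=\bar{y},\qquad \hat\alpha_{MM}=\sqrt{\frac{2\bar{y}^{2}}{\bar{y}^{2}+\bar{y}-s^{2}}}-2.$$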
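Theorem 1
For fixed values of $\mu$, the MM estimator $\hat\alpha_{MM}$ of $\alpha$ is consistent and asymptotically normally distributed:
$$\sqrt{n}\left(\bar{Y}-\mu\right)\rightarrow N\left(0,\upsilon^{2}(\theta)\right),$$
where
$$\upsilon^{2}(\theta)=\frac{2+4\alpha+\alpha^{2}+\theta(\alpha+2)(\alpha+1)}{(\alpha+1)^{2}\theta^{2}}.$$
Detailed information about the asymptotic properties of MM estimators can be found in Farbod and Arzideh [4].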
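The closed-form MM estimators are easy to compute directly. The following Python sketch (the function name `pql_mm` is hypothetical) assumes the unbiased sample variance with divisor $n-1$, which the text does not specify, and returns no estimate of $\alpha$ when the quantity under the square root is not positive.

```python
import math

def pql_mm(ys):
    """Method-of-moments estimates (mu_hat, alpha_hat) for the re-parametrized PQL model."""
    n = len(ys)
    ybar = sum(ys) / n
    s2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)  # assumption: divisor n - 1
    denom = ybar**2 + ybar - s2
    if denom <= 0:
        return ybar, None  # the moment equation for alpha has no real solution
    alpha_hat = math.sqrt(2.0 * ybar**2 / denom) - 2.0
    # alpha_hat can be <= 0 when the sample is only mildly over-dispersed,
    # which falls outside the parameter space alpha > 0.
    return ybar, alpha_hat

# Hypothetical over-dispersed counts, for illustration only
ys = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 7, 9, 12]
mu_hat, alpha_hat = pql_mm(ys)
print(mu_hat, alpha_hat)
```

Plugging the two estimates back into the variance formula should return the sample variance, which is a quick way to check an implementation.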
Simulation
In this section, a Monte Carlo simulation study is conducted to evaluate the finite-sample performance of the ML and MM estimates of the PQL distribution. The following simulation procedure is used.
1. Set the sample size $n$ and the vector of parameters $\theta=(\alpha,\mu)^{T}$;
2. Generate $n$ random observations from the $PQL(\alpha,\mu)$ distribution, using the algorithm given in the section on generating random variables;
3. Use the random observations generated in Step 2 to estimate $\theta$ by means of the ML and MM estimation methods;
4. Repeat Steps 2 and 3 $N$ times;
5. Use $\hat\theta$ and $\theta$ to calculate the biases, mean relative estimates (MREs) and mean square errors (MSEs) from the following equations:
$$\mathrm{Bias}=\sum_{j=1}^{N}\frac{\hat\theta_{i,j}-\theta_i}{N},\qquad \mathrm{MRE}=\sum_{j=1}^{N}\frac{\hat\theta_{i,j}/\theta_i}{N},\qquad \mathrm{MSE}=\sum_{j=1}^{N}\frac{\left(\hat\theta_{i,j}-\theta_i\right)^{2}}{N},\qquad i=1,2.$$
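The five steps above can be sketched in Python as follows. This is an illustrative outline rather than the authors' R code: it evaluates only the MM estimator of $\mu$ (the sample mean), uses a smaller number of replications than the study, and draws samples by inversion of the closed-form cdf, which is an assumed sampling method. All function names are hypothetical.

```python
import math
import random

def pql_sf(y, alpha, mu):
    """P(Y > y) for the re-parametrized PQL distribution."""
    theta = (2.0 + alpha) / ((1.0 + alpha) * mu)
    num = alpha + 2.0 * theta + alpha * theta + theta * y + 1.0
    return num / (1.0 + alpha) * math.exp(-(y + 2) * math.log1p(theta))

def rpql(n, alpha, mu, rng):
    """Step 2: draw n observations by inverse-transform sampling."""
    draws = []
    for _ in range(n):
        u, y = rng.random(), 0
        while pql_sf(y, alpha, mu) > 1.0 - u:
            y += 1
        draws.append(y)
    return draws

def summarize(estimates, true_value):
    """Step 5: bias, mean relative estimate and mean square error."""
    N = len(estimates)
    bias = sum(e - true_value for e in estimates) / N
    mre = sum(e / true_value for e in estimates) / N
    mse = sum((e - true_value) ** 2 for e in estimates) / N
    return bias, mre, mse

rng = random.Random(2024)            # fixed seed so the run is repeatable
alpha, mu, n, N = 0.5, 1.5, 100, 500  # Step 1 (N reduced from 10,000 for speed)
mu_hats = []
for _ in range(N):                    # Step 4: repeat Steps 2 and 3
    sample = rpql(n, alpha, mu, rng)
    mu_hats.append(sum(sample) / n)   # Step 3: MM estimate of mu
bias, mre, mse = summarize(mu_hats, mu)
print(bias, mre, mse)
```

The same `summarize` helper can be applied to ML estimates or to estimates of $\alpha$ once an estimation routine is plugged into Step 3.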
Figure 2 displays the results of the simulation study performed under the above procedure. The following settings are considered: $\theta=(\alpha=0.5,\mu=1.5)^{T}$, $N=10{,}000$ and $n=40,45,50,\ldots,500$. When $n$ is sufficiently large, the MREs should be close to one and the MSEs and biases should be close to zero. As seen from Fig. 2, as the sample size $n$ increases, the MSEs and biases approach zero and the MREs approach one for both estimation methods. The MM and ML estimation methods yield similar results for the parameter $\mu$ in terms of estimated MSE, bias and MRE. However, the ML estimation method provides more satisfactory results for the parameter $\alpha$, especially for small sample sizes. Therefore, we suggest using the ML estimation method when the sample size is small.
Fig. 2
Estimated biases, MSEs and MREs of the parameters of PQL distribution based on the ML and MM estimation methods
Poisson quasi-Lindley regression model
The Poisson and negative-binomial are the two commonly used regression models for count data modeling. When the response variable is not equi-dispersed, the negative-binomial regression model is preferable. Here, an alternative regression model is introduced for over-dispersed response variable.
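Let the random variable $Y$ follow the PQL distribution given in (5). The mean of $Y$ is $E(Y\mid\alpha,\mu)=\mu$. Therefore, the covariates can be linked to the mean of the response variable, $y$, by means of the log-link function, given by
$$\mu_i=\exp\left(x_i^{T}\beta\right),\quad i=1,\ldots,n, \tag{16}$$
where $x_i^{T}=(1,x_{i1},x_{i2},\ldots,x_{ik})$ is the vector of covariates, including a leading one for the intercept, and $\beta=(\beta_0,\beta_1,\beta_2,\ldots,\beta_k)^{T}$ is the unknown vector of regression coefficients. Inserting (16) in (5), the log-likelihood function can be obtained as follows
$$\ell(\tau)=n\ln(2+\alpha)-\sum_{i=1}^{n}\ln\left[(\alpha+1)^{2}\exp\left(x_i^{T}\beta\right)\right]+\sum_{i=1}^{n}\ln\left[\alpha+(2+\alpha)\left(\exp\left(x_i^{T}\beta\right)+\alpha\exp\left(x_i^{T}\beta\right)\right)^{-1}(y_i+\alpha+1)\right]-\sum_{i=1}^{n}(y_i+2)\ln\left[1+(2+\alpha)\left(\exp\left(x_i^{T}\beta\right)+\alpha\exp\left(x_i^{T}\beta\right)\right)^{-1}\right], \tag{17}$$
where $\tau=(\alpha,\beta^{T})^{T}$. The unknown parameters, $\alpha$ and $\beta=(\beta_0,\beta_1,\beta_2,\ldots,\beta_k)^{T}$, are obtained by maximizing the log-likelihood (17) with the nlm function of R. Under standard regularity conditions, the asymptotic distribution of $(\hat\tau-\tau)$ is multivariate normal $N_{k+2}(0,J(\tau)^{-1})$, where $J(\tau)$ is the expected information matrix. The asymptotic covariance matrix $J(\tau)^{-1}$ of $\hat\tau$ can be approximated by the inverse of the $(k+2)\times(k+2)$ observed information matrix $I(\tau)$, whose elements can be evaluated numerically in most statistical packages. The approximate multivariate normal distribution $N_{k+2}(0,I(\tau)^{-1})$ for $\hat\tau-\tau$ can be used to construct asymptotic confidence intervals for the vector of parameters $\tau$.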
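The regression log-likelihood can be written compactly in code. The Python sketch below (the function name `pql_reg_loglik` is hypothetical) evaluates it for a given $\alpha$ and $\beta$ under the log link, assuming each covariate row starts with a 1 for the intercept. It only evaluates the objective; maximizing it would require an optimizer such as the nlm routine mentioned in the text. The toy data are made up.

```python
import math

def pql_reg_loglik(alpha, beta, X, ys):
    """Log-likelihood of the PQL regression model with mu_i = exp(x_i^T beta).

    X is a list of covariate rows, each assumed to start with 1.0 (intercept).
    """
    ll = 0.0
    for x, y in zip(X, ys):
        mu_i = math.exp(sum(b * xj for b, xj in zip(beta, x)))  # log link
        theta_i = (2.0 + alpha) / ((1.0 + alpha) * mu_i)
        ll += math.log(2.0 + alpha) - math.log((alpha + 1.0) ** 2 * mu_i)
        ll += math.log(alpha + theta_i * (y + alpha + 1.0))
        ll -= (y + 2) * math.log1p(theta_i)
    return ll

# Toy example: intercept plus one binary covariate (hypothetical data)
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ys = [0, 3, 1, 7]
print(pql_reg_loglik(0.5, [0.2, 0.8], X, ys))
```

With an intercept-only design and $\beta_0=\ln\mu$, the function reduces to the log-likelihood of the i.i.d. model, which gives a simple consistency check.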
Empirical study
In this section, the modeling ability of the PQL regression model is compared with that of the Poisson and negative-binomial (NB) regression models via an application to a real data set. The data contain the number of days of absence, gender and type of instructional program of 314 high school students from two urban high schools. The data set can be obtained from https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta. The response variable, the number of days of absence $y_i$, is modeled with gender $x_1$ (female = 1, male = 0) and type of instructional program (general = 1, academic = 2, vocational = 3). The vocational program is used as the baseline category for the type of instructional program. The general and academic instructional programs are coded as $x_2$ and $x_3$, respectively. To decide the best model, the estimated negative log-likelihood, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are used. The model with the lowest values of these statistics is the best-fitting model for the data set. The following regression model is fitted:
$$\mu_i=\exp\left(\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+\beta_3x_{3i}\right).$$
Figure 3 displays the distribution of days of absence. The mean of the response variable is 5.955 and its variance is 49.518, which is evidence of over-dispersion.
Fig. 3
The distribution of days of absence of students
Table 1 lists the estimated parameters of the models with the corresponding standard errors (SEs), the negative log-likelihood ($-\ell$), AIC and BIC values. Since the PQL regression model has the lowest values of these statistics, we conclude that it provides a better fit than the Poisson and NB regression models for this over-dispersed data set.
Table 1
The estimated parameters of models and goodness-of-fit statistics

| Covariates | Poisson Estimate | Poisson SE | Poisson p value | NB Estimate | NB SE | NB p value | PQL Estimate | PQL SE | PQL p value |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 1.323 | 0.089 | <0.001 | 1.271 | 0.214 | <0.001 | 1.273 | 0.215 | <0.001 |
| Gender | -0.234 | 0.047 | <0.001 | -0.193 | 0.123 | 0.118 | -0.191 | 0.122 | 0.119 |
| General | 1.374 | 0.076 | <0.001 | 1.362 | 0.199 | <0.001 | 1.348 | 0.198 | <0.001 |
| Academic | 0.957 | 0.066 | <0.001 | 0.949 | 0.140 | <0.001 | 0.945 | 0.138 | <0.001 |
| Dispersion | – | – | – | 1.017 | 0.104 | – | 4.999 | 5.139 | – |
| -ℓ | 1343.250 | | | 869.423 | | | 867.436 | | |
| AIC | 2694.500 | | | 1748.846 | | | 1744.872 | | |
| BIC | 2709.498 | | | 1767.593 | | | 1763.619 | | |
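The AIC and BIC values in Table 1 follow from the negative log-likelihoods through $\mathrm{AIC}=2(-\ell)+2p$ and $\mathrm{BIC}=2(-\ell)+p\ln n$. The short Python sketch below recomputes them; the parameter counts (4 for Poisson, 5 for NB and PQL, counting the dispersion parameter) are inferred from the table rather than stated in the text, so they should be read as assumptions.

```python
import math

def aic(negloglik, n_params):
    """Akaike Information Criterion from the negative log-likelihood."""
    return 2.0 * negloglik + 2.0 * n_params

def bic(negloglik, n_params, n_obs):
    """Bayesian Information Criterion from the negative log-likelihood."""
    return 2.0 * negloglik + n_params * math.log(n_obs)

n_obs = 314  # number of students in the data set
# (negative log-likelihood from Table 1, assumed number of free parameters)
models = {"Poisson": (1343.250, 4), "NB": (869.423, 5), "PQL": (867.436, 5)}
for name, (nll, p) in models.items():
    print(name, round(aic(nll, p), 3), round(bic(nll, p, n_obs), 3))
```

Because the NB and PQL models have the same number of parameters under this assumption, their ranking by AIC and BIC is determined entirely by the log-likelihood.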
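The observed information matrix of the PQL regression model, $I(\tau)$, is
$$I(\tau)=\begin{pmatrix}
269.303 & 398.549 & 37.894 & 151.312 & 0.058\\
398.549 & 656.873 & 53.206 & 224.836 & 0.117\\
37.894 & 53.206 & 37.895 & 0.001 & -0.084\\
151.312 & 224.836 & 0.001 & 151.305 & 0.043\\
0.058 & 0.117 & -0.084 & 0.043 & 0.382
\end{pmatrix}.$$
The diagonal elements of the inverse of $I(\tau)$ give the variances of the estimated parameters. The inverse of $I(\tau)$ is
$$I(\tau)^{-1}=\begin{pmatrix}
0.046 & -0.022 & -0.015 & -0.013 & -0.021\\
-0.022 & 0.015 & 0.001 & 0.001 & -0.008\\
-0.015 & 0.001 & 0.039 & 0.012 & 0.091\\
-0.013 & 0.001 & 0.013 & 0.019 & 0.025\\
-0.021 & -0.008 & 0.091 & 0.025 & 26.407
\end{pmatrix}.$$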
The asymptotic 95% confidence intervals of the regression parameters are $-0.430<\beta_1<0.049$, $0.961<\beta_2<1.736$ and $0.673<\beta_3<1.215$
, respectively. The estimated regression coefficients of the PQL regression model show that gender has no statistically significant effect on the days of absence. However, because of the log link, the expected days of absence for general and academic instructional program students are $\exp(1.348)\approx 3.85$ and $\exp(0.945)\approx 2.57$ times those of vocational instructional program students, respectively.
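The Wald intervals and the multiplicative reading of the log-link coefficients can be recomputed from the PQL estimates and standard errors in Table 1. The Python sketch below assumes a 95% confidence level, which the text does not state explicitly; the function name `wald_ci` is hypothetical. Under the log link, $\exp(\beta_j)$ is the ratio of expected counts for a one-unit change in the covariate.

```python
import math
from statistics import NormalDist

def wald_ci(estimate, se, level=0.95):
    """Two-sided Wald interval: estimate +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * se, estimate + z * se

# PQL estimates and standard errors taken from Table 1
pql = {"gender": (-0.191, 0.122), "general": (1.348, 0.198), "academic": (0.945, 0.138)}
for name, (b, se) in pql.items():
    lo, hi = wald_ci(b, se)
    # exp(b) is the rate ratio relative to the baseline category
    print(name, round(lo, 3), round(hi, 3), round(math.exp(b), 3))
```

An interval that contains zero, as for gender here, corresponds to a rate ratio that is not distinguishable from one at the chosen level.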
Conclusion
A re-parametrization of the Poisson quasi-Lindley distribution is introduced and studied. The parameter estimation problem for the Poisson quasi-Lindley distribution is investigated via an extensive simulation study. A new regression model for count data is proposed and compared with the Poisson and negative-binomial regression models on a real data set. We conclude that the Poisson quasi-Lindley regression model exhibits better fitting performance than the Poisson and negative-binomial regression models when the response variable is over-dispersed. We hope that the results given in this study will be helpful for researchers working in this field.
References
1. Bhati et al. (2017) A new count model generated from mixed Poisson transmuted exponential family with an application to health care data 46(22) (pp. 11060-11076) 10.1080/03610926.2016.1257712
2. Cheng et al. (2013) The Poisson–Weibull generalized linear model for analyzing motor vehicle crash data (pp. 38-42) 10.1016/j.ssci.2012.11.002
3. Déniz (2013) A new discrete distribution: properties and applications in medical care 40(12) (pp. 2760-2770) 10.1080/02664763.2013.827161
4. Farbod and Arzideh (2010) Asymptotic properties of moment estimators for distributions generated by Levy’s law 20(11) (pp. 55-59)
5. Gencturk and Yigiter (2016) Modelling claim number using a new mixture model: negative binomial gamma distribution 86(10) (pp. 1829-1839) 10.1080/00949655.2015.1085987
6. Grine and Zeghdoudi (2017) On Poisson quasi-Lindley distribution and its applications 16(2) 10.22237/jmasm/1509495660
7. Imoto et al. (2017) A modified Conway–Maxwell–Poisson type binomial distribution and its applications 46(24) (pp. 12210-12225) 10.1080/03610926.2017.1291974
8. Lord and Geedipally (2011) The negative binomial-Lindley distribution as a tool for analyzing crash data characterized by a large amount of zeros 43(5) (pp. 1738-1742) 10.1016/j.aap.2011.04.004
9. Mahmoudi and Zakerzadeh (2010) Generalized Poisson–Lindley distribution 39(10) (pp. 1785-1798) 10.1080/03610920902898514
10. Rodríguez-Avi et al. (2009) A generalized Waring regression model for count data 53(10) (pp. 3717-3725) 10.1016/j.csda.2009.03.013
11. Shmueli et al. (2005) A useful distribution for fitting discrete data: revival of the Conway–Maxwell–Poisson distribution 54(1) (pp. 127-142) 10.1111/j.1467-9876.2005.00474.x
12. Sáez-Castillo and Conde-Sánchez (2013) A hyper-Poisson regression model for overdispersed and underdispersed count data (pp. 148-157) 10.1016/j.csda.2012.12.009
13. Shoukri et al. (2004) The Poisson inverse Gaussian regression model in the analysis of clustered counts data 2(1) (pp. 17-32)
14. Shanker and Mishra (2013) A quasi Lindley distribution 6(4) (pp. 64-71)
15. Wongrin and Bodhisuwan (2017) Generalized Poisson–Lindley linear model for count data 44(15) (pp. 2659-2671) 10.1080/02664763.2016.1260095
16. Zamani et al. (2014) Poisson-weighted exponential univariate version and regression model with applications 10(2) (pp. 148-154) 10.3844/jmssp.2014.148.154