In this paper, a new regression model for count response variables is proposed via a re-parametrization of the Poisson quasi-Lindley distribution. Maximum likelihood and method of moments estimation are considered for the unknown parameters of the re-parametrized Poisson quasi-Lindley distribution. A simulation study is conducted to evaluate the efficiency of the estimation methods. A real data set is analyzed to demonstrate the usefulness of the proposed model against well-known regression models for count data, such as the Poisson and negative-binomial regression models. Empirical results show that when the response variable is over-dispersed, the proposed model provides better results than the competing models.

Introduction

Interest in count data modeling has increased greatly in the last decade. The most widely used distribution for modeling count data is the Poisson distribution. A well-known property of the Poisson distribution is that its mean and variance are equal. Therefore, the Poisson distribution is not appropriate in the case of over-dispersion or under-dispersion. In spite of this weakness, the Poisson distribution is widely used in many research fields, such as actuarial, environmental and economic sciences, because of its simple form, easy implementation and software support. To remove this drawback, researchers have shown great interest in introducing mixed-Poisson distributions for modeling over-dispersed or under-dispersed count data sets; see Bhati et al. [1], Imoto et al. [7], Mahmoudi and Zakerzadeh [9], Gencturk and Yigiter [5], Wongrin and Bodhisuwan [15], Déniz [3], Cheng et al. [2], Lord and Geedipally [8], Zamani et al. [16], Sáez-Castillo and Conde-Sánchez [12], Rodríguez-Avi et al. [10], Shmueli et al. [11] and Shoukri et al. [13].

As mentioned above, the Poisson distribution is insufficient for modeling over-dispersed count data sets. The main motivation of this study is to introduce an alternative regression model for such data. To this end, a re-parametrization of the Poisson quasi-Lindley (PQL) distribution, proposed by Grine and Zeghdoudi [6], is introduced and its statistical properties, such as the mean, the variance and the estimation of the model parameters, are studied. The maximum likelihood (ML) and method of moments (MM) estimation methods are considered to estimate the unknown parameters of the re-parametrized PQL distribution. The efficiencies of the estimation methods are compared by means of an extensive simulation study. Using the re-parametrized PQL distribution, a new regression model for over-dispersed count data sets is introduced. To demonstrate the effectiveness of the proposed regression model, a real data set on the days of absence of high school students is analyzed with the Poisson, negative-binomial and PQL regression models.

The rest of the paper is organized as follows. In the “Re-parametrization of Poisson quasi-Lindley distribution” section, the statistical properties of the re-parametrized Poisson quasi-Lindley distribution are obtained. In the “Estimation” section, the ML and MM estimation methods are considered to estimate the unknown model parameters. In the “Simulation” section, the finite sample performance of the estimation methods is compared via a Monte Carlo simulation study. In the “Poisson quasi-Lindley regression model” section, a new regression model is introduced. In the “Empirical study” section, a real data set is analyzed to demonstrate the usefulness of the proposed model against the Poisson and negative-binomial regression models. The “Conclusion” section contains the concluding remarks.

Re-parametrization of Poisson quasi-Lindley distribution

Let the random variable X follow a Poisson distribution. Its probability mass function (pmf) is

$$P(x;\lambda)=\frac{\exp(-\lambda)\lambda^{x}}{x!},\quad x=0,1,2,\ldots,$$

where $\lambda>0$. The mean and variance of the Poisson distribution are $E(X)=\lambda$ and $\mathrm{Var}(X)=\lambda$, respectively. So, the dispersion index (DI) of the Poisson distribution is $DI=\mathrm{Var}(X)/E(X)=\lambda/\lambda=1$. As the dispersion index shows, over-dispersed or under-dispersed data sets cannot be modeled by the Poisson distribution. Over-dispersion occurs when the variance is greater than the mean; under-dispersion occurs when the variance is smaller than the mean. Grine and Zeghdoudi [6] introduced a new mixed-Poisson distribution, called Poisson quasi-Lindley (PQL), by compounding the Poisson distribution with the quasi-Lindley distribution introduced by Shanker and Mishra [14]. The pmf of the PQL distribution is given by

$$P(Y=y)=\frac{\theta}{\alpha+1}\,\frac{\alpha(\theta+1)+\theta(y+1)}{(\theta+1)^{y+2}},\quad y=0,1,2,\ldots, \tag{1}$$

where $\theta>0$ and $\alpha>-1$. Hereafter, the random variable Y will be denoted as $PQL(\theta,\alpha)$. The cumulative distribution function (cdf) corresponding to (1) is

$$F(y)=P(Y\le y)=1-\frac{\alpha+2\theta+\alpha\theta+\theta y+1}{(1+\alpha)(1+\theta)^{y+2}}. \tag{2}$$

The mean and variance of the PQL distribution are given by, respectively,

$$E(Y)=\frac{2+\alpha}{(1+\alpha)\theta}, \tag{3}$$

$$\mathrm{Var}(Y)=\frac{2+4\alpha+\alpha^{2}+\theta(\alpha+2)(\alpha+1)}{(\alpha+1)^{2}\theta^{2}}. \tag{4}$$

Here, a re-parametrization of the PQL distribution is considered. The motivation for the re-parametrization comes from the generalized linear model approach, in which the covariates are linked to the mean of the response variable.

Proposition 1

Let $\theta=(2+\alpha)/[(1+\alpha)\mu]$; then the pmf of the PQL distribution is

$$P(Y=y)=\frac{2+\alpha}{(\alpha+1)^{2}\mu}\,\frac{\alpha+(2+\alpha)(\mu+\alpha\mu)^{-1}(y+\alpha+1)}{\left[1+(2+\alpha)(\mu+\alpha\mu)^{-1}\right]^{y+2}},\quad y=0,1,2,\ldots, \tag{5}$$

where $\alpha>0$ and $\mu>0$. The mean and variance of (5) are given by, respectively,

$$E(Y)=\mu,\quad \mathrm{Var}(Y)=\underbrace{\mu}_{I}+\underbrace{\frac{\mu^{2}}{(2+\alpha)^{2}}\left(2+4\alpha+\alpha^{2}\right)}_{II}. \tag{6}$$

Restricting the parameter α to be greater than zero ensures that the second part of the variance, and hence the variance itself, is positive. The other statistical properties of the PQL distribution under the above re-parametrization, such as the probability and moment generating functions, the mode and the cdf, can be obtained from the results in Grine and Zeghdoudi [6]. As seen from (6), since the second part of the variance is greater than zero for all values of the parameters α and μ, the variance of the PQL distribution is always greater than its mean. Therefore, the PQL distribution can be a good choice for modeling over-dispersed data sets.
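As a numerical check of Proposition 1, the following Python sketch evaluates the re-parametrized pmf and confirms that it sums to one, has mean μ and has the variance stated above. The function names, the parameter values and the truncation of the support at 500 are illustrative assumptions, not part of the original paper, whose own computations use R.

```python
import math

def pql_pmf(y, alpha, mu):
    """Re-parametrized PQL pmf of Proposition 1 (alpha > 0, mu > 0)."""
    c = (2 + alpha) / ((1 + alpha) * mu)      # theta expressed through (alpha, mu)
    num = alpha + c * (y + alpha + 1)
    # exp/log form avoids overflow of (1 + c) ** (y + 2) for large y
    return (2 + alpha) / ((alpha + 1) ** 2 * mu) * num * math.exp(-(y + 2) * math.log1p(c))

def pql_variance(alpha, mu):
    """Variance of the re-parametrized PQL distribution."""
    return mu + mu ** 2 * (2 + 4 * alpha + alpha ** 2) / (2 + alpha) ** 2

alpha, mu = 0.5, 1.5                          # illustrative values
probs = [pql_pmf(y, alpha, mu) for y in range(500)]   # tail beyond 500 is negligible here
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print("sum of pmf:", total)
print("mean:", mean, "target:", mu)
print("variance:", var, "formula:", pql_variance(alpha, mu))
print("dispersion index:", var / mean)
```

Because the second variance term is strictly positive, the printed dispersion index exceeds one for any admissible pair (α, μ).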

Figure 1 displays the dispersion index and possible pmf shapes of the PQL distribution. As the parameters α and μ increase, the dispersion of the PQL distribution increases. Note that the effect of the parameter μ on the dispersion is greater than that of the parameter α. As seen from the left side of Fig. 1, the PQL distribution can be a good choice for modeling extremely right-skewed data sets.

Fig. 1 The dispersion index (right) and the pmf shapes of the PQL distribution (left) for some values of α and μ

Generating random variables from Poisson-xgamma distribution

Here, a general algorithm and the corresponding code written in R are given to generate random variables from the PQL distribution. The same approach can be used for any discrete distribution, such as the Poisson, Poisson–Lindley and negative-binomial distributions.
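A minimal Python sketch of one such general algorithm, inverse-transform sampling from a pmf, is given here for illustration. It is not the authors' R code: the names `pql_pmf` and `rdiscrete`, the safety cap `max_y` and the seed are hypothetical choices, and the method is assumed to be plain inversion (accumulate the pmf until it reaches a uniform draw).

```python
import math
import random

def pql_pmf(y, alpha, mu):
    """Re-parametrized PQL pmf (alpha > 0, mu > 0)."""
    c = (2 + alpha) / ((1 + alpha) * mu)
    num = alpha + c * (y + alpha + 1)
    return (2 + alpha) / ((alpha + 1) ** 2 * mu) * num * math.exp(-(y + 2) * math.log1p(c))

def rdiscrete(n, pmf, rng, max_y=100_000):
    """Draw n values from a pmf on {0, 1, 2, ...} by inversion.

    For each uniform u, return the smallest y with F(y) >= u.
    max_y is only a guard against floating-point round-off in the tail.
    """
    draws = []
    for _ in range(n):
        u = rng.random()
        y, cdf = 0, pmf(0)
        while cdf < u and y < max_y:
            y += 1
            cdf += pmf(y)
        draws.append(y)
    return draws

rng = random.Random(1)                                   # fixed seed for reproducibility
sample = rdiscrete(5000, lambda y: pql_pmf(y, 0.5, 1.5), rng)
print("sample mean:", sum(sample) / len(sample))          # should be near mu = 1.5
```

Only the `pmf` argument is specific to the PQL distribution; passing another pmf on the non-negative integers reuses the same sampler.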

Estimation

In this section, ML and MM estimation methods are considered to estimate the unknown parameters of PQL distribution.

Maximum likelihood estimation

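Let $Y_1,Y_2,\ldots,Y_n$ be independent and identically distributed PQL random variables with observed values $y_1,y_2,\ldots,y_n$. The log-likelihood function is

$$\ell(\alpha,\mu)=n\ln\left[\frac{2+\alpha}{(\alpha+1)^{2}\mu}\right]+\sum_{i=1}^{n}\ln\left[\alpha+(2+\alpha)(\mu+\alpha\mu)^{-1}(y_i+\alpha+1)\right]-\ln\left[1+(2+\alpha)(\mu+\alpha\mu)^{-1}\right]\sum_{i=1}^{n}(y_i+2). \tag{7}$$

Taking partial derivatives of (7) with respect to α and μ, we have

$$\frac{\partial\ell}{\partial\alpha}=n\left[\frac{\alpha+2}{(\alpha+1)^{2}\mu}\right]^{-1}\left[\frac{1}{(\alpha+1)^{2}\mu}-\frac{2(\alpha+2)}{(\alpha+1)^{3}\mu}\right]-\left[\frac{\alpha+2}{\alpha\mu+\mu}+1\right]^{-1}\left[\frac{1}{\alpha\mu+\mu}-\frac{(\alpha+2)\mu}{(\alpha\mu+\mu)^{2}}\right]\sum_{i=1}^{n}(y_i+2)+\sum_{i=1}^{n}\frac{\dfrac{\alpha+2}{\alpha\mu+\mu}-\dfrac{(\alpha+2)\mu(y_i+\alpha+1)}{(\alpha\mu+\mu)^{2}}+\dfrac{y_i+\alpha+1}{\alpha\mu+\mu}+1}{\alpha+\dfrac{(\alpha+2)(y_i+\alpha+1)}{\alpha\mu+\mu}}, \tag{8}$$

$$\frac{\partial\ell}{\partial\mu}=\sum_{i=1}^{n}\frac{(\alpha+1)(\alpha+2)(y_i+2)}{(\alpha\mu+\mu)^{2}\left[\dfrac{\alpha+2}{\alpha\mu+\mu}+1\right]}-\sum_{i=1}^{n}\frac{(\alpha+1)(\alpha+2)(y_i+\alpha+1)}{(\alpha\mu+\mu)^{2}\left[\dfrac{(\alpha+2)(y_i+\alpha+1)}{\alpha\mu+\mu}+\alpha\right]}-\frac{n}{\mu}. \tag{9}$$

The ML estimates of $(\alpha,\mu)$ can be obtained by setting (8) and (9) equal to zero and solving them simultaneously. It is not possible to obtain explicit forms of the ML estimates of the PQL distribution, since the likelihood equations contain nonlinear functions. For this reason, nonlinear minimization tools are needed to solve these equations. The nonlinear minimization (nlm) function of R is used for this purpose. The corresponding interval estimates of the parameters are obtained by means of the observed information matrix, which is given by

$$I_F(\tau)=-\begin{pmatrix}I_{\alpha\alpha}&I_{\alpha\mu}\\ I_{\mu\alpha}&I_{\mu\mu}\end{pmatrix}.$$

The elements of the observed information matrix are available upon request from the authors. It is well known that, under the regularity conditions that are fulfilled for the parameters, the asymptotic joint distribution of $(\hat{\alpha},\hat{\mu})$ as $n\to\infty$ is a bivariate normal distribution with mean $(\alpha,\mu)$ and variance–covariance matrix $I_F^{-1}(\tau)$. Using the asymptotic normality, the asymptotic $100(1-p)\%$ confidence intervals for the parameters α and μ, respectively, are given by

$$\hat{\alpha}\pm z_{p/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\alpha})},\quad \hat{\mu}\pm z_{p/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\mu})},$$

where $z_{p/2}$ is the upper p/2 quantile of the standard normal distribution.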
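The log-likelihood and its two partial derivatives can be checked against each other numerically. The Python sketch below codes the log-likelihood of the re-parametrized PQL distribution and its score in terms of an auxiliary quantity c = (α+2)/(αμ+μ), then compares the analytic score with central finite differences. The helper names, the symbol c, the toy data and the step size are assumptions introduced for illustration; the paper itself relies on R's nlm for the optimization.

```python
import math

def loglik(alpha, mu, ys):
    """Log-likelihood of an i.i.d. PQL(alpha, mu) sample."""
    n = len(ys)
    c = (2 + alpha) / (alpha * mu + mu)
    out = n * math.log((2 + alpha) / ((alpha + 1) ** 2 * mu))
    out += sum(math.log(alpha + c * (y + alpha + 1)) for y in ys)
    out -= math.log(1 + c) * sum(y + 2 for y in ys)
    return out

def score(alpha, mu, ys):
    """Analytic partial derivatives of loglik with respect to alpha and mu."""
    n = len(ys)
    d = alpha * mu + mu
    c = (2 + alpha) / d
    c_a = 1 / d - (2 + alpha) * mu / d ** 2        # dc/dalpha
    c_m = -(2 + alpha) * (alpha + 1) / d ** 2      # dc/dmu
    s2 = sum(y + 2 for y in ys)
    d_alpha = (n / (alpha + 2) - 2 * n / (alpha + 1)
               - c_a / (1 + c) * s2
               + sum((1 + c + c_a * (y + alpha + 1)) / (alpha + c * (y + alpha + 1))
                     for y in ys))
    d_mu = (-n / mu
            - c_m / (1 + c) * s2
            + sum(c_m * (y + alpha + 1) / (alpha + c * (y + alpha + 1)) for y in ys))
    return d_alpha, d_mu

ys = [0, 1, 1, 3, 7, 2, 0, 5]                      # toy data, not from the paper
alpha, mu, h = 0.8, 2.0, 1e-6
num_a = (loglik(alpha + h, mu, ys) - loglik(alpha - h, mu, ys)) / (2 * h)
num_m = (loglik(alpha, mu + h, ys) - loglik(alpha, mu - h, ys)) / (2 * h)
d_alpha, d_mu = score(alpha, mu, ys)
print("d/dalpha analytic vs numeric:", d_alpha, num_a)
print("d/dmu    analytic vs numeric:", d_mu, num_m)
```

Agreement of the two columns is a cheap safeguard before passing the negative log-likelihood to a numerical optimizer.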

Method of moments

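The MM estimators of the parameters α and μ can be obtained by equating the mean and variance of the PQL distribution to the sample mean and variance, as follows:

$$\bar{y}=\mu, \tag{10}$$

$$s^{2}=\mu+\frac{\mu^{2}}{(2+\alpha)^{2}}\left(2+4\alpha+\alpha^{2}\right), \tag{11}$$

where $\bar{y}$ and $s^{2}$ are the sample mean and variance, respectively. Solving (10) and (11) simultaneously, we have

$$\hat{\mu}_{MM}=\bar{y}, \tag{12}$$

$$\hat{\alpha}_{MM}=\sqrt{\frac{2\bar{y}^{2}}{\bar{y}^{2}+\bar{y}-s^{2}}}-2. \tag{13}$$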
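The MM estimators are simple enough to code directly. The Python sketch below computes them from the sample moments; the function names are hypothetical, and the use of the n − 1 divisor for the sample variance is an assumption, since the text does not state which divisor is used. Note that the expression under the square root is positive only when $s^{2}<\bar{y}+\bar{y}^{2}$, so the sketch raises an error otherwise.

```python
import math

def mm_from_moments(ybar, s2):
    """MM estimates (alpha_hat, mu_hat) from the sample mean and variance."""
    denom = ybar ** 2 + ybar - s2
    if denom <= 0:
        raise ValueError("MM estimate of alpha undefined: need s2 < ybar + ybar**2")
    alpha_hat = math.sqrt(2 * ybar ** 2 / denom) - 2
    return alpha_hat, ybar

def mm_estimates(ys):
    """MM estimates from raw counts (sample variance with divisor n - 1, an assumption)."""
    n = len(ys)
    ybar = sum(ys) / n
    s2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    return mm_from_moments(ybar, s2)

# Round trip: for alpha = 1 and mu = 2 the model variance is 2 + 4 * 7 / 9 = 46 / 9,
# so feeding these exact moments back should recover alpha = 1.
alpha_hat, mu_hat = mm_from_moments(2.0, 46 / 9)
print(alpha_hat, mu_hat)
```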

Theorem 1

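For fixed values of μ, the MM estimator $\hat{\alpha}_{MM}$ of α is consistent and asymptotically normally distributed:

$$\sqrt{n}\left(\bar{X}-\mu\right)\xrightarrow{d}N\left(0,\upsilon^{2}(\theta)\right), \tag{14}$$

where

$$\upsilon^{2}(\theta)=\frac{2+4\alpha+\alpha^{2}+\theta(\alpha+2)(\alpha+1)}{(\alpha+1)^{2}\theta^{2}}. \tag{15}$$

Detailed information about the asymptotic properties of MM estimators can be found in Farbod and Arzideh [4].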

Simulation

In this section, a Monte Carlo simulation study is conducted to evaluate the finite sample performance of the ML and MM estimates of the PQL distribution. The following simulation procedure is used.

1. Set the sample size n and the vector of parameters $\theta=(\alpha,\mu)^{T}$;
2. Generate random observations of size n from the $PQL(\alpha,\mu)$ distribution, using the algorithm given in the “Generating random variables from Poisson-xgamma distribution” section;
3. Use the random observations generated in Step 2 to estimate θ by means of the ML and MM estimation methods;
4. Repeat Steps 2 and 3 N times;
5. Use $\hat{\theta}$ and θ to calculate the biases, mean relative estimates (MREs) and mean square errors (MSEs) from the following equations:

$$\mathrm{Bias}=\sum_{j=1}^{N}\frac{\hat{\theta}_{i,j}-\theta_i}{N},\quad \mathrm{MRE}=\sum_{j=1}^{N}\frac{\hat{\theta}_{i,j}/\theta_i}{N},\quad \mathrm{MSE}=\sum_{j=1}^{N}\frac{\left(\hat{\theta}_{i,j}-\theta_i\right)^{2}}{N},\quad i=1,2.$$
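The simulation procedure above can be sketched in a few lines of Python. To keep the example short, only the MM estimator of μ (the sample mean) is evaluated, and far fewer replications are used than in the paper; the ML estimator and the parameter α would be handled by the same `summarize` function. The sampler inverts the closed-form cdf of the PQL distribution. All names, the seed and the reduced settings n = 100 and N = 300 are illustrative assumptions.

```python
import random

def rpql(n, alpha, mu, rng):
    """Draw n PQL(alpha, mu) values by inverting the closed-form cdf."""
    theta = (2 + alpha) / ((1 + alpha) * mu)
    draws = []
    for _ in range(n):
        u = rng.random()
        y = 0
        while True:
            tail = (alpha + 2 * theta + alpha * theta + theta * y + 1) / (
                (1 + alpha) * (1 + theta) ** (y + 2))
            if 1 - tail >= u:          # F(y) >= u: smallest such y is the draw
                break
            y += 1
        draws.append(y)
    return draws

def summarize(estimates, true_value):
    """Bias, MRE and MSE of a list of estimates, as defined in Step 5."""
    N = len(estimates)
    bias = sum(e - true_value for e in estimates) / N
    mre = sum(e / true_value for e in estimates) / N
    mse = sum((e - true_value) ** 2 for e in estimates) / N
    return bias, mre, mse

rng = random.Random(2024)                  # Step 1: settings (reduced for speed)
alpha, mu, n, N = 0.5, 1.5, 100, 300
mu_hats = []
for _ in range(N):                         # Step 4: repeat N times
    sample = rpql(n, alpha, mu, rng)       # Step 2: generate data
    mu_hats.append(sum(sample) / n)        # Step 3: MM estimate of mu
bias, mre, mse = summarize(mu_hats, mu)    # Step 5
print("bias:", bias, "MRE:", mre, "MSE:", mse)
```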

Figure 2 displays the results of the simulation study performed under the above procedure. The following settings are considered: $\theta=(\alpha=0.5,\mu=1.5)^{T}$, $N=10{,}000$ and $n=40,45,50,\ldots,500$. When n is sufficiently large, the MREs should be close to one and the MSEs and biases should be close to zero. As seen from Fig. 2, as the sample size n increases, the MSEs and biases approach zero and the MREs approach one for both estimation methods. The MM and ML estimation methods yield similar results for the parameter μ in terms of estimated MSE, bias and MRE. However, the ML estimation method provides more satisfactory results for the parameter α, especially for small sample sizes. Therefore, we suggest using the ML estimation method when the sample size is small.

Fig. 2 Estimated biases, MSEs and MREs of the parameters of the PQL distribution based on the ML and MM estimation methods

Poisson quasi-Lindley regression model

The Poisson and negative-binomial regression models are the two most commonly used models for count data. When the response variable is not equi-dispersed, the negative-binomial regression model is preferable. Here, an alternative regression model is introduced for over-dispersed response variables.

Let the random variable Y follow a PQL distribution with the pmf given in (5). The mean of Y is $E(Y\mid\alpha,\mu)=\mu$. Therefore, the covariates can be linked to the mean of the response variable by means of the log-link function,

$$\mu_i=\exp\left(x_i^{T}\beta\right),\quad i=1,\ldots,n, \tag{16}$$

where $x_i^{T}=(1,x_{i1},x_{i2},\ldots,x_{ik})$ is the vector of covariates, with a leading one for the intercept, and $\beta=(\beta_0,\beta_1,\beta_2,\ldots,\beta_k)^{T}$ is the unknown vector of regression coefficients. Inserting (16) in (5), the log-likelihood function is obtained as

$$\ell(\tau)=n\ln(2+\alpha)-\sum_{i=1}^{n}\ln\left[(\alpha+1)^{2}\exp\left(x_i^{T}\beta\right)\right]+\sum_{i=1}^{n}\ln\left[\alpha+(2+\alpha)\left(\exp\left(x_i^{T}\beta\right)+\alpha\exp\left(x_i^{T}\beta\right)\right)^{-1}(y_i+\alpha+1)\right]-\sum_{i=1}^{n}(y_i+2)\ln\left[1+(2+\alpha)\left(\exp\left(x_i^{T}\beta\right)+\alpha\exp\left(x_i^{T}\beta\right)\right)^{-1}\right], \tag{17}$$

where $\tau=(\alpha,\beta)^{T}$. The unknown parameters, α and $\beta=(\beta_0,\beta_1,\beta_2,\ldots,\beta_k)^{T}$, are estimated by maximizing the log-likelihood $\ell(\tau)$ with the nlm function of R. Under standard regularity conditions, the asymptotic distribution of $(\hat{\tau}-\tau)$ is multivariate normal $N_{k+2}(0,J(\tau)^{-1})$, where $J(\tau)$ is the expected information matrix. The asymptotic covariance matrix $J(\tau)^{-1}$ of $\hat{\tau}$ can be approximated by the inverse of the $(k+2)\times(k+2)$ observed information matrix $I(\tau)$, whose elements can be evaluated numerically in most statistical packages. The approximate multivariate normal distribution $N_{k+2}(0,I(\tau)^{-1})$ for $\hat{\tau}-\tau$ can be used to construct asymptotic confidence intervals for the vector of parameters τ.
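The regression log-likelihood is the sum of PQL log-pmf terms with the observation-specific mean $\mu_i=\exp(x_i^{T}\beta)$. A minimal Python sketch is given below; the function name, the convention that each row of `X` starts with a one for the intercept, and the toy data are assumptions made for illustration. In practice the negative of this function would be handed to a numerical minimizer, as the paper does with R's nlm.

```python
import math

def pql_reg_loglik(alpha, beta, X, y):
    """Log-likelihood of the PQL regression model with log link.

    alpha : dispersion parameter (> 0)
    beta  : list of coefficients, beta[0] being the intercept
    X     : list of covariate rows, each starting with 1.0
    y     : list of observed counts
    """
    total = 0.0
    for xi, yi in zip(X, y):
        mu_i = math.exp(sum(b * x for b, x in zip(beta, xi)))   # log link
        c = (2 + alpha) / ((alpha + 1) * mu_i)
        total += (math.log(2 + alpha)
                  - math.log((alpha + 1) ** 2 * mu_i)
                  + math.log(alpha + c * (yi + alpha + 1))
                  - (yi + 2) * math.log(1 + c))
    return total

# Toy data (hypothetical): intercept plus one binary covariate.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [3, 0, 7, 2]
print(pql_reg_loglik(1.2, [1.0, -0.5], X, y))
```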

Empirical study

In this section, the modeling ability of the PQL regression model is compared with that of the Poisson and negative-binomial (NB) regression models via an application to a real data set. The data contain the number of days of absence, gender and type of instructional program of 314 high school students from two urban high schools. The data set can be obtained from https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta . The response variable, the number of days of absence $y_i$, is modeled with gender (female = 1, male = 0), denoted $x_1$, and type of instructional program (general = 1, academic = 2, vocational = 3). The vocational program is used as the baseline category for the type of instructional program. The general and academic instructional programs are coded as the indicator variables $x_2$ and $x_3$, respectively. To decide the best model, the estimated negative log-likelihood, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are used. The model with the lowest values of these statistics is the best-fitting model for the data set. The following regression model is fitted.

$$\mu_i=\exp\left(\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+\beta_3x_{3i}\right)$$

Figure 3 displays the distribution of the days of absence. The mean of the response variable is 5.955 and its variance is 49.518, which is evidence of over-dispersion.
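As a quick worked check using the two summary statistics just quoted, the sample dispersion index is

$$\widehat{DI}=\frac{s^{2}}{\bar{y}}=\frac{49.518}{5.955}\approx 8.32,$$

which is well above the value of one implied by the Poisson distribution.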

Fig. 3 The distribution of days of absence of students

Table 1 lists the estimated parameters of the models with the corresponding standard errors (SEs), the estimated negative log-likelihood $-\ell$, and the AIC and BIC values. Since the PQL regression model has the lowest values of these statistics, we conclude that the PQL regression model provides a better fit to this over-dispersed data set than the Poisson and NB regression models.

Table 1

The estimated parameters of models and goodness-of-fit statistics

Covariates	Poisson			NB			PQL		
	Estimate	SE	p value	Estimate	SE	p value	Estimate	SE	p value
Intercept	1.323	0.089	<0.001	1.271	0.214	<0.001	1.273	0.215	<0.001
Gender	-0.234	0.047	<0.001	-0.193	0.123	0.118	-0.191	0.122	0.119
General	1.374	0.076	<0.001	1.362	0.199	<0.001	1.348	0.198	<0.001
Academic	0.957	0.066	<0.001	0.949	0.140	<0.001	0.945	0.138	<0.001
Dispersion	–	–	–	1.017	0.104	–	4.999	5.139	–
-ℓ	1343.250			869.423			867.436
AIC	2694.500			1748.846			1744.872
BIC	2709.498			1767.593			1763.619
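The AIC and BIC rows of Table 1 follow from the reported negative log-likelihoods. The Python sketch below recomputes them with the standard definitions AIC = 2k − 2ℓ and BIC = k ln(n) − 2ℓ. The parameter counts k (four for the Poisson model, five for the NB and PQL models, which each add one dispersion parameter) are an assumption inferred from the table rather than stated in the text; with n = 314 they reproduce the tabulated values to three decimals.

```python
import math

def aic(neg_loglik, k):
    """Akaike Information Criterion from the negative log-likelihood."""
    return 2 * neg_loglik + 2 * k

def bic(neg_loglik, k, n):
    """Bayesian Information Criterion from the negative log-likelihood."""
    return 2 * neg_loglik + k * math.log(n)

n = 314                                   # number of students
models = {                                # name: (negative log-likelihood, assumed k)
    "Poisson": (1343.250, 4),
    "NB": (869.423, 5),
    "PQL": (867.436, 5),
}
for name, (nll, k) in models.items():
    print(name, round(aic(nll, k), 3), round(bic(nll, k, n), 3))
```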

The observed information matrix of the PQL regression model, $I(\tau)$, is

$$I(\tau)=\begin{pmatrix}269.303&398.549&37.894&151.312&0.058\\ 398.549&656.873&53.206&224.836&0.117\\ 37.894&53.206&37.895&0.001&-0.084\\ 151.312&224.836&0.001&151.305&0.043\\ 0.058&0.117&-0.084&0.043&0.382\end{pmatrix}.$$

The diagonal elements of the inverse of $I(\tau)$ give the variances of the estimated parameters. The inverse of $I(\tau)$ is

$$I(\tau)^{-1}=\begin{pmatrix}0.046&-0.022&-0.015&-0.013&-0.021\\ -0.022&0.015&0.001&0.001&-0.008\\ -0.015&0.001&0.039&0.012&0.091\\ -0.013&0.001&0.013&0.019&0.025\\ -0.021&-0.008&0.091&0.025&26.407\end{pmatrix}.$$

The asymptotic confidence intervals of the regression parameters are $-0.430<\beta_1<0.049$, $0.961<\beta_2<1.735$ and $0.673<\beta_3<1.215$, respectively. As seen from the estimated regression coefficients of the PQL regression model, gender has no statistically significant effect on the days of absence of students. However, because of the log link, the expected days of absence for students in the general and academic instructional programs are $\exp(1.348)\approx3.85$ and $\exp(0.945)\approx2.57$ times those of students in the vocational instructional program, respectively.
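The interval bounds and multiplicative effects can be recomputed from the PQL column of Table 1. The Python sketch below forms Wald intervals, estimate ± z × SE, and exponentiates the coefficients to obtain ratios of expected counts under the log link. The 95% level (z ≈ 1.96) is an assumption; it is not stated in the text but agrees with the reported bounds up to rounding of the tabulated estimates and SEs. The helper name is hypothetical.

```python
import math

Z_975 = 1.959964          # 97.5% standard normal quantile (assumed 95% level)

def wald_ci(estimate, se, z=Z_975):
    """Asymptotic Wald confidence interval: estimate +/- z * SE."""
    return estimate - z * se, estimate + z * se

# PQL estimates and SEs as reported in Table 1.
pql = {
    "Gender": (-0.191, 0.122),
    "General": (1.348, 0.198),
    "Academic": (0.945, 0.138),
}
for name, (est, se) in pql.items():
    lo, hi = wald_ci(est, se)
    ratio = math.exp(est)   # ratio of expected counts relative to the baseline category
    print(f"{name}: CI = ({lo:.3f}, {hi:.3f}), exp(estimate) = {ratio:.3f}")
```

An interval that contains zero, as for gender, corresponds to a ratio whose interval contains one, that is, no significant multiplicative effect.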

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A new model for over-dispersed count data: Poisson quasi-Lindley regression model
