A Study of Count Regression Models for Mortality Rate

In this study, the Poisson regression model, the Negative Binomial 1 regression model (NEGBIN 1), and the Negative Binomial 2 regression model (NEGBIN 2) were fitted to mortality rate data. The models were compared using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) to find out which one suits the data best. The results show that the data are indeed overdispersed, that is, they display higher variability than the Poisson model allows. Among the three models, NEGBIN 1 is preferred.


INTRODUCTION
Count data record how many times something has happened, such as the number of cases of a particular disease in epidemiology [1]. Linear regression models have often been applied to this kind of data, but the resulting estimates are inefficient, inconsistent, and biased. Mortality data are count data that come with an offset variable.
[2] studied the contribution of ischemic heart disease (IHD) to mortality in middle-aged men. The results showed that 46 of 109 deaths during about 11.4 years of follow-up were due to IHD. Turning to causes of death other than IHD, [3] researched the global impact of HIV/AIDS. Another study on mortality was conducted by [4] on diarrheal disease. It found that diarrhea causes 1 in 9 child deaths worldwide, making it the second leading cause of death among children under 5 years of age. In addition, [5] examined the global causes of death due to disease in children under 5 years. In their study, diarrhea remained the second leading infectious cause of death in children over the last 30 years.
In addition, malnutrition is one of the world's most worrying problems. It contributes to about 6 million child deaths every year. [6] reported that poor nutrition during fetal development can cause severe physical damage and that malnutrition increases susceptibility to disease. [7] stated that malnutrition (measured as poor anthropometric status) accounted for nearly 50% of childhood deaths.
Beyond mortality due to disease, [8] stated that the trend of injuries and deaths from road traffic accidents (RTA) is becoming severe in countries such as India. Not a day goes by without an RTA in India, and many people die or become disabled. Suicide is another factor that contributes to the death rate. According to [9], suicidal behavior has long been a major health problem in both developed and developing countries.
The Poisson regression model is one of the generalized linear models for data with offset variables. It is also the standard model for count data and contingency tables. In this model, the response variable is assumed to have a Poisson distribution. Negative Binomial regression is likewise a generalized linear model in which the dependent variable is a count of events. The Negative Binomial distribution is a two-parameter distribution that is generally more flexible than the Poisson model [2]. It can also model overdispersed counts, which the Poisson model cannot. The Negative Binomial model can be derived from the Poisson distribution and the generalized Poisson distribution.
[10] discussed several other specific mortality measures, such as age-specific crude death rates, cause-specific mortality rates, and infant and maternal mortality rates. In the data collection process, measurements may be biased or inaccurate, and such inaccuracy can cause overdispersion. This study aims to identify the most suitable model for mortality data, which are usually overdispersed.

Data
The data used in this study are mortality rate data available in [11]. The data consist of 163 observations (countries) with seven independent variables: the number of people dying per 100,000 live births due to IHD ($x_1$), diarrheal disease ($x_2$), HIV/AIDS ($x_3$), malaria ($x_4$), malnutrition ($x_5$), road accidents ($x_6$), and suicides ($x_7$).

Count Regression Models
According to [12], count regression models have been suggested for modeling over-dispersed and zero-inflated count response variables. Poisson regression is the standard model for count data, while the Negative Binomial regression model is often introduced to handle count data with overdispersion. Meanwhile, the zero-inflated Poisson model (ZIP) and the zero-inflated Negative Binomial model (ZINB) are introduced to handle zero-inflated variables, in which the data contain many zeros. Moreover, [13] showed that ZIP and ZINB can be obtained by mixing a distribution degenerate at zero with a Poisson regression and a Negative Binomial regression, respectively.
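The probability mass function of the ZIP is
$$P(Y_i = y_i) = \begin{cases} \pi_i + (1-\pi_i)\,e^{-\mu_i}, & y_i = 0 \\ (1-\pi_i)\,\dfrac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}, & y_i = 1, 2, \ldots \end{cases} \tag{1}$$
Meanwhile, the ZINB's probability mass function can be formulated as
$$P(Y_i = y_i) = \begin{cases} \pi_i + (1-\pi_i)\,r_i^{\theta}, & y_i = 0 \\ (1-\pi_i)\,\dfrac{\Gamma(\theta + y_i)}{\Gamma(y_i + 1)\,\Gamma(\theta)}\,r_i^{\theta}(1-r_i)^{y_i}, & y_i = 1, 2, \ldots \end{cases} \tag{2}$$
with $\pi_i = F(z_i'\gamma)$, where $\mu_i$ is the mean of the underlying count distribution, $\theta > 0$ is the Negative Binomial parameter, and $r_i = \theta/(\theta + \mu_i)$. The zeros arise with probability $\pi_i$ from a second process. The function $F$ that relates the product $z_i'\gamma$ to the probability $\pi_i$ is called the zero-inflated link function.

Poisson Regression Model
[14] studied the Poisson regression model as the standard model for count data. Let $Y_i$ be a count of events. The marginal probability of Poisson regression is written as
$$P(Y_i = y_i \mid x_i) = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!} \tag{3}$$
with $\mu_i = \exp(\beta_0 + x_i'\beta)$; $y_i = 0, 1, \ldots$. The rate parameter of Poisson regression is $\mu_i$, also known as its expected count, and is formulated as
$$\mu_i = E(y_i \mid x_i) = \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \tag{4}$$
Based on Equation (4), the log-linear model for the mean rate is written as
$$\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \tag{5}$$
where $p$ is the number of predictors or covariates in the model, $\beta_0$ is the intercept of the regression, $\beta_1, \ldots, \beta_p$ are the regression coefficients, and $x_{ij}$ is the $j$-th independent variable.
[14] formulated the Maximum Likelihood Estimation (MLE) of Poisson regression. Let $Y$ be a random variable with a Poisson distribution and an unknown parameter $\mu$. The probability mass function of $Y$ is written $f(y; \mu)$ to emphasize the parameter, and $n$ independent trials yield the data $y_1, y_2, y_3, \ldots, y_n$, observed in the form $Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n$. The joint probability mass function is
$$f(y_1, \ldots, y_n; \mu) = \prod_{i=1}^{n} \frac{e^{-\mu}\mu^{y_i}}{y_i!} \tag{6}$$
The likelihood of $\mu$ given the data $y_1, \ldots, y_n$ is obtained from Equation (6):
$$L(\mu; y_1, \ldots, y_n) = \frac{e^{-n\mu}\,\mu^{\sum_{i=1}^{n} y_i}}{\prod_{i=1}^{n} y_i!} \tag{7}$$
The Maximum Likelihood Estimate $\hat{\mu}$ is the value of $\mu$ that maximizes this likelihood. In the regression setting, the likelihood function of the Poisson regression is written as
$$L(\beta_0, \beta) = \prod_{i=1}^{n} \frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!}, \qquad \mu_i = \exp(\beta_0 + x_i'\beta) \tag{8}$$
The log-likelihood function of Poisson regression is obtained by taking the logarithm of $L(\beta_0, \beta)$:
$$\ell(\beta_0, \beta) = \sum_{i=1}^{n} \left[ y_i(\beta_0 + x_i'\beta) - \exp(\beta_0 + x_i'\beta) - \log(y_i!) \right]$$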
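The maximization of the Poisson log-likelihood can be sketched numerically. The following is a minimal illustration, not the procedure used in this study (which relies on SAS): it assumes a single covariate plus an intercept, uses Newton-Raphson updates, and runs on hypothetical toy data. The names `poisson_loglik`, `fit_poisson`, `x_demo`, and `y_demo` are introduced here for illustration only.

```python
import math

def poisson_loglik(beta0, beta1, x, y):
    """Log-likelihood of a Poisson regression with log link and one covariate."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = beta0 + beta1 * xi                      # linear predictor log(mu_i)
        total += yi * eta - math.exp(eta) - math.lgamma(yi + 1)
    return total

def fit_poisson(x, y, iterations=25):
    """Newton-Raphson estimate of (beta0, beta1); a sketch, not a production fitter."""
    beta0, beta1 = math.log(sum(y) / len(y)), 0.0     # start at the intercept-only fit
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = math.exp(beta0 + beta1 * xi)
            g0 += yi - mu                             # score with respect to beta0
            g1 += (yi - mu) * xi                      # score with respect to beta1
            h00 += mu                                 # entries of the negative Hessian
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        beta0 += (h11 * g0 - h01 * g1) / det          # solve the 2x2 Newton system
        beta1 += (h00 * g1 - h01 * g0) / det
    return beta0, beta1

# Hypothetical toy data, not the mortality data of this study.
x_demo = [0.0, 1.0, 2.0, 3.0, 4.0]
y_demo = [1, 3, 4, 9, 15]
b0_hat, b1_hat = fit_poisson(x_demo, y_demo)
print(b0_hat, b1_hat, poisson_loglik(b0_hat, b1_hat, x_demo, y_demo))
```

At the maximum, the score equations hold, so the fitted means sum to the observed total count; this is a convenient check on any Poisson fit with an intercept.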

The Standard Negative Binomial Regression Model
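According to [1], in most applications the variance of the data is greater than the mean, a condition called overdispersion. Based on the study of [15], the Poisson regression model is inefficient when dealing with overdispersed data. According to [16], the Negative Binomial distribution is more flexible than the Poisson distribution for modeling overdispersed data because it has two parameters. In particular, Negative Binomial regression can model overdispersed counts. The Negative Binomial model can be derived as a Gamma mixture of Poisson models, starting from the conditional mean of the Poisson model,
$$E(y_i \mid x_i, \varepsilon_i) = \exp(\beta_0 + x_i'\beta + \varepsilon_i) = h_i\,\mu_i \tag{9}$$
where $h_i = \exp(\varepsilon_i)$. In the Poisson-Gamma mixture, $f(y_i \mid x_i, h_i)$ is the Poisson distribution while $h_i = \exp(\varepsilon_i)$ follows a Gamma distribution. The $h_i$ is assumed to follow a Gamma distribution with both parameters equal to $\theta$, so that $E(h_i) = 1$:
$$f(h_i) = \frac{\theta^{\theta}}{\Gamma(\theta)}\,h_i^{\theta - 1}\,e^{-\theta h_i}, \qquad h_i \ge 0,\ \theta > 0 \tag{10}$$
Once $h_i$ has been integrated out of the joint distribution, the marginal probability of the Negative Binomial distribution is obtained as follows:
$$P(Y_i = y_i \mid x_i) = \frac{\Gamma(\theta + y_i)}{\Gamma(y_i + 1)\,\Gamma(\theta)}\,r_i^{\theta}\,(1 - r_i)^{y_i}, \qquad r_i = \frac{\theta}{\theta + \mu_i} \tag{11}$$
The mean of the Negative Binomial is the same as in Poisson regression, $E(y_i \mid x_i) = \mu_i = \exp(\beta_0 + x_i'\beta)$, and the variance of the Negative Binomial is written as
$$\mathrm{Var}(y_i \mid x_i) = \mu_i\left(1 + \frac{\mu_i}{\theta}\right) = \mu_i(1 + \alpha\mu_i) \tag{12}$$
where $\alpha = \mathrm{Var}(h_i) = 1/\theta$. Moreover, the rate parameter of Negative Binomial regression, $\mu_i$, which is also known as its expected count, is written as
$$\mu_i = \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \tag{13}$$
The log-linear model for the mean rate of Negative Binomial regression is obtained by taking the logarithm of Equation (13):
$$\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \tag{14}$$
where $p$ is the number of predictors or covariates in the model, $\beta_0$ is the intercept of the regression, $\beta_1, \ldots, \beta_p$ are the regression coefficients, and the $x$'s are the independent variables.
[17] discussed the MLE of Negative Binomial regression for a random sample of $n$ subjects. In a standard Negative Binomial model, the dependent variable $y_i$ and the predictor variables $x_{i1}, x_{i2}, \ldots, x_{ip}$ are included. The predictor variables are combined to form the following matrix:
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \tag{15}$$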
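The mean and variance implied by the Gamma-Poisson derivation can be checked numerically. The sketch below assumes the standard NEGBIN 2 parameterization with mean $\mu$ and Gamma shape $\theta$ (so that the dispersion is $\alpha = 1/\theta$), and sums the probability mass function directly. The function names and the values $\mu = 5$, $\theta = 2$ are illustrative assumptions, not quantities from this study.

```python
import math

def nb2_logpmf(y, mu, theta):
    """Log pmf of the NEGBIN 2 form with mean mu and Gamma shape theta."""
    r = theta / (theta + mu)
    return (math.lgamma(theta + y) - math.lgamma(y + 1) - math.lgamma(theta)
            + theta * math.log(r) + y * math.log(1.0 - r))

def nb2_moments(mu, theta, upper=500):
    """Total mass, mean, and variance by summing the pmf over 0..upper-1."""
    probs = [math.exp(nb2_logpmf(y, mu, theta)) for y in range(upper)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var

total, mean, var = nb2_moments(mu=5.0, theta=2.0)
# The formulas give mean = mu = 5 and variance = mu * (1 + mu / theta) = 17.5.
print(total, mean, var)
```

The variance exceeds the mean for every finite $\theta$, which is what allows this model to absorb overdispersion that the Poisson model cannot.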
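The $i$-th row of $X$ is designated $x_i$, and with $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$ the mean $\mu_i$ in Equation (11) is replaced by $\exp(x_i\beta)$. With $\theta > 0$ denoting the Gamma parameter of the Negative Binomial model, the equation can be rewritten as
$$P(Y_i = y_i \mid x_i) = \frac{\Gamma(\theta + y_i)}{\Gamma(y_i + 1)\,\Gamma(\theta)} \left(\frac{\theta}{\theta + e^{x_i\beta}}\right)^{\theta} \left(\frac{e^{x_i\beta}}{\theta + e^{x_i\beta}}\right)^{y_i} \tag{16}$$
The likelihood function of the Negative Binomial is
$$L(\beta, \theta) = \prod_{i=1}^{n} \frac{\Gamma(\theta + y_i)}{\Gamma(y_i + 1)\,\Gamma(\theta)} \left(\frac{\theta}{\theta + e^{x_i\beta}}\right)^{\theta} \left(\frac{e^{x_i\beta}}{\theta + e^{x_i\beta}}\right)^{y_i} \tag{17}$$
and the log-likelihood function of Negative Binomial regression is obtained by taking the logarithm:
$$\ell(\beta, \theta) = \sum_{i=1}^{n} \left[ \ln\Gamma(\theta + y_i) - \ln\Gamma(y_i + 1) - \ln\Gamma(\theta) + \theta\ln\theta + y_i\,x_i\beta - (\theta + y_i)\ln\!\left(\theta + e^{x_i\beta}\right) \right] \tag{18}$$
[18] showed that the model in Equation (11) is the Negative Binomial 2 (NEGBIN 2) model. They re-parameterized the NEGBIN 2 model to obtain the Negative Binomial 1 (NEGBIN 1) specification, whose variance is $\mu_i(1 + 1/\theta)$. The marginal probability of NEGBIN 1 is obtained by replacing $\theta$ with $\theta\mu_i$ in Equation (11):
$$P(Y_i = y_i \mid x_i) = \frac{\Gamma(\theta\mu_i + y_i)}{\Gamma(y_i + 1)\,\Gamma(\theta\mu_i)} \left(\frac{\theta}{\theta + 1}\right)^{\theta\mu_i} \left(\frac{1}{\theta + 1}\right)^{y_i} \tag{19}$$
By replacing $\theta\mu_i$ with $\theta\mu_i^{2-P}$ in Equation (19), the Negative Binomial P (NEGBIN P) model is written as
$$P(Y_i = y_i \mid x_i) = \frac{\Gamma(\theta\mu_i^{2-P} + y_i)}{\Gamma(y_i + 1)\,\Gamma(\theta\mu_i^{2-P})} \left(\frac{\theta\mu_i^{2-P}}{\theta\mu_i^{2-P} + \mu_i}\right)^{\theta\mu_i^{2-P}} \left(\frac{\mu_i}{\theta\mu_i^{2-P} + \mu_i}\right)^{y_i} \tag{20}$$
which reduces to NEGBIN 1 for $P = 1$ and to NEGBIN 2 for $P = 2$.

Overdispersion
[19] noted that in almost every statistical study of count data, the dependent variable is assumed to follow the Poisson distribution, so the mean is assumed to be equal to the variance. However, in real data the variance is usually larger than the mean. [19] also stated that overdispersion indicates high variability around a model's fitted values in the Poisson formulation. This leads to the Negative Binomial model as a proposal to correct the problem.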
When the data are overdispersed, the variance is not the same as the mean; instead $\mathrm{Var}(Y) = \phi\mu$, where $\mu$ is the mean and $\phi$ is the dispersion parameter. If $\phi = 1$, the model is the ordinary Poisson model; if $\phi > 1$, the model is overdispersed. [20] stated that a distinctive property of the Poisson distribution, a member of the exponential family, is that the conditional variance equals the conditional mean. In the Poisson model, the dispersion parameter is therefore fixed at the constant value $\phi = 1$.
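How the dispersion $\phi = \mathrm{Var}(Y)/\mu$ behaves under each specification can be illustrated with a short sketch. It assumes the usual variance forms, $\mu(1 + \alpha)$ for NEGBIN 1 and $\mu(1 + \alpha\mu)$ for NEGBIN 2, written jointly as $\mu(1 + \alpha\mu^{P-1})$ with $\alpha = 1/\theta$; the function names and the value $\alpha = 0.5$ are illustrative assumptions.

```python
def nbp_variance(mu, alpha, P):
    """Variance mu * (1 + alpha * mu**(P - 1)); P=1 is NEGBIN 1, P=2 is NEGBIN 2."""
    return mu * (1.0 + alpha * mu ** (P - 1))

def dispersion_ratio(mu, alpha, P):
    """phi = Var(Y) / mu; equals 1 for the Poisson model (alpha = 0)."""
    return nbp_variance(mu, alpha, P) / mu

for mu in (10.0, 100.0):
    print(mu,
          dispersion_ratio(mu, 0.0, 2),   # Poisson: phi = 1
          dispersion_ratio(mu, 0.5, 1),   # NEGBIN 1: phi = 1 + alpha, constant in mu
          dispersion_ratio(mu, 0.5, 2))   # NEGBIN 2: phi = 1 + alpha * mu, grows with mu
```

The practical difference is that NEGBIN 1 keeps the variance proportional to the mean, while NEGBIN 2 lets the variance grow quadratically with the mean.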

Count Data
According to [16], count data indicate how many times or how frequently something happens. Furthermore, [18] stated that an event outcome is the number of times an event occurs, while an event count is a nonnegative random variable. Examples of count data include the number of patients hospitalized, the number of thieves arrested, and the number of natural disasters.
In some cases, count data have offset variables. [21] said that an offset variable is commonly handled within the generalized linear model (GLM) and count regression models. It is used whenever the data are recorded over an observed period, and the offset denotes that period in the GLM. More generally, an offset is defined as a measure of exposure. For example, the exposure can be the number of house-years and the response the number of claims incurred.
The log-linear mean rate for the Poisson regression and Negative Binomial models is
$$\log\!\left(\frac{\mu_i}{t_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \quad \text{equivalently} \quad \log(\mu_i) = \log(t_i) + \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \tag{21}$$
where $p$ is the number of predictors or covariates in the model, $\beta_0$ is the intercept of the regression, $\beta_1, \ldots, \beta_p$ are the coefficients of the regression, $x_{ij}$ is the $j$-th independent variable, $t_i$ is the period observed (exposure), $\log(t_i)$ is the offset variable, and $\mu_i/t_i$ is the rate. In this study, the interest is in modeling mortality data, which are count data. Poisson regression and Negative Binomial regression are generally appropriate for count data, and the aim is to find out which regression best fits the mortality data.
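The role of the offset can be made concrete with a small sketch: because $\log(t)$ enters the linear predictor with a coefficient fixed at one, the expected count scales proportionally with the exposure while the rate stays unchanged. The function name and all numeric values below are hypothetical and serve only as an illustration.

```python
import math

def expected_count(exposure, x, beta0, beta):
    """E(Y) = t * exp(beta0 + sum_j beta_j * x_j); log(t) acts as the offset."""
    eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return math.exp(math.log(exposure) + eta)

# Hypothetical coefficients and covariate values, for illustration only.
beta0, beta = 0.5, [0.1]
count_t1 = expected_count(1.0, [2.0], beta0, beta)
count_t2 = expected_count(2.0, [2.0], beta0, beta)

# Doubling the exposure doubles the expected count; the rate count / t is unchanged.
print(count_t1, count_t2, count_t1 / 1.0, count_t2 / 2.0)
```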

Modelling the Mortality Rate Data
Poisson regression and Negative Binomial regression are the main models studied in this research. Both the Poisson model and the Negative Binomial model are written as
$$\log(\mu(x)) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \tag{22}$$
where $p$ is the number of predictors or covariates in the model, $\beta_0$ is the intercept of the regression, $\beta_1, \ldots, \beta_p$ are the covariate coefficients, and $x_1, \ldots, x_p$ are the independent variables. Here $\mu(x)$ represents the number of people dying per time unit, and the function $\beta x$ describes how the death rate changes as a function of the subject covariates. For each coefficient, the null hypothesis states that the slope is equal to zero ($H_0: \beta_j = 0$), whereas the alternative hypothesis states that the slope is not equal to zero ($H_1: \beta_j \neq 0$).

Goodness-of-fit Test
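Deviance and Pearson's Chi-Square are used to check whether the data are overdispersed or under-dispersed. Each statistic divided by its degrees of freedom (df) should be approximately equal to one. Values greater than one indicate that the data are overdispersed. Goodness-of-fit is assessed using the PROC GENMOD statement in SAS. The deviance for the fitted Poisson regression and Negative Binomial regression is written as
$$D = 2\sum_{i=1}^{n}\left[\ell(y_i; y_i) - \ell(\hat{\mu}_i; y_i)\right] \tag{23}$$
and Pearson's Chi-Square is defined as
$$\chi^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)} \tag{24}$$
where $\ell(\mu; y)$ is the log-likelihood contribution of an observation $y$ with mean $\mu$, $V(\cdot)$ is the variance function of the model, and $\log(\hat{\mu}_i) = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip}$.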
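For the Poisson case, both statistics have simple closed forms, and the value/df check described above can be sketched as follows. This is a minimal illustration of the standard Poisson formulas, not a reproduction of the SAS output; the function names and the toy inputs are assumptions introduced here.

```python
import math

def poisson_deviance(y, mu):
    """D = 2 * sum[y*log(y/mu) - (y - mu)], taking y*log(y/mu) = 0 when y = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += term - (yi - mi)
    return 2.0 * total

def pearson_chi2(y, mu):
    """X^2 = sum (y - mu)^2 / mu, using the Poisson variance function V(mu) = mu."""
    return sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))

def value_per_df(statistic, n_obs, n_params):
    """Statistic divided by the residual degrees of freedom n - k."""
    return statistic / (n_obs - n_params)

# Hypothetical observed counts and fitted means, for illustration only.
y_obs = [0, 2, 5, 9]
mu_hat = [1.5, 2.0, 4.0, 8.5]
dev = poisson_deviance(y_obs, mu_hat)
chi2 = pearson_chi2(y_obs, mu_hat)
print(value_per_df(dev, len(y_obs), 2), value_per_df(chi2, len(y_obs), 2))
```

Ratios far above one, as reported for the Poisson fit in this study, signal overdispersion relative to the Poisson variance function.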

Anwar Fitrianto

Mortality Rate Data Models
The PROC GENMOD statement in SAS version 9.4 was used to run the Poisson regression analysis. At the 5% level of significance, all independent variables contributed significantly to the mortality rate, with the following estimated Poisson regression model (Table 1):
$$\log(\mu(x)) = .5834 + 0.0008x_1 + 0.0039x_2 + 0.0010x_3 + 0.004x_4 - 0.003x_5 - 0.0123x_6 + 0.0081x_7$$
The estimated Poisson model, along with the standard error and p value of each estimated coefficient, indicated that IHD, diarrheal disease, HIV/AIDS, malaria, malnutrition, road accidents, and suicides were all significant predictors of the mortality rate.
As an alternative to the Poisson regression model, the data were also analyzed using the Negative Binomial model. Table 2 displays the results of the analysis based on maximum likelihood estimation for the Negative Binomial regression. Under the Negative Binomial regression model, all independent variables except $x_2$ (diarrheal disease) and $x_5$ (malnutrition) contributed significantly to the mortality rate. These two variables had p values above the 5% level (0.0755 for diarrheal disease and 0.2750 for malnutrition). Hence, diarrheal disease and malnutrition were not significant predictors, while IHD, HIV/AIDS, malaria, road accidents, and suicides were. The fitted Negative Binomial regression model for the mortality rate data is written as
$$\log(\mu(x)) = 6.5602 + 0.0008x_1 + 0.0046x_2 + 0.0011x_3 + 0.0054x_4 - 0.0041x_5 - 0.0138x_6 + 0.0112x_7$$
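Because both models use a log link, each coefficient can be read as a multiplicative effect on the expected count: a one-unit increase in a predictor multiplies the expected count by $\exp(\beta_j)$, holding the other predictors fixed. The sketch below applies this to the Negative Binomial coefficients reported above; the dictionary layout and the function name `rate_ratio` are illustrative choices, not part of the original analysis.

```python
import math

# Coefficients of the fitted Negative Binomial model, as reported in the text.
nb_coefficients = {
    "IHD": 0.0008,
    "diarrheal disease": 0.0046,
    "HIV/AIDS": 0.0011,
    "malaria": 0.0054,
    "malnutrition": -0.0041,
    "road accidents": -0.0138,
    "suicides": 0.0112,
}

def rate_ratio(coefficient, delta=1.0):
    """Multiplicative change in the expected count for a delta-unit increase."""
    return math.exp(coefficient * delta)

ratios = {name: rate_ratio(b) for name, b in nb_coefficients.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```

A ratio above one corresponds to a positive coefficient and a ratio below one to a negative coefficient; whether an effect is statistically distinguishable from one still depends on the p values in Table 2.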

Descriptive Statistics of the Variables for Checking Overdispersion
When the variance of a variable is higher than its mean, the data are overdispersed. In this study, the dependent variable's mean and variance were 824.0061 and 105125.22, respectively, indicating overdispersion. Table 3 displays the means and variances of all the variables in the study. All the variables were overdispersed, and there was greater variability around the fitted values than Poisson regression allows, that is, $\mathrm{Var}(Y) = \phi\mu$ with $\phi > 1$. Consequently, the Negative Binomial regression was the better approach for modeling these overdispersed count data.
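The size of the gap between variance and mean can be worked out directly from the two reported figures. Note that this is the marginal variance-to-mean ratio of the dependent variable, which is a rough descriptive check rather than the conditional dispersion estimated after fitting a model; the variable names are illustrative.

```python
mean_y = 824.0061    # mean of the dependent variable, as reported in the text
var_y = 105125.22    # variance of the dependent variable, as reported in the text

# Under a Poisson distribution the ratio Var(Y) / E(Y) would be close to 1.
variance_to_mean = var_y / mean_y
print(round(variance_to_mean, 1))  # prints 127.6
```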

Goodness-of-fit Test for Poisson Regression and Negative Binomial Regression
The main purpose of the goodness-of-fit test is to determine the more appropriate model. Table 4 presents the deviance and Pearson's Chi-Square statistics, to check whether their values divided by the degrees of freedom are close to one. The value/df entries of the deviance and Pearson's Chi-Square for the Poisson model were 97.3008 and 91.5904, respectively, far higher than one. The Poisson model therefore did not describe the data correctly: there was much greater variability among the counts than would be expected under a Poisson distribution. This situation can arise because repeated subjects may not be independent. Another possible reason for the overdispersion is that experimental conditions are not under control and hence vary with uncontrolled factors. The table shows that the Negative Binomial regression was the better alternative for modeling the mortality rate. Its value/df entries for the deviance and Pearson's Chi-Square were 1.0785 and 0.8458, respectively, both much closer to one than the corresponding values for the Poisson regression model.

Comparison between Poisson Regression, Negative Binomial 1 and Negative Binomial 2
Comparisons between the three proposed models for the mortality data are given in Table 5. The AIC for Poisson regression was larger than for the other two models, and the AIC value for NEGBIN 1 was slightly smaller than the one for NEGBIN 2. This indicated that NEGBIN 1 was a better fit than Poisson regression and NEGBIN 2. The BIC values were 16501, 2345, and 2347 for Poisson, NEGBIN 1, and NEGBIN 2, respectively. The BIC value for Poisson regression was much higher than those of the Negative Binomial regressions. Thus, with the lowest AIC and BIC values, NEGBIN 1 was the better approach for the mortality rate data, since it can explain more variation with the same number of independent variables.
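The two criteria used in this comparison follow the standard definitions $\mathrm{AIC} = -2\ell + 2k$ and $\mathrm{BIC} = -2\ell + k\log n$, where $\ell$ is the maximized log-likelihood, $k$ the number of estimated parameters, and $n$ the number of observations. The sketch below uses $n = 163$ from this study, but the log-likelihood values and the parameter count $k = 9$ (seven slopes, an intercept, and a dispersion parameter) are hypothetical assumptions for illustration, not results from the analysis.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: -2 * loglik + 2 * k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian Information Criterion: -2 * loglik + k * log(n)."""
    return -2.0 * loglik + k * math.log(n)

n_countries = 163        # number of observations, as reported in the text
k_params = 9             # assumed: 7 slopes + intercept + dispersion parameter

# Hypothetical maximized log-likelihoods of two models with the same k.
loglik_a, loglik_b = -100.0, -101.0

print(aic(loglik_a, k_params) - aic(loglik_b, k_params))
print(bic(loglik_a, k_params, n_countries) - bic(loglik_b, k_params, n_countries))
```

When two models have the same number of parameters and are fitted to the same data, the penalty terms cancel, so the AIC and BIC differences both equal minus twice the difference in log-likelihood and the two criteria always rank the models identically.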

CONCLUSIONS
The analysis was conducted to compare the performance of three models: Poisson regression, NEGBIN 1, and NEGBIN 2. NEGBIN 1 was found to be the most appropriate model for these overdispersed data. The mean and the variance were first calculated to confirm that the data were overdispersed. Since they were, the deviance and Pearson's Chi-Square results showed that the Negative Binomial model fitted the data better than the Poisson model. Finally, the AIC and BIC values showed that NEGBIN 1 is the best model, followed by NEGBIN 2 and Poisson regression.