Bayesian Hurdle Poisson Regression for Assumption Violation
Nur Kamilah Sa'diyah

Violation of the Poisson regression assumptions can cause the fitted model to produce a biased estimator. The Bayesian method is well suited to estimating parameters from small samples and can be applied to any distribution. The data on the number of deaths due to chronic Filariasis violate the Poisson regression assumptions (overdispersion is present and the response variable does not follow a Poisson distribution), so they are modeled with Bayesian Hurdle Poisson regression. With the Bayesian method, convergence is fulfilled with 300000 iterations and a thinning interval of 7. In addition to presenting an alternative method for estimating the Hurdle Poisson regression parameters, the resulting model can be used by the Government to mitigate disease disasters through the prevention, control, and handling of Filariasis cases. The results showed that in the logit model only the percentage of households with access to proper sanitation in the 34 provinces of Indonesia had a significant effect on the number of deaths due to chronic Filariasis in the 34 provinces of Indonesia ($Y$). In the Truncated Poisson model, all predictor variables had a significant effect on the number of deaths due to chronic Filariasis.


INTRODUCTION
Poisson regression analysis rests on three important assumptions: the count response variable follows a Poisson distribution, there is no multicollinearity among the predictor variables, and equidispersion holds (the mean of the data is equal to its variance). However, in certain cases, the assumptions of Poisson distribution suitability and equidispersion are not fulfilled. This can cause the fitted model to produce a biased estimator [1].
Violations of equidispersion, most commonly overdispersion (variance greater than the mean), can be handled with the Zero-Inflated model or the Hurdle model. This study handles overdispersion with the Hurdle Poisson model because the Hurdle model has been found to perform better than the Zero-Inflated model [2]. The parameter estimation method most often used for the Hurdle Poisson model is Maximum Likelihood Estimation (MLE). However, MLE cannot estimate parameters reliably for small sample sizes and for certain distributions. The Bayesian method is a good alternative: it can estimate parameters from extremely small samples and can be used for all distributions [3].
The Bayesian method has previously been applied to overdispersed data to analyze the number of Filariasis sufferers in Papua Province, using the Bayesian Zero-Inflated Poisson model [4]. This study models data on the number of deaths from chronic Filariasis cases in Indonesia, which violate the assumptions of equidispersion and Poisson distribution suitability, with Bayesian Hurdle Poisson regression.
Filariasis, also known as elephant foot disease, is believed to have existed since before the Common Era: an ancient relief dated to 1501-1480 BC in the mortuary temple of Queen Hatshepsut in Thebes, Egypt, depicts the Princess of Punt suffering from Filariasis in her legs [5]. In Indonesia, Filariasis is an endemic disease (a disease that continues to infect certain regions). It was first reported by Haga and Van Eecke in 1889 in Jakarta, where it was caused by Brugia malayi [6]. The acute clinical symptoms of Filariasis include inflammation and swelling of the lymph channels accompanied by fever, headache, weakness, and the onset of abscesses or ulcers, while the chronic clinical symptom is persistent enlargement of the legs, arms, breasts, and genitals in women and men [7]. One effort to inhibit the transmission of Filariasis is the Mass Preventive Drug Delivery (MPDD) program for Filariasis, implemented by Filariasis-endemic districts/cities [5]. The success of the Filariasis control program can be assessed from the number of districts/cities that managed to reduce the microfilaria rate to <1% [8].
This study discusses the influence of the number of chronic Filariasis cases ($X_1$), the number of districts/cities that succeeded in reducing the microfilaria rate to <1% ($X_2$), the number of districts/cities still implementing the Mass Preventive Drug Delivery (MPDD) for Filariasis ($X_3$), population density ($X_4$), and the percentage of households with access to proper sanitation ($X_5$) on the number of deaths due to chronic Filariasis ($Y$), all measured in the 34 provinces of Indonesia. The results of this study can be used in two ways. (1) The Bayesian Hurdle Poisson regression model identifies the factors that affect the number of deaths from chronic Filariasis in Indonesia, and this information can support appropriate policy making by the central and local governments and related agencies to mitigate the chronic Filariasis disease disaster in Indonesia through the prevention, control, and handling of cases. (2) The Bayesian parameter estimation approach is useful in a range of challenging data situations, namely for any sample size, small or large, and for any distribution, following a data-driven concept.

METHODS
This study uses secondary data from the 2020 Indonesian Health Profile, namely data on chronic Filariasis cases in 2020 with five predictor variables and one response variable [9]. The first step is to test the Poisson regression assumptions (Poisson distribution suitability, non-multicollinearity, and overdispersion). The variables used in this study, each measured in the 34 provinces of Indonesia, are the number of chronic Filariasis cases ($X_1$), the number of districts/cities that succeeded in reducing the microfilaria rate to <1% ($X_2$), the number of districts/cities still implementing the Mass Preventive Drug Delivery (MPDD) for Filariasis ($X_3$), population density ($X_4$), the percentage of households with access to proper sanitation ($X_5$), and the number of deaths due to chronic Filariasis ($Y$).

Poisson Regression Assumption
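The suitability of the Poisson distribution was tested with the Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test statistic for testing the suitability of the Poisson distribution is presented in equation (1).
$$D = \sup_{y} \left| F_n(y) - F_0(y) \right| \tag{1}$$
Here $F_n$ is the empirical cumulative distribution function of the response variable and $F_0$ is the cumulative distribution function of the hypothesized Poisson distribution. $H_0$ is rejected if $D$ exceeds its critical value or if the p-value is < 0.05, in which case we conclude that the response variable does not follow a Poisson distribution. The non-multicollinearity assumption was tested with the Variance Inflation Factor (VIF) criterion. If the VIF exceeds 10, the non-multicollinearity assumption is not fulfilled [11]. The third assumption test is the overdispersion test. It is carried out by dividing the Pearson Chi-Square statistic by the residual degrees of freedom, based on formula (2).
$$\frac{\chi^2}{df} = \frac{1}{n - p} \sum_{i=1}^{n} \frac{\left( y_i - \hat{\lambda}_i \right)^2}{\hat{\lambda}_i} \tag{2}$$
Here $y_i$ is the observed count, $\hat{\lambda}_i$ is the fitted Poisson mean, $n$ is the number of observations, and $p$ is the number of estimated parameters.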
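The two count-based checks described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure run in the study: the data in `counts` are made up, the Poisson rate is estimated by the sample mean, the fitted means come from an intercept-only model, and the helper names (`poisson_cdf`, `ks_statistic_poisson`, `pearson_dispersion`) are hypothetical. The sketch is intended for moderate rates only, since `math.exp(-lam)` underflows for very large rates.

```python
import math

def poisson_cdf(k, lam):
    """P(Y <= k) for Y ~ Poisson(lam), summing the pmf term by term."""
    term = math.exp(-lam)          # P(Y = 0)
    total = term
    for j in range(1, k + 1):
        term *= lam / j            # P(Y = j) obtained from P(Y = j - 1)
        total += term
    return min(total, 1.0)

def ks_statistic_poisson(y):
    """Largest gap between the empirical CDF and a Poisson CDF
    whose rate is the sample mean (an assumption of this sketch)."""
    n = len(y)
    lam = sum(y) / n
    d = 0.0
    # Both CDFs are step functions that jump only at integers,
    # so it is enough to compare them at 0, 1, ..., max(y).
    for k in range(max(y) + 1):
        ecdf = sum(1 for v in y if v <= k) / n
        d = max(d, abs(ecdf - poisson_cdf(k, lam)))
    return d

def pearson_dispersion(y, fitted, n_params):
    """Pearson chi-square divided by the residual degrees of freedom."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, fitted))
    return chi2 / (len(y) - n_params)

# Made-up counts with many zeros, for illustration only.
counts = [0, 0, 0, 0, 1, 0, 12, 0, 3, 0]
mean = sum(counts) / len(counts)

d_stat = ks_statistic_poisson(counts)
phi = pearson_dispersion(counts, [mean] * len(counts), n_params=1)
print(f"D = {d_stat:.3f}, chi2/df = {phi:.3f}")
```

A ratio well above 1 points to overdispersion. Turning `D` into a p-value needs extra care for discrete data with an estimated rate, which this sketch does not attempt.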

Bayesian Method
Suppose $\theta$ denotes the parameters to be estimated. In the Bayesian method, the parameters are treated as random variables that take values in a parameter space. The prior distribution is the initial information used to form the posterior. When prior information is combined with the data, calculating the posterior becomes easier. In the Bayesian method, the posterior distribution is proportional to the product of the prior distribution and the likelihood function, as in equation (3) [13].
$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta) \tag{3}$$
The likelihood function of the Hurdle Poisson regression model is given in equation (4).
The prior distributions for the parameters of the logit model and of the Truncated Poisson model are assumed to be normal with mean $\mu$ and variance $\sigma^2$, in the form shown in equation (5).
The posterior distribution is obtained from the product of the likelihood function and the prior distribution, as presented in equation (6).
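For orientation, one common way to write the Hurdle Poisson model, normal priors, and the resulting posterior is sketched below. It is a standard textbook form given under assumptions, not a transcription of equations (4) to (6). In particular, $\pi_i$ is taken here to be the probability of a zero count (some authors use its complement), and the symbols $\lambda_i$, $\boldsymbol{\gamma}$, $\boldsymbol{\beta}$, and the shared hyperparameters $\mu$ and $\sigma^2$ are illustrative choices.

```latex
\begin{align*}
P(Y_i = y_i) &=
\begin{cases}
\pi_i, & y_i = 0,\\[4pt]
(1-\pi_i)\,\dfrac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!\,\bigl(1-e^{-\lambda_i}\bigr)}, & y_i = 1, 2, \dots
\end{cases}\\[6pt]
\operatorname{logit}(\pi_i) &= \mathbf{x}_i^{\top}\boldsymbol{\gamma},
\qquad
\ln \lambda_i = \mathbf{x}_i^{\top}\boldsymbol{\beta}\\[6pt]
\gamma_j &\sim N(\mu, \sigma^2),
\qquad
\beta_j \sim N(\mu, \sigma^2)\\[6pt]
p(\boldsymbol{\gamma}, \boldsymbol{\beta} \mid \mathbf{y})
&\propto
\prod_{i=1}^{n} P(Y_i = y_i \mid \boldsymbol{\gamma}, \boldsymbol{\beta})
\;\prod_{j} p(\gamma_j)\;\prod_{j} p(\beta_j)
\end{align*}
```

Because the zero part and the positive part enter the likelihood as separate factors, the logit coefficients and the Truncated Poisson coefficients can be interpreted separately.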
The posterior distribution of the Bayesian Hurdle Poisson Regression model parameters has a complex function and requires a difficult integration process, so it is not easy to obtain analytically. Therefore, a numerical approach is needed using the Markov Chain Monte Carlo (MCMC) simulation method [14].
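To make the idea of MCMC sampling concrete, the following sketch draws from the posterior of an intercept-only Hurdle Poisson model. It is an illustration under several assumptions and not the study's implementation: it uses a random-walk Metropolis sampler rather than Gibbs Sampling, the data are made up, `a` is assumed to be the logit of the zero probability, `b` is the log of the Truncated Poisson rate, and the prior standard deviation of 10 is an arbitrary choice.

```python
import math
import random

def log_posterior(a, b, y, prior_sd=10.0):
    """Log posterior (up to a constant) of an intercept-only hurdle Poisson.
    a: logit of P(Y = 0) (assumed convention); b: log of the Poisson rate."""
    lam = math.exp(b)
    log_pi = -math.log1p(math.exp(-a))        # log P(Y = 0)
    log_1m_pi = -math.log1p(math.exp(a))      # log P(Y > 0)
    log_trunc = math.log(-math.expm1(-lam))   # log(1 - exp(-lam))
    ll = 0.0
    for yi in y:
        if yi == 0:
            ll += log_pi
        else:
            ll += log_1m_pi + yi * b - lam - math.lgamma(yi + 1) - log_trunc
    # Independent N(0, prior_sd^2) priors on a and b (illustrative choice).
    log_prior = -0.5 * (a / prior_sd) ** 2 - 0.5 * (b / prior_sd) ** 2
    return ll + log_prior

def metropolis(y, n_iter=5000, step=0.3, seed=1):
    """Random-walk Metropolis sampler; returns a list of (a, b) draws."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    current = log_posterior(a, b, y)
    draws = []
    for _ in range(n_iter):
        a_new = a + rng.gauss(0.0, step)
        b_new = b + rng.gauss(0.0, step)
        proposed = log_posterior(a_new, b_new, y)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            a, b, current = a_new, b_new, proposed
        draws.append((a, b))
    return draws

# Made-up counts with many zeros, for illustration only.
counts = [0, 0, 0, 0, 1, 0, 12, 0, 3, 0]
draws = metropolis(counts)
kept = draws[1000:]                            # discard a burn-in period
mean_a = sum(d[0] for d in kept) / len(kept)
mean_b = sum(d[1] for d in kept) / len(kept)
print(f"posterior mean of a: {mean_a:.2f}, of b: {mean_b:.2f}")
```

The posterior means are simple averages of the retained draws, which mirrors how parameter estimates are obtained from generated samples. A regression version would replace `a` and `b` with linear predictors in the covariates.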

Bayesian Model Convergence Test
The convergence test methods consist of the trace plot, the autocorrelation plot, the ergodic mean plot, and the Monte Carlo Error (MC Error) [15]. Convergence is fulfilled if the trace plot does not form an ascending or descending pattern, the autocorrelation is close to one at the first lag and close to zero at subsequent lags, the ergodic mean plot is stable after several iterations, and the MC error is less than 5% of the standard deviation of each parameter.
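The numerical side of these checks can be sketched as follows. This is a hedged illustration, not the software output used in the study: the chain is a synthetic sequence of independent normal draws, the MC error is estimated with the batch-means method (one of several possible estimators), and all function names are hypothetical.

```python
import math
import random

def ergodic_means(chain):
    """Running mean of the chain after each iteration."""
    means, total = [], 0.0
    for t, x in enumerate(chain, start=1):
        total += x
        means.append(total / t)
    return means

def autocorrelation(chain, lag):
    """Sample autocorrelation of the chain at the given lag."""
    if lag == 0:
        return 1.0
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain)
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag))
    return cov / var

def sample_sd(chain):
    """Sample standard deviation of the draws."""
    n = len(chain)
    mean = sum(chain) / n
    return math.sqrt(sum((x - mean) ** 2 for x in chain) / (n - 1))

def mc_error_batch_means(chain, n_batches=50):
    """Batch-means estimate of the Monte Carlo standard error of the mean."""
    size = len(chain) // n_batches
    batch_means = [sum(chain[i * size:(i + 1) * size]) / size for i in range(n_batches)]
    grand = sum(batch_means) / n_batches
    var_bm = sum((m - grand) ** 2 for m in batch_means) / (n_batches - 1)
    return math.sqrt(var_bm / n_batches)

# Synthetic, well-mixed chain used only to exercise the functions.
rng = random.Random(0)
toy_chain = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

mc_err = mc_error_batch_means(toy_chain)
sd = sample_sd(toy_chain)
converged_by_mc_error = mc_err < 0.05 * sd   # the 5% rule stated above
lag1 = autocorrelation(toy_chain, 1)
final_mean = ergodic_means(toy_chain)[-1]
print(converged_by_mc_error, round(lag1, 3), round(final_mean, 3))
```

Plotting `toy_chain` against the iteration number gives a trace plot, and plotting `ergodic_means(toy_chain)` gives the ergodic mean plot.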

RESULTS AND DISCUSSION
The analysis begins with testing the Poisson regression assumptions, followed by parameter estimation for the Bayesian Hurdle Poisson regression.

Result of Poisson Regression Assumption Test
The first assumption in Poisson regression is that the count response variable follows a Poisson distribution. The Kolmogorov-Smirnov test, carried out in R, gave a p-value of less than $2.2 \times 10^{-16}$. This suggests that the response variable does not follow a Poisson distribution. A distribution fit was then performed with the EasyFit software, in which the Poisson distribution ranked third after the uniform and geometric distributions. Because Poisson regression is the most common regression model for count response variables and, to our knowledge, uniform regression and geometric regression have not been studied, this study still uses the Poisson regression model but estimates the parameters with the Bayesian method, which has the advantage of being applicable to all distributions.
The next assumption is non-multicollinearity. The results of the multicollinearity test with the VIF are presented in Table 1. Table 1 shows that the VIF of all predictor variables is less than 10, so it can be concluded that the non-multicollinearity assumption is fulfilled. The last assumption in Poisson regression is equidispersion. Overdispersion testing was carried out with $\chi^2/df$. The data are said to contain overdispersion if $\chi^2/df > 1$. Since $\chi^2/df = 212.549$, it can be concluded that the data contain overdispersion. Because two of the Poisson regression assumptions are not fulfilled, the parameters are estimated with the Bayesian Hurdle Poisson regression model.
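As an illustration of the multicollinearity check, the sketch below computes Variance Inflation Factors as the diagonal of the inverse of the predictor correlation matrix, a standard identity. It is a minimal pure-Python sketch under assumptions: the predictor values are made up, and the helper names are hypothetical rather than taken from the software used in the study.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    p = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(p)]
        for i in range(p)
    ]

def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Made-up predictors with correlation 0.8, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
vif_values = vif([x1, x2])
no_multicollinearity = all(v < 10 for v in vif_values)
print([round(v, 3) for v in vif_values], no_multicollinearity)
```

With two predictors the VIF reduces to $1/(1 - r^2)$, which gives a quick way to check the function by hand.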

Result of Bayesian Model Convergence Test
In the Bayesian method, parameters are generated using the Gibbs Sampling algorithm with 300000 iterations and a thinning interval of 7. Checking the convergence of the model parameters is important for verifying the accuracy of the Bayesian parameter estimates. Four methods are used to check the convergence of the parameters, namely (1) the trace plot, (2) the autocorrelation plot, (3) the ergodic mean plot, and (4) the MC error.
Figure 1 shows that the trace plot is random when 300000 iterations with a thinning interval of 7 are carried out. It can be concluded that the parameters have converged, so the iteration is stopped.
The second method used to check convergence is the autocorrelation plot. Figure 2 shows the autocorrelation plot for each parameter. In Figure 2, the first lag is close to one and the subsequent lags are close to zero, so the convergence of the parameters is fulfilled.
The third method used to check convergence is the ergodic mean plot. Convergence is fulfilled if the ergodic mean plot is stable after several iterations. Figure 3 shows the ergodic mean plot for each parameter. In Figure 3, the ergodic mean plot is stable after 300000 iterations with a thinning interval of 7, so it can be concluded that the parameters have converged.
In addition to plots, convergence can be checked by comparing the MC error with 5% of the standard deviation of each parameter. The MC errors for the parameters of the Bayesian Hurdle Poisson regression model are presented in Table 2. In Table 2, the MC error of every parameter is less than 5% of its standard deviation, so convergence is met. All four methods give the same result, namely that convergence is fulfilled when 300000 iterations with a thinning interval of 7 are performed.
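The effect of thinning on the number of retained draws can be illustrated with a short calculation. This sketch assumes that the 300000 iterations are counted before thinning and that every 7th draw is kept starting from the first one. The text does not state which convention the sampling software uses, so the retained count below is illustrative only.

```python
def thin(chain, interval):
    """Keep every `interval`-th draw, starting from the first one."""
    return chain[::interval]

n_iterations = 300_000   # iterations reported in the text
thin_interval = 7        # thinning interval reported in the text

# Indices of the draws that would be retained under the stated assumption.
retained_indices = range(0, n_iterations, thin_interval)
n_retained = len(retained_indices)

# Small demonstration on a toy chain of 20 consecutive integers.
toy_chain = list(range(20))
toy_thinned = thin(toy_chain, thin_interval)

print(n_retained, toy_thinned)
```

Thinning lowers the autocorrelation between stored draws at the cost of discarding samples, which is why it is often reported together with the autocorrelation plot.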

Parameter Estimation Results of Bayesian Hurdle Poisson Regression Model
After convergence is fulfilled, the parameter estimates are calculated from the samples generated by Gibbs Sampling. Each parameter estimate is the mean of the generated samples for that parameter, as shown in Table 3. The Bayesian model parameters are tested using a credible interval, with the 2.5% percentile as the lower limit and the 97.5% percentile as the upper limit. If the interval contains zero, the decision is to accept $H_0$, meaning that the $j$-th predictor variable has no significant effect on the response variable. Based on Table 3, the Bayesian Hurdle Poisson Regression model can be presented as follows:
$$\operatorname{logit}(\hat{\pi}) = 16.6551 - 0.1922 X_5 \tag{6}$$
$$\ln \hat{\lambda} = -4.2404 - 0.0027 X_1 + 0.1500 X_2 + 0.5121 X_3 - 0.0004 X_4 + 0.0805 X_5 \tag{7}$$
The interpretation of the logit model in equation (6) is that every 1% increase in the percentage of households with access to proper sanitation in the 34 provinces of Indonesia ($X_5$) multiplies the odds $\hat{\pi}/(1-\hat{\pi})$ in the logit model for the number of deaths due to chronic Filariasis in the 34 provinces of Indonesia by exp(-0.1922) = 0.825.
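The percentile-based significance check can be sketched as follows. It is a minimal illustration under assumptions: the draws are made-up numbers rather than output from the fitted model, percentiles are computed with linear interpolation (sampling software may use a slightly different rule), and the function names are hypothetical.

```python
def percentile(sorted_values, q):
    """Percentile q (0 to 100) of pre-sorted values, by linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def credible_interval(draws, level=95.0):
    """Equal-tailed interval: (2.5%, 97.5%) percentiles when level is 95."""
    ordered = sorted(draws)
    tail = (100 - level) / 2
    return percentile(ordered, tail), percentile(ordered, 100 - tail)

def is_significant(draws, level=95.0):
    """A parameter is flagged as significant when the interval excludes zero."""
    lower, upper = credible_interval(draws, level)
    return not (lower <= 0.0 <= upper)

# Made-up posterior draws, for illustration only.
draws_away_from_zero = [i / 100 for i in range(1, 101)]     # 0.01 ... 1.00
draws_around_zero = [i / 100 for i in range(-50, 51)]       # -0.50 ... 0.50

print(credible_interval(draws_away_from_zero), is_significant(draws_away_from_zero))
print(credible_interval(draws_around_zero), is_significant(draws_around_zero))
```

Applied to each parameter's retained draws, the same check reproduces the decision rule described in the text: an interval that contains zero leads to accepting $H_0$.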
The interpretation of Poisson's truncated model in equation (7)
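Coefficients on the logit and log scales are usually read after exponentiation. The sketch below does this for the estimates reported in equations (6) and (7). It is an illustration only: the dictionary layout and the helper `linear_predictor` are hypothetical, the predictor values passed to it are made up, and the exponentiated count coefficients describe multiplicative changes in the rate parameter of the Truncated Poisson part, which is not identical to the mean of the positive counts.

```python
import math

# Posterior mean estimates as reported in equations (6) and (7).
logit_coefs = {"intercept": 16.6551, "X5": -0.1922}
count_coefs = {
    "intercept": -4.2404,
    "X1": -0.0027,
    "X2": 0.1500,
    "X3": 0.5121,
    "X4": -0.0004,
    "X5": 0.0805,
}

def exponentiate(coefs):
    """exp(coefficient) for every predictor, skipping the intercept."""
    return {name: math.exp(value) for name, value in coefs.items() if name != "intercept"}

def linear_predictor(coefs, x):
    """Intercept plus the sum of coefficient times predictor value."""
    return coefs["intercept"] + sum(coefs[name] * value for name, value in x.items())

odds_ratios = exponentiate(logit_coefs)   # multiplicative change in the odds
rate_ratios = exponentiate(count_coefs)   # multiplicative change in the rate

# Made-up predictor values for one hypothetical province.
example = {"X1": 100.0, "X2": 5.0, "X3": 3.0, "X4": 150.0, "X5": 80.0}
log_rate = linear_predictor(count_coefs, example)
logit_value = linear_predictor(logit_coefs, {"X5": example["X5"]})

for name, ratio in rate_ratios.items():
    print(f"{name}: exp(coef) = {ratio:.4f}")
print(f"X5 odds ratio = {odds_ratios['X5']:.3f}")
```

A ratio above 1 indicates that a one-unit increase in the predictor raises the corresponding odds or rate, and a ratio below 1 indicates that it lowers them.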

CONCLUSIONS
In the logit model, the percentage of households with access to proper sanitation in the 34 provinces of Indonesia ($X_5$) has a significant effect on the number of deaths due to chronic Filariasis in the 34 provinces of Indonesia ($Y$). In the Truncated Poisson model, all predictor variables have a significant effect on the number of deaths due to chronic Filariasis in the 34 provinces of Indonesia ($Y$): the number of chronic Filariasis cases ($X_1$), the number of districts/cities that managed to reduce the microfilaria rate to <1% ($X_2$), the number of districts/cities still implementing the Mass Preventive Drug Delivery (MPDD) for Filariasis ($X_3$), population density ($X_4$), and the percentage of households with access to proper sanitation ($X_5$), each measured in the 34 provinces of Indonesia.