LIVER DISEASE CLASSIFICATION ANALYSIS USING THE XGBOOST METHOD

Liver disease is a severe pathological condition in which the liver becomes inflamed due to viral infection, toxic agents, or bacterial invasion, interfering with normal liver function. The death rate from this disease reaches 1.2 million people annually in Southeast Asia and Africa. Liver disease damages the liver and negatively affects overall body function. To slow disease progression, early diagnosis is critical, since it enables rapid initiation of treatment for affected individuals. Classification methods are widely used to make decisions about new data based on patterns learned from previously processed data. This study uses the XGBoost classification method to build a predictive model for liver disease. The results confirm that the XGBoost model is a robust and efficient choice for liver disease classification based on patient data. The XGBoost approach classified liver disease with an accuracy of 95% and a balanced accuracy of 95%, demonstrating the effectiveness of this method in handling class imbalance in liver disease classification data.


INTRODUCTION
Liver disease is a significant pathological condition affecting the liver, characterized by liver inflammation due to viral infection, exposure to toxic agents, or bacterial invasion, which interferes with normal liver function [1] [2]. According to data published by the World Health Organization (WHO), around 1.2 million people in the Southeast Asia and Africa regions die from this disease each year [3]. Liver disease damages the liver, causing a gradual decline in the body's ability to function correctly.
Liver disease can be caused by various factors, such as congenital liver defects, viral or bacterial infections, alcohol addiction, smoking, and unhealthy lifestyle choices [4] [5]. One of the main functions affected is the liver's ability to neutralize toxins that enter the body. Left untreated, the disease can harm the body [6]. To slow disease progression, early diagnosis is essential, since it enables rapid initiation of treatment for individuals with liver disease [7].
Classification methods are widely used to make decisions about new data based on patterns learned from previously processed data [8]. XGBoost is the method used to build the liver disease classification model in this study. XGBoost (Extreme Gradient Boosting) is a popular and effective machine learning algorithm for tasks such as classification and regression.
One of the main advantages of XGBoost is its strong performance, which often exceeds that of many other machine learning methods in competitions and prediction tasks [9]. In addition, XGBoost is optimized for efficiency and scalability, so it works well on large data sets with many features [10]. XGBoost can also handle non-linear relationships between features and targets and offers various options for addressing class imbalance in classification.
To provide references and comparative material, the researchers reviewed several previous studies on methods for predicting inflammatory liver disease [11][12]. Both studies used the Decision Tree (C4.5) method for classification, which produced an accuracy of 70.29% and an AUC value of 0.714; Naïve Bayes produced an accuracy of 67.05% and an AUC value of 0.757, while SVM obtained an accuracy of 78%.
Based on this background, developing a method for classifying liver disease is necessary. Hence, the researchers conducted a study entitled "Analysis of Liver Disease Classification Using the XGBoost Method."

Research Methods
The liver is a vital organ in the human body; it is responsible for converting harmful compounds into essential nutrients, which are then used by the body to regulate hormone levels within its physiological framework [13]. The liver also plays a vital role in synthesizing hormones and proteins, regulating blood glucose levels, and facilitating hemostasis [14].
Classification is a fundamental machine learning procedure that involves developing models capable of distinguishing between classes. The model is built by identifying the traits or characteristics that distinguish one class from another [15]. The classification process builds a model from existing training data, which is then used to classify newly acquired data. Classification can be defined as training a model on a target function that maps a set of properties (also known as features) to one of the available class labels [16].
XGBoost, also known as Extreme Gradient Boosting, is a machine learning technique based on the principles of gradient boosting. Classification models built with boosting techniques use the errors of previous models to produce the next model [17]. The algorithm used in this context is called gradient boosting, which uses gradient descent to reduce errors as each new model is constructed.
XGBoost is a machine learning algorithm that belongs to the ensemble tree family, specifically using classification and regression trees (CART) [18]. Boosting methods use iterative procedures aimed at improving classification performance. The ensemble tree approach, in turn, combines multiple decision trees to produce better performance than a single decision tree.
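The boosting idea described above can be illustrated with a minimal sketch. It is not XGBoost itself: it assumes squared-error loss, a single numeric feature, and one-split regression stumps, and the names fit_stump and boost, the toy data, and the learning rate are all illustrative choices. Each round fits a small tree to the current residuals (the negative gradient of the squared error) and adds a scaled copy of it to the ensemble.

```python
# Minimal gradient boosting for squared loss with one-split regression stumps.
# Illustrative only: XGBoost adds regularization, second-order gradient
# information, and many optimizations that are not shown here.

def fit_stump(xs, residuals):
    """Find the threshold on a 1-D feature that best splits the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Each round fits a stump to the errors left by the previous rounds."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: two plateaus with small noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

baseline = lambda x: sum(ys) / len(ys)   # a single constant prediction
model = boost(xs, ys)
print(mse(baseline, xs, ys), mse(model, xs, ys))
```

On this toy data the boosted ensemble has a much lower training error than the constant baseline, which is the sense in which combining many weak trees outperforms a single simple model.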

Research Workflow
A workflow diagram was defined so that the research would run well and could be completed on time. The research flowchart is shown in Figure 1.

Data Acquisition
Data acquisition refers to a systematic procedure of collecting data from various sources to carry out further analysis or processing activities. The process involves acquiring data from multiple sources, including databases, sensors, hardware, websites, and other relevant data sources.

Data Preprocessing
Data preprocessing refers to the sequence of procedures executed before data is used for further analysis or processing. The main goal of data preprocessing is to clean, organize, and filter the data so that it meets the specific requirements of the subsequent analysis or processing task.

XGBoost Model Building
XGBoost model construction involves developing predictive or classification models with the XGBoost algorithm. XGBoost, or Extreme Gradient Boosting, is a powerful and efficient machine learning algorithm for various tasks, including classification, regression, and ranking.

Accuracy Result
Result accuracy, also known as prediction accuracy or classification accuracy, refers to the extent to which a constructed prediction or classification model gives correct results. Accuracy metrics are often used to assess a model's ability to predict or classify data correctly.

Data Acquisition
The data were obtained from the Kaggle public data platform [19]. The dataset covers liver disease patients and has 483 rows and 11 columns: Age, Gender, Total_Bilirubin, Direct_Bilirubin, Alkaline_Phosphotase, Alamine_Aminotransferase, Aspartate_Aminotransferase, Total_Protiens, Albumin, Albumin_and_Globulin_Ratio, and Liver_Disease. The stages of data acquisition can be seen in Figure 2:

Data Processing Flow
To keep the study on track, the researchers defined a data processing flow in advance. This flow is shown in Figure 3 below:

Data Preprocessing
The data obtained then enters the preprocessing stage, where the dataset used to build and train the machine learning model is prepared. This stage also aims to improve the dataset's quality. The steps are carried out as follows:

Data Manipulation
The data manipulation stage translates the English column names in the dataset into Indonesian so that the contents of the dataset are easier to understand. The process can be seen in Figure 4 below:
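A column rename of this kind might look like the following sketch, assuming the data is held in a pandas DataFrame. The Indonesian names in column_map are hypothetical examples, since the names actually used in the study are not listed, and the two-row frame is toy data.

```python
import pandas as pd

# Toy frame using two of the dataset's English column names.
df = pd.DataFrame({"Age": [65, 62], "Gender": ["Female", "Male"]})

# Hypothetical Indonesian names; the study's actual names are not given.
column_map = {"Age": "Usia", "Gender": "Jenis_Kelamin"}

# rename returns a new frame; columns missing from the map are left unchanged.
df = df.rename(columns=column_map)
print(list(df.columns))
```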

Missing Values Handling
This stage checks for missing data in the dataset. Missing values can be filled with the mean, median, or maximum value of the column; in this study, they were filled with the mean. The stages of the missing values handling process can be seen in Figure 5 below:
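Mean imputation as described here can be sketched as follows, assuming a pandas DataFrame. The values are made-up toy data rather than rows from the study's dataset, and which columns actually contain gaps is not stated in the text.

```python
import numpy as np
import pandas as pd

# Toy data with one missing entry (values are illustrative only).
df = pd.DataFrame({
    "Albumin": [3.3, 3.2, 3.4, 2.4],
    "Albumin_and_Globulin_Ratio": [0.9, np.nan, 1.0, 0.5],
})

# Count the missing entries per column before filling.
missing_before = df.isna().sum()
print(missing_before)

# Fill each numeric column's gaps with that column's own mean.
column_means = df.mean(numeric_only=True)
df = df.fillna(column_means)
print(df)
```

Replacing `mean` with `median` or `max` gives the other two options mentioned in the text.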

Feature Engineering
At the feature engineering stage, the data are encoded into two classes, 0 and 1, taken from the liver disease column, where 0 denotes the "no" class (no liver disease) and 1 denotes the "yes" class (liver disease). The feature engineering stages can be seen in Figure 7 below:
The results show an imbalance between class 0 and class 1, commonly referred to as imbalanced data: class 1 has 399 records, while class 0 has 133. To handle the imbalance, both classes are resampled to 500 records each. Data resampling can be seen in Figure 8 below:
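The resampling step can be sketched as below. The text does not state which resampling routine was used, so this sketch assumes sampling with replacement via pandas; the toy frame only reproduces the reported class counts, and the seed value is an arbitrary choice for reproducibility.

```python
import pandas as pd

# Toy stand-in that reproduces the class counts reported in the text.
df = pd.DataFrame({
    "Age": list(range(532)),
    "Liver_Disease": [1] * 399 + [0] * 133,
})

TARGET_SIZE = 500  # records per class after resampling

# Draw TARGET_SIZE rows from each class; replace=True allows a class
# with fewer than TARGET_SIZE rows to be oversampled.
balanced = pd.concat(
    [
        group.sample(n=TARGET_SIZE, replace=True, random_state=101)
        for _, group in df.groupby("Liver_Disease")
    ],
    ignore_index=True,
)
print(balanced["Liver_Disease"].value_counts())
```

After this step both classes contain 500 rows, for 1,000 rows in total.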

Unused Column Removal
The final step in the data preparation process is the deletion of columns that are not used in the classification model. The column removed from the dataset is Gender. The steps involved in deleting unused columns can be seen in Figure 9 below:

Exploratory Data Analysis
This study carried out a series of exploratory data analysis (EDA) stages, examining the distribution of liver disease cases by Age, Gender, Total Bilirubin (mg/dL), Direct Bilirubin (mg/dL), Alkaline Phosphatase (IU/L), Alanine Aminotransferase (IU/L), Aspartate Aminotransferase (U/L), Total Proteins, Albumin, Albumin and Globulin Ratio, and Liver Disease. Figure 10 provides an overview of these stages.

Making the Determinant Variable (X)
This stage initializes the variable X, which serves as the set of independent variables for the machine learning model. X includes the dataset's relevant columns, 9 in total. These stages can be seen in Figure 11 below:

Creating Variables (Y)
This stage creates the target variable, named Y, which the model is trained to predict and against which accuracy is measured. The "Liver Disease" column is used as the target variable for both training and testing. These stages can be seen in Figure 12 below:
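Separating the feature matrix from the target can be sketched as follows, assuming a pandas DataFrame whose target column is named Liver_Disease. The three rows and the reduced set of feature columns are toy data for illustration.

```python
import pandas as pd

# Toy frame with a few feature columns and the target column.
df = pd.DataFrame({
    "Age": [65, 62, 58],
    "Total_Bilirubin": [0.7, 10.9, 1.0],
    "Albumin": [3.3, 3.2, 3.4],
    "Liver_Disease": [1, 1, 0],
})

TARGET = "Liver_Disease"

X = df.drop(columns=[TARGET])  # independent variables (features)
y = df[TARGET]                 # target variable (class labels)

print(X.shape, y.shape)
```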

Distribution of Train and Test Data
The purpose of this process is to divide the dataset into two parts. In this study, the data was split into 80% for training and 20% for testing, with the random state (the seed that controls shuffling) set to 101 so that the split is reproducible. These stages can be seen in Figure 13 below:
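An 80/20 split with a fixed random state can be sketched with scikit-learn's train_test_split. The feature matrix here is synthetic, sized to match the 1,000 resampled rows and 9 feature columns described earlier; using scikit-learn for the split is an assumption, since the text does not name the function.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows (500 per class) with 9 feature columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))
y = np.array([0] * 500 + [1] * 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)
print(X_train.shape, X_test.shape)  # (800, 9) (200, 9)

# The same seed reproduces the same split on a second call.
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=101
)
print(np.array_equal(X_test, X_test_again))  # True
```

The random state is a seed rather than a repetition count: it fixes the shuffle so that repeated runs produce identical training and test sets.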

Model Accuracy
The XGBoost Classifier algorithm is applied to create a classification model for identifying liver disease, with the aim of achieving optimal accuracy.

XGBoost Accuracy
This study uses XGBoost with its default settings, through the XGBoost Classifier provided by the library.

Figure 14 Import Library XGBoost
Based on the model that was built, the accuracy of the XGBoost algorithm is 95%. The accuracy results of the XGBoost algorithm can be seen in Figure 15 below:

Model Evaluation
The confusion matrix for the XGBoost algorithm shows 97 true positives and 2 false positives (99 cases in total), and 93 true negatives and 8 false negatives (101 cases in total). These results can be seen in Figure 16 below:
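The reported accuracy and balanced accuracy can be recomputed from these four counts. The sketch below assumes the counts are exactly as labelled in the text, and takes balanced accuracy to be the mean of sensitivity and specificity.

```python
# Confusion matrix counts as reported in the text.
tp, fp, tn, fn = 97, 2, 93, 8

total = tp + fp + tn + fn                 # 200 test cases
accuracy = (tp + tn) / total              # share of correct predictions
sensitivity = tp / (tp + fn)              # recall on the positive class
specificity = tn / (tn + fp)              # recall on the negative class
balanced_accuracy = (sensitivity + specificity) / 2

print(f"accuracy          = {accuracy:.3f}")           # 0.950
print(f"balanced accuracy = {balanced_accuracy:.3f}")  # 0.951
```

Both values round to 95%, and the total of 200 test cases matches a 20% test split of 1,000 resampled rows.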

CONCLUSION
The XGBoost approach was applied successfully to the classification of liver disease, producing a predictive model with excellent performance. The XGBoost model achieves a high level of accuracy, with 95% accuracy and 95% balanced accuracy. In addition, the model handles class imbalance in the data set effectively. These results suggest that, using available patient data, the XGBoost technique is a robust and efficient option for liver disease classification.

Figure 1 Research Diagram

Figure 2 Data Acquisition

Figure 3 Data Processing Flowchart

Figure 4 Data Manipulation

Figure 5 Missing Value Handling

Figure 6 Duplicate Removal

Figure 7 Feature Engineering

Figure 9 Remove Unused Columns

Figure 10 Exploratory Data Analysis

Figure 11 Making Variable X

Figure 12 Creating Variable Y

Figure 13 Distribution of Train and Test

Figure 15 XGBoost Accuracy