XGBoost for IDS on WSN Cyber Attacks

XGBoost for IDS on WSN Cyber Attacks with Imbalanced Data
Conference Paper · November 2022
DOI: 10.1109/ISESD56103.2022.9980630
XGBoost for IDS on WSN Cyber Attacks with
Imbalanced Data
Aji Gautama Putrada
Advanced and Creative Networks
Research Center
Telkom University
Bandung, Indonesia
ajigps@telkomuniversity.ac.id
Nur Alamsyah
Advanced and Creative Networks
Research Center
Telkom University
Bandung, Indonesia
[email protected]versity.ac.id
Syafrial Fachri Pane
Advanced and Creative Networks
Research Center
Telkom University
Bandung, Indonesia
[email protected]versity.ac.id
Mohamad Nurkamal Fauzan
Advanced and Creative Networks
Research Center
Telkom University
Bandung, Indonesia
[email protected]versity.ac.id
Abstract—A wireless sensor network (WSN) is vulnerable to cyber attacks, just like other systems connected to a computer network, which makes the intrusion detection system (IDS) for WSN an interesting research topic. However, IDS datasets are usually imbalanced because attacks occur at a low frequency. This study proposes the application of XGBoost in an IDS for WSN cyber attacks with imbalanced data. We obtained the WSN attack dataset from Kaggle, which contains data on blackhole, grayhole, flooding, and scheduling attacks. We use decision trees and naive Bayes to benchmark the performance of our proposed method. We then use precision, recall, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) to evaluate our IDS model. The test results show that three classes are moderately imbalanced, while one class, the flooding attack class, is severely imbalanced. Compared to the two benchmark methods, decision tree and naive Bayes, XGBoost has the best AUC for the scheduling, normal, grayhole, flooding, and blackhole classes, with values of 0.987, 0.9963, 0.9994, 0.9997, and 0.9999, respectively.
Index Terms—intrusion detection system, wireless sensor network, extreme gradient boosting, data imbalance
I. INTRODUCTION
Wireless sensor networks (WSN) are an emerging topic. As the name suggests, a WSN is a set of sensors that are spread out and connected to a computer network to monitor certain values in the environment where they are deployed [1]. WSN research covers optimization of network topology [2], optimization of cluster head selection [3], and optimization of routing [4]. WSN application areas include agriculture [5] and gas and fire detection [6]. Because it connects to a computer network, a WSN is also vulnerable to cyber attacks, so the intrusion detection system (IDS) for WSN is also a concern [7].
Thank you to the Directorate of Research and Community Service (PPM)
Telkom University for funding this research.
An IDS can use several machine learning methods as detection methods in WSN. Gite et al. [8] implemented a decision tree on WSN to detect blackhole, wormhole, grayhole, and distributed denial of service (DDoS) attacks with an accuracy of 70%. Mehmood et al. [9] built an IDS to detect DDoS flooding on WSN with naïve Bayes. However, IDS datasets are usually imbalanced because attacks usually occur at a low frequency [10].
Several studies use extreme gradient boosting (XGBoost) as the detection method on imbalanced data [11]. Qiu et al. [12] applied XGBoost to credit card fraud detection and showed that XGBoost performs better than other methods on imbalanced data. Applying XGBoost to imbalanced data in an IDS for WSN is a research opportunity.
This study proposes the application of XGBoost in an IDS for WSN cyber attacks with imbalanced data. We obtain the WSN attack dataset from Kaggle, which contains data on blackhole, grayhole, flooding, and scheduling attacks. We use decision trees and naive Bayes to benchmark the performance of our proposed method. We use precision, recall, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) to evaluate our IDS model.
To the best of our knowledge, there has never been a study
that has applied XGBoost for IDS on WSN cyber attacks that
have imbalanced data. Here are our research contributions:
1) a fast IDS for WSN with an optimized prediction model
2) a novel IDS concept using edge computing
3) a model that gives the best results for scheduling attack detection
The remainder of this paper is organized as follows: Section II discusses related works. Section III presents our research design. Section IV reports the test results and discusses them against state-of-the-art papers. Finally, Section V emphasizes the important findings of this study.
979-8-3503-9660-7/22/$31.00 ©2022 IEEE
II. RELATED WORKS
Several studies have applied IDS to WSN. Sunder et al. [13] applied the Jensen–Shannon divergence method to blackhole attacks on WSN in healthcare and achieved a detection rate of up to 97%. Lakshmi et al. [14] simulated a flooding attack on WSN using ad hoc on-demand distance vector (AODV) routing and resisted the attack using threshold restrictions, so that one node can only send a limited number of packets at a time. If a node violates the threshold, the system blacklists it. Ye et al. [15] detected grayhole attacks on WSN using fuzzy logic, and their research succeeded in increasing the detection accuracy by 4.5 times for 125 grayhole attacks. Finally, Shahid et al. [16] created a scheme called cellular automata energy drainage prevention (CA-EDP), which can detect scheduling attacks and increase WSN lifespan by up to 11%. We identify the research gap as an XGBoost method for an IDS that can detect blackhole, flooding, grayhole, and scheduling attacks on WSN. Table I compares studies related to our method.
III. RESEARCH DESIGN
Fig. 1 shows our research methodology. First, we took the
WSN-DS dataset from Kaggle and then observed and analyzed
the dataset. The second step is to train our XGBoost using that
dataset. Third, we compare the performance of our XGBoost
with benchmark methods, namely decision tree and naive
Bayes. Finally, we report the test results.
A. IDS for WSN
We retrieved the WSN-DS dataset by Almomani et al. [17] from Kaggle. The dataset results from a WSN with the low-energy adaptive clustering hierarchy (LEACH) routing protocol. The WSN has 100 nodes, and network simulator 2 (NS-2) simulates it for 14 rounds. These rounds produce up to 7 clusters. Table II summarizes the simulation specifications.
In LEACH, the cluster head (CH) forwards data from the WSN nodes to the base station (BS). Four types of attacks target the simulated WSN: blackhole, flooding, grayhole, and scheduling attacks. In all four, the attacker attempts to become a fraudulent CH and then launches a different attack.
TABLE I: Related Works Comparison on IDS for WSN
Cite             IDS for WSN
                 XGBoost   B, G, F, S (a)
[13]             ✗         ✗
[14]             ✗         ✗
[15]
[16]
Proposed Method  ✓         ✓
(a) B = Blackhole, G = Grayhole, F = Flooding, S = Scheduling.
Fig. 1: The IDS on WSN research methodology.
Grayhole and blackhole are similar attacks. The difference
is that the grayhole drops packets with a certain probability,
which makes grayhole attacks more challenging to detect [18].
Table III summarizes the descriptions of the four attacks.
The WSN-DS dataset contains 374,661 records and 21 features. The following is an explanation of each feature:
1) id: Unique name for each WSN node
2) Time: Timestamp of the measured data
3) Is_CH: Flag indicating whether the WSN node is a CH or not
4) Who_CH: Node id of the CH of a WSN node
5) RSSI: Signal strength between the WSN node and its CH
6) Dist_to_CH: Physical distance between the WSN node and its CH
7) MD_CH: Maximum distance between a WSN node and the CH in a cluster
8) AD_CH: Average distance of all WSN nodes to the CH in a cluster
9) ADV_S: Number of advertised broadcast messages sent from CH to WSN node
10) ADV_R: Number of advertised broadcast messages received by WSN nodes from CH
11) JOIN_S: Number of join request messages sent from WSN node to CH
12) JOIN_R: Number of join request messages received by a CH from a WSN node
13) ADV_SCH_S: Number of advertising scheduling messages sent by a CH
14) ADV_SCH_R: Number of advertising scheduling messages that a WSN node receives from a CH
15) Rank: The order of the WSN nodes in the scheduling message
16) Data_S: Amount of data sent from WSN node to CH
17) Data_R: Amount of data received from a WSN node of a CH
18) Data_Sent_To_BS: Amount of data sent from WSN node to BS
19) dist_CH_To_BS: Physical distance between CH and BS
20) send_code: Send code for a cluster
21) Expaned_Energy: Energy consumed by a WSN node in the previous round

TABLE II: WSN simulation specifications.
No  Parameter          Value
1   Routing Protocol   LEACH
2   Number of Nodes    100
3   Simulator          NS-2
4   Number of Rounds   14
5   Maximum clusters   7

TABLE III: WSN attack types
No  Attack Name  Explanation
1   Blackhole    An attacker impersonates a CH so that other nodes send it data to be forwarded to the BS, but the attacker does not forward the data.
2   Flooding     An attacker sends out high-power, high-volume CH advertisements that cause sensor nodes to become confused and drain energy faster than they should.
3   Grayhole     Like a blackhole, a grayhole attacker becomes a CH, but it drops packets only with a certain probability. This makes grayhole attacks harder to detect.
4   Scheduling   The attacker who becomes the CH performs scheduling in such a way that all nodes send data at the same time. This causes packet collisions and data loss.
Fig. 2 shows our proposed IDS design. The IDS is implemented on a WSN with the usual LEACH topology, which consists of WSN nodes, CHs, and a BS. We offer a novel IDS concept that incorporates edge computing. Edge computing moves computation from the cloud to an edge device. Here the edge device is the WSN node [19]. This makes computation faster because there is no network communication delay.
B. Imbalanced Data Problem
Data imbalance occurs when the numbers of samples per label in a dataset differ greatly [20]. A majority class is a class whose sample count far exceeds that of the other classes, which are called minority classes. The imbalance ratio (IR) is the ratio of the number of minority data to the number of majority data, while the imbalance degree (ID) is the ratio of the number of minority data to all data [21]. The IR formula is as follows:
$$\mathrm{IR} = \frac{\text{Minority Label Size}}{\text{Majority Label Size}} \times 100\% \tag{1}$$
Then the ID formula is as follows:
$$\mathrm{ID} = \frac{\text{Minority Label Size}}{\text{Dataset Data Size}} \times 100\% \tag{2}$$
Fig. 2: Proposed IDS design for LEACH-based WSN with an
edge computing concept.
An ID of 20% to 40% indicates a mild imbalance degree. An ID of 1% to 20% indicates a moderate imbalance degree. Finally, an ID below 1% indicates an extreme imbalance degree.
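As a worked illustration of Eqs. (1) and (2) and the thresholds above, the following minimal sketch computes IR, ID, and the imbalance degree for each minority class. The class counts are the ones reported for WSN-DS in Section IV. The function names, the handling of values that fall exactly on a threshold, and the label returned for an ID above 40% are assumptions made for this example, since the text does not define them.

```python
# Class sizes of the WSN-DS dataset as reported in Section IV.
CLASS_COUNTS = {
    "normal": 340066,
    "grayhole": 14596,
    "blackhole": 10049,
    "scheduling": 6638,
    "flooding": 3312,
}


def imbalance_degree(id_percent):
    """Map an ID value (in percent) to the degree names used in the text."""
    if id_percent < 1.0:
        return "extreme"
    if id_percent < 20.0:
        return "moderate"
    if id_percent <= 40.0:
        return "mild"
    # Assumption: the text does not name a degree above 40%.
    return "not classified"


def imbalance_report(counts):
    """Return {label: (IR %, ID %, degree)} for every minority class."""
    total = sum(counts.values())
    majority = max(counts, key=counts.get)
    report = {}
    for label, size in counts.items():
        if label == majority:
            continue
        ir = 100.0 * size / counts[majority]  # Eq. (1)
        id_percent = 100.0 * size / total     # Eq. (2)
        report[label] = (ir, id_percent, imbalance_degree(id_percent))
    return report


report = imbalance_report(CLASS_COUNTS)
for label, (ir, id_percent, degree) in report.items():
    print(f"{label:<11} IR={ir:.2f}%  ID={id_percent:.2f}%  {degree}")
```

Run on these counts, the sketch should reproduce the IR and ID columns of Table IV up to rounding.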
C. XGBoost Classification Method
We propose an XGBoost model as a detector in the IDS for WSN. XGBoost is a boosting type of ensemble learning method, that is, a method that repeats training in which each subsequent iteration corrects for the data misclassified so far [22]. The hallmark of ensemble learning is using several weak learners at once. The boosting method can enhance these weak learners by reducing the bias [23]. XGBoost is an extension of gradient boosting. While gradient boosting uses gradient descent to reduce the error of one training iteration in the next iteration, XGBoost instead uses the Newton-Raphson method, with a second-order Taylor approximation of the loss [24]. The gradient $\hat{g}_m(x_i)$ and Hessian $\hat{h}_m(x_i)$ used by the Newton-Raphson method in the XGBoost model $f(x_i)$ are calculated as follows:
$$\hat{g}_m(x_i) = \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}, \quad i \in N, \; m \in M \tag{3}$$
$$\hat{h}_m(x_i) = \frac{\partial^2 L(y_i, f(x_i))}{\partial f(x_i)^2}, \quad i \in N, \; m \in M \tag{4}$$
where $x$ is the input, $y$ is the output, $N$ is the dataset size, $M$ is the number of weak learners, $L(y_i, f(x_i))$ is the loss function, and $f(x) = \hat{f}_{(m-1)}(x)$.
The formula for $f(x)$ that takes $\hat{g}_m(x_i)$ and $\hat{h}_m(x_i)$ into account is as follows:
$$\hat{f}_{(m)}(x) = \alpha \times \underset{\theta \in \phi}{\arg\min} \sum_{i=1}^{N} \frac{1}{2} \hat{h}_m(x_i) \left[ -\frac{\hat{g}_m(x_i)}{\hat{h}_m(x_i)} - \theta(x_i) \right]^2 \tag{5}$$
where $\alpha$ is the learning rate and $\phi$ is the weak learner. Finally, this method performs aggregation by adding up all weak learners.
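The Newton-style update in Eqs. (3) to (5) can be sketched as follows. This is an illustration under stated assumptions, not the XGBoost library and not the authors' implementation: it assumes a binary log loss, a one-dimensional feature, a decision stump as the weak learner, and a toy dataset, and it omits the regularization and tree-growing logic of real XGBoost. Each round fits the stump to the targets $-\hat{g}_m/\hat{h}_m$ with weights $\hat{h}_m$ and adds the scaled stump to the model. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, targets, weights):
    """Weighted least-squares stump: minimizes sum 0.5 * w * (t - theta(x))**2."""
    best = None
    for thr in sorted(set(xs))[:-1]:  # both sides of every split are non-empty
        sums = {True: [0.0, 0.0], False: [0.0, 0.0]}  # side -> [sum w*t, sum w]
        for x, t, w in zip(xs, targets, weights):
            acc = sums[x <= thr]
            acc[0] += w * t
            acc[1] += w
        # The weighted mean of the targets is the optimal constant per side.
        left_val = sums[True][0] / sums[True][1]
        right_val = sums[False][0] / sums[False][1]
        loss = sum(
            0.5 * w * (t - (left_val if x <= thr else right_val)) ** 2
            for x, t, w in zip(xs, targets, weights)
        )
        if best is None or loss < best[0]:
            best = (loss, thr, left_val, right_val)
    _, thr, left_val, right_val = best
    return lambda x: left_val if x <= thr else right_val


def newton_boost(xs, ys, n_learners=20, learning_rate=0.3):
    """Boost stumps with Newton steps on the binary log loss (labels 0/1)."""
    scores = [0.0] * len(xs)  # current model output f(x_i)
    learners = []
    for _ in range(n_learners):
        probs = [sigmoid(s) for s in scores]
        grads = [p - y for p, y in zip(probs, ys)]       # Eq. (3) for log loss
        hess = [p * (1.0 - p) for p in probs]            # Eq. (4) for log loss
        targets = [-g / h for g, h in zip(grads, hess)]  # Newton step per sample
        stump = fit_stump(xs, targets, hess)             # inner argmin of Eq. (5)
        learners.append(stump)
        scores = [s + learning_rate * stump(x) for s, x in zip(scores, xs)]

    def predict_proba(x):
        # Aggregation: the sum of all scaled weak learners.
        return sigmoid(sum(learning_rate * stump(x) for stump in learners))

    return predict_proba


# Toy data (assumption): class 1 for x > 0.5, class 0 otherwise.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = newton_boost(xs, ys)
print([round(model(x), 3) for x in xs])
```

The Hessian weights mean that samples the model is still uncertain about (probability near 0.5) influence the next weak learner the most.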
D. Benchmark methods and Performance Metrics
We benchmark our proposed method against decision trees and naive Bayes. A decision tree is a tree-structured classifier built by an algorithm that chooses each branching based on which feature is more important [25]. The measurement of feature importance uses a metric called the Gini index, which uses the number of labels per feature as its measurement [26]. Naïve Bayes is a classification method that uses the Bayes theorem [27]. The Bayes theorem computes the posterior probability from the likelihood and the prior distribution [28].
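The text does not give the Gini formula. Assuming the standard Gini impurity, $1 - \sum_k p_k^2$ over the class proportions $p_k$ in a node, the following short sketch shows how a candidate split can be scored. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Standard Gini impurity of one node: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def split_gini(left_labels, right_labels):
    """Size-weighted Gini impurity of a two-way split (lower is better)."""
    total = len(left_labels) + len(right_labels)
    return (
        len(left_labels) / total * gini_impurity(left_labels)
        + len(right_labels) / total * gini_impurity(right_labels)
    )


# Toy example: the left branch is pure, the right branch is evenly mixed.
left = ["normal"] * 4
right = ["normal", "normal", "flooding", "flooding"]
print(gini_impurity(left), gini_impurity(right), split_gini(left, right))
```

A tree-building algorithm of this kind would prefer the feature and threshold whose split gives the lowest weighted impurity.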
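A symptom of classification on an imbalanced dataset is unequal performance between the majority class and the minority classes. To measure the performance of each class, we use Precision and Recall. The formulas are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{6}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{7}$$
where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives for the class, respectively.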
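Eqs. (6) and (7) can be applied per class in a one-vs-rest fashion, as in the minimal sketch below. The function name, the toy labels, and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall(y_true, y_pred, positive):
    """Per-class precision and recall, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumption: report 0.0 when the class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0  # Eq. (6)
    recall = tp / (tp + fn) if tp + fn else 0.0     # Eq. (7)
    return precision, recall


# Toy labels (assumption), not taken from WSN-DS.
y_true = ["normal", "normal", "normal", "flooding", "flooding", "grayhole"]
y_pred = ["normal", "normal", "flooding", "flooding", "normal", "grayhole"]

for label in ("normal", "flooding", "grayhole"):
    print(label, precision_recall(y_true, y_pred, label))
```

Reporting the pair for every class keeps a weak minority class visible even when overall accuracy is dominated by the majority class.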
Performance measurement on an imbalanced dataset must use a scale-invariant metric. Otherwise, the majority label data will affect the minority label measurement. Here we use ROC and AUC. The ROC is a curve that describes the relationship between the true positive rate (TPR) and the false positive rate (FPR) of a classifier that outputs the probability of a decision rather than an absolute decision [29]. AUC is the area under the ROC curve, a metric to objectively measure the ROC. The AUC can be calculated with the trapezoidal integral of the ROC, which is as follows:
$$\mathrm{AUC} = \sum_{n=1}^{N-1} \frac{\mathrm{TPR}_n + \mathrm{TPR}_{n+1}}{2} \times (\mathrm{FPR}_{n+1} - \mathrm{FPR}_n) \tag{8}$$
where $N$ is the number of TPR and FPR measurement points.
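The trapezoidal rule of Eq. (8) can be sketched as follows. The helper that builds the ROC points by sweeping a threshold over the scores, the assumption that no two scores are tied, and the toy scores themselves are choices made for this illustration.

```python
def roc_points(scores, labels):
    """Return (FPR list, TPR list) by lowering the threshold one sample at a time.

    Assumes binary labels (1 = positive) and no tied scores.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / negatives)
        tpr.append(tp / positives)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Eq. (8): sum of trapezoid areas between consecutive ROC points."""
    return sum(
        (tpr[n] + tpr[n + 1]) / 2.0 * (fpr[n + 1] - fpr[n])
        for n in range(len(fpr) - 1)
    )


# Toy one-vs-rest scores (assumption): higher means "more likely positive".
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(scores, labels)
auc = trapezoid_auc(fpr, tpr)
print(round(auc, 4))
```

For a multi-class problem such as this one, the same calculation would be repeated once per class, using that class's predicted probability as the score.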
TABLE IV: Imbalance test results.
No  Class       Type      IR      ID      Degree
1   Normal      Majority  −       −
2   Grayhole    Minority  4.29%   3.90%   Moderate
3   Blackhole   Minority  2.96%   2.68%   Moderate
4   Scheduling  Minority  1.95%   1.77%   Moderate
5   Flooding    Minority  0.97%   0.88%   Extreme
IV. RESULTS AND DISCUSSION
A. Results
In the WSN-DS dataset, the normal, grayhole, blackhole, scheduling, and flooding classes contain 340,066, 14,596, 10,049, 6,638, and 3,312 samples, respectively. Table IV shows the results of the IR and ID tests. The majority class is the normal class. The minority class with the highest IR is grayhole, and the one with the lowest is flooding. Grayhole, blackhole, and scheduling are classified as moderately imbalanced. Flooding is a class with extreme imbalance.
We optimized XGBoost before comparing it with the benchmark methods. The disadvantage of XGBoost compared to decision trees is the algorithm's complexity, which increases as more weak learners are used [30]. We therefore look for a trade-off between the minimum number of weak learners and the best precision and recall. Fig. 3 shows the test results. The test shows that XGBoost has linear time complexity with respect to the number of weak learners, or O(M). We take the optimum M as 60.
We compare the optimal XGBoost performance with the benchmark models: decision tree and naive Bayes. Because the dataset is imbalanced, it is important to measure the precision and recall of each class. Fig. 4 compares the precision of each class for the three detection methods implemented. XGBoost has the highest precision in each class, while in the flooding class, the decision tree has a precision of 0, meaning that none of the samples it detects as flooding actually belong to the flooding class.
Fig. 3: The influence of the number of weak learners on
execution time, precision, and recall.