XGBoost for IDS on WSN Cyber Attacks

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/366433530

XGBoost for IDS on WSN Cyber Attacks with Imbalanced Data

Conference Paper · November 2022

DOI: 10.1109/ISESD56103.2022.9980630

4 authors, including:

Aji Gautama Putrada

Telkom University

Syafrial Fachri Pane

Politeknik Pos Indonesia

Mohamad Nurkamal Fauzan

Universitas Logistik dan Bisnis Internasional

All content following this page was uploaded by Aji Gautama Putrada on 24 March 2024.

XGBoost for IDS on WSN Cyber Attacks with Imbalanced Data

Aji Gautama Putrada

Advanced and Creative Networks

Research Center

Telkom University

Bandung, Indonesia

ajigps@telkomuniversity.ac.id

Nur Alamsyah

Advanced and Creative Networks

Research Center

Telkom University

Bandung, Indonesia

[email protected]versity.ac.id

Syafrial Fachri Pane

Advanced and Creative Networks

Research Center

Telkom University

Bandung, Indonesia

[email protected]versity.ac.id

Mohamad Nurkamal Fauzan

Advanced and Creative Networks

Research Center

Telkom University

Bandung, Indonesia

[email protected]versity.ac.id

Abstract—A wireless sensor network (WSN) is also vulnerable

to cyber attacks, just like other systems connected to a computer network, which makes the intrusion detection system (IDS) for WSN an interesting research topic. However, IDS datasets are usually imbalanced because attacks occur at a low frequency. This study proposes the application of XGBoost in IDS on WSN cyber attacks that experience imbalanced data. We obtained the attack dataset on WSN from Kaggle, which contains data on blackhole, grayhole, flooding, and scheduling attacks. We use decision trees and naive Bayes to benchmark the performance of our proposed method. We then use precision, recall, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) to evaluate our IDS model. The test results show that three classes have moderately imbalanced data, while one class, the flooding attack class, has severely imbalanced data. Compared to the two benchmark methods, decision tree and naive Bayes, XGBoost has the best AUC for the scheduling, normal, grayhole, flooding, and blackhole classes, with values of 0.987, 0.9963, 0.9994, 0.9997, and 0.9999, respectively.

Index Terms—intrusion detection system, wireless sensor network, extreme gradient boosting, data imbalance

I. INTRODUCTION

A wireless sensor network (WSN) is an emerging topic in which, as the name suggests, sensors are spread out and connected to a computer network to monitor certain values in their deployment environment [1]. WSN research covers the optimization of network topology [2], cluster head selection [3], and routing [4]. WSN application areas include agriculture [5] and gas and fire detection [6]. Because it connects to a computer network, a WSN is also vulnerable to cyber attacks, so the intrusion detection system (IDS) for WSN is also a concern [7].

Thank you to the Directorate of Research and Community Service (PPM)

Telkom University for funding this research.

An IDS can use several machine learning methods as detection methods in WSN. Gite et al. [8] implemented a decision tree on WSN to detect blackhole, wormhole, grayhole, and distributed denial of service (DDoS) attacks with an accuracy of 70%. Mehmood et al. [9] built an IDS to detect DDoS flooding on WSN with naïve Bayes. However, IDS datasets are usually imbalanced because attacks usually occur at a low frequency [10].

Several studies use extreme gradient boosting (XGBoost) as the detection method on imbalanced data [11]. Qiu et al. [12] applied XGBoost to credit card fraud detection and showed that XGBoost performs better than other methods on imbalanced data. Applying XGBoost to imbalanced data in IDS for WSN is therefore a research opportunity.

This study proposes the application of XGBoost in IDS on WSN cyber attacks that experience imbalanced data. We obtain the attack dataset on WSN from Kaggle, which contains data on blackhole, grayhole, flooding, and scheduling attacks. We use decision trees and naive Bayes to benchmark the performance of our proposed method. We use precision, recall, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) to evaluate our IDS model.

To the best of our knowledge, no previous study has applied XGBoost for IDS on WSN cyber attacks with imbalanced data. Our research contributions are as follows:

1) a fast IDS for WSN with an optimized prediction model

2) a novel IDS concept using edge computing

3) a model that gives the best results for scheduling attack detection

The remainder of this paper is organized as follows: Section II discusses related works. Section III presents our research design. Section IV reports the test results and discusses them against state-of-the-art papers. Finally, Section V emphasizes the important findings of this study.

II. RELATED WORKS

Several studies have applied IDS to WSN. Sunder et al. [13] applied the Jensen–Shannon divergence method to blackhole attacks on WSN in healthcare and obtained a detection rate of up to 97%. Lakshmi et al. [14] simulated a flooding attack on WSN using ad hoc on-demand distance vector (AODV) routing and resisted the attack using threshold restrictions, so that one node can only send a limited number of packets at a time; if a node violates the threshold, the system blacklists it. Ye et al. [15] detected grayhole attacks on WSN using fuzzy logic, and their research succeeded in increasing the detection accuracy by 4.5 times for 125 grayhole attacks. Finally, Shahid et al. [16] created the cellular automata energy drainage prevention (CA-EDP) scheme, which can detect scheduling attacks and increase WSN lifespan by up to 11%. We identify the research gap as an XGBoost-based IDS that can detect blackhole, flooding, grayhole, and scheduling attacks on WSN. Table I compares studies related to our method.

III. RESEARCH DESIGN

Fig. 1 shows our research methodology. First, we took the

WSN-DS dataset from Kaggle and then observed and analyzed

the dataset. The second step is to train our XGBoost using that

dataset. Third, we compare the performance of our XGBoost

with benchmark methods, namely decision tree and naive

Bayes. Finally, we report the test results.

A. IDS for WSN

We retrieved the WSN-DS dataset by Almomani et al. [17] from Kaggle. The dataset results from a WSN with the low-energy adaptive clustering hierarchy (LEACH) routing protocol. The WSN has 100 nodes, and network simulator 2 (NS-2) simulates it for 14 rounds. These rounds produce up to 7 clusters. Table II summarizes the simulation specifications.

In LEACH, the cluster head (CH) forwards data from the WSN nodes to the base station (BS). The simulated WSN is subjected to four types of attacks: blackhole, flooding, grayhole, and scheduling. In all four, the attacker attempts to become a fraudulent CH and then launches a different attack.

TABLE I: Related Works Comparison on IDS for WSN

                 IDS for WSN
Cite             XGBoost  B(a)  F(a)  G(a)  S(a)
[13]             ✗        ✓     ✗     ✗     ✗
[14]             ✗        ✗     ✓     ✗     ✗
[15]             ✗        ✗     ✗     ✓     ✗
[16]             ✗        ✗     ✗     ✗     ✓
Proposed Method  ✓        ✓     ✓     ✓     ✓

(a) B = Blackhole, F = Flooding, G = Grayhole, S = Scheduling.

Fig. 1: The IDS on WSN research methodology.

Grayhole and blackhole are similar attacks. The difference

is that the grayhole drops packets with a certain probability,

which makes grayhole attacks more challenging to detect [18].

Table III summarizes the descriptions of the four attacks.

The WSN-DS dataset contains 374,661 records and 21 features. The following is an explanation of each feature:

1) id: Unique name for each WSN node

2) Time: Timestamp of the measured data

3) Is_CH: Flag indicating whether the WSN node is a CH or not

4) Who_CH: Node id of the CH of a WSN node

5) RSSI: Signal strength between the WSN node and its CH

6) Dist_to_CH: Physical distance between the WSN node and its CH

7) M_D_CH: Maximum distance between a WSN node and the CH in a cluster

8) A_D_CH: Average distance of all WSN nodes to the CH in a cluster

9) ADV_S: Number of advertised broadcast messages sent from the CH to WSN nodes

10) ADV_R: Number of advertised broadcast messages received by WSN nodes from the CH

11) JOIN_S: Number of join request messages sent from a WSN node to the CH

12) JOIN_R: Number of join request messages received by a CH from a WSN node

13) ADV_SCH_S: Number of advertising scheduling messages sent by a CH

14) ADV_SCH_R: Number of advertising scheduling messages that a WSN node receives from a CH

15) Rank: The order of the WSN node in the scheduling message

16) Data_S: Amount of data sent from a WSN node to the CH

17) Data_R: Amount of data received by a CH from a WSN node

18) Data_Sent_To_BS: Amount of data sent from a WSN node to the BS

19) dist_CH_To_BS: Physical distance between the CH and the BS

20) send_code: Send code for a cluster

21) Expaned_Energy: Energy consumed by a WSN node in the previous round

TABLE II: WSN simulation specifications.

No  Parameter          Value
1   Routing Protocol   LEACH
2   Number of Nodes    100
3   Simulator          NS-2
4   Number of Rounds   14
5   Maximum Clusters   7

TABLE III: WSN attack types

No  Attack Name  Explanation
1   Blackhole    An attacker impersonates a CH so that other nodes send it data to be forwarded to the BS, but the attacker does not forward the data.
2   Flooding     An attacker sends out high-power, high-volume CH advertisements that confuse sensor nodes and drain their energy faster than normal.
3   Grayhole     As in a blackhole attack, the attacker becomes a CH, but it drops packets only with a certain probability. This makes grayhole attacks harder to detect.
4   Scheduling   The attacker who becomes the CH schedules transmissions so that all nodes send data at the same time. This causes packet collisions and data loss.

Fig. 2 shows our proposed IDS design. The IDS is implemented on a WSN with the usual LEACH topology, which consists of WSN nodes, CHs, and a BS. We offer a novel IDS concept that incorporates edge computing. Edge computing moves computation from the cloud to an edge device, which here is the WSN node [19]. This makes computation faster because there is no network communication delay.

B. Imbalanced Data Problem

Data imbalance occurs when the numbers of samples per label in a dataset are unequal [20]. A majority class is a class whose sample count far exceeds that of the other classes, which are called minority classes. The imbalance ratio (IR) is the ratio of the number of minority samples to the number of majority samples, while the imbalance degree (ID) is the ratio of the number of minority samples to all samples [21].

The IR formula is as follows:

$$\mathrm{IR} = \frac{\text{Minority Label Size}}{\text{Majority Label Size}} \times 100\% \tag{1}$$

Then the ID formula is as follows:

$$\mathrm{ID} = \frac{\text{Minority Label Size}}{\text{Dataset Data Size}} \times 100\% \tag{2}$$

Fig. 2: Proposed IDS design for LEACH-based WSN with an

edge computing concept.

An ID of 20% to 40% indicates a mild imbalance degree, an ID of 1% to 20% indicates a moderate imbalance degree, and an ID below 1% indicates an extreme imbalance degree.
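As an illustration, Equations (1) and (2) and the degree thresholds above can be sketched in a few lines of Python. This is a minimal sketch and not the authors' code: the class counts are the ones reported for WSN-DS in the Results section, while the function names, the handling of an ID of exactly 20%, and the "Balanced" label for an ID above 40% are assumptions made only for this example.

```python
# Class counts as reported for the WSN-DS dataset in this paper.
CLASS_COUNTS = {
    "Normal": 340066,
    "Grayhole": 14596,
    "Blackhole": 10049,
    "Scheduling": 6638,
    "Flooding": 3312,
}


def imbalance_degree_label(id_percent):
    """Map an ID value (in percent) to the degree names used in the text."""
    if id_percent < 1.0:
        return "Extreme"
    if id_percent < 20.0:  # assumption: exactly 20% is treated as mild
        return "Moderate"
    if id_percent <= 40.0:
        return "Mild"
    return "Balanced"  # assumption: the text gives no name above 40%


def imbalance_report(counts):
    """Return {minority class: (IR %, ID %, degree)} for every minority class."""
    majority = max(counts, key=counts.get)
    total = sum(counts.values())
    report = {}
    for name, size in counts.items():
        if name == majority:
            continue
        ir = size / counts[majority] * 100.0  # Eq. (1)
        id_ = size / total * 100.0            # Eq. (2)
        report[name] = (ir, id_, imbalance_degree_label(id_))
    return report


REPORT = imbalance_report(CLASS_COUNTS)
for name, (ir, id_, degree) in REPORT.items():
    print(f"{name:<11} IR={ir:.2f}%  ID={id_:.2f}%  {degree}")
```

The largest class is taken as the majority class, so every other class is reported against it.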

C. XGBoost Classification Method

We propose an XGBoost model as the detector in the IDS for WSN. XGBoost is a boosting type of ensemble learning method, that is, a method that repeats training so that each subsequent iteration adjusts for the data misclassified in earlier iterations [22]. The hallmark of ensemble learning is the use of several weak learners at once. The boosting method can enhance these weak learners by reducing the bias [23]. XGBoost is an extension of gradient boosting. Whereas gradient boosting uses gradient descent to reduce the error of one training iteration in the next, XGBoost uses the Newton-Raphson method, which requires both the gradient and the Hessian of the loss, and approximates the loss with a Taylor expansion [24]. The gradient $\hat{g}_m(x_i)$ and Hessian $\hat{h}_m(x_i)$ used by the Newton-Raphson method in the XGBoost model $f(x_i)$ are calculated as follows:

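$$\hat{g}_m(x_i) = \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}, \quad i \in N,\ m \in M \tag{3}$$

$$\hat{h}_m(x_i) = \frac{\partial^2 L(y_i, f(x_i))}{\partial f(x_i)^2}, \quad i \in N,\ m \in M \tag{4}$$

where $x$ is the input, $y$ is the output, $N$ is the dataset size, $M$ is the number of weak learners, $L(y_i, f(x_i))$ is the loss function, and $f(x) = f_{m-1}(x)$.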

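The formula for $f(x)$ that takes $\hat{g}_m(x_i)$ and $\hat{h}_m(x_i)$ into account is as follows:

$$f_{(m)}(x) = \alpha \times \underset{\theta \in \phi}{\arg\min} \sum_{i=1}^{N} \frac{1}{2} \hat{h}_m(x_i) \left[ -\frac{\hat{g}_m(x_i)}{\hat{h}_m(x_i)} - \theta(x_i) \right]^2 \tag{5}$$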

where $\alpha$ is the learning rate and $\phi$ is the space of weak learners from which $\theta$ is chosen. Finally, the method aggregates by summing all weak learners.
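The Newton-style update in Equations (3) to (5) can be illustrated with a deliberately small sketch. It is not the XGBoost implementation: it assumes a binary logistic loss, replaces the regression-tree weak learner with a single constant $\theta$ (for which the minimizer of Equation (5) has the closed form $-\sum_i \hat{g}_i / \sum_i \hat{h}_i$), and omits regularization. The learning rate of 0.3, the toy labels, and all function names are assumptions for this example; the 60 rounds mirror the number of weak learners chosen later in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess_logloss(y, f):
    """Gradient and Hessian of the logistic loss w.r.t. the raw score f (Eqs. 3-4)."""
    p = sigmoid(f)
    return p - y, p * (1.0 - p)


def fit_constant_learner(g, h):
    """Minimize sum_i 0.5 * h_i * (-g_i / h_i - theta)^2 over a constant theta."""
    return -sum(g) / sum(h)


def newton_boost(y, n_rounds=60, alpha=0.3):
    """Add n_rounds constant weak learners, each scaled by the learning rate alpha."""
    f = [0.0] * len(y)  # raw scores start at zero
    for _ in range(n_rounds):
        pairs = [grad_hess_logloss(yi, fi) for yi, fi in zip(y, f)]
        g = [gi for gi, _ in pairs]
        h = [hi for _, hi in pairs]
        theta = fit_constant_learner(g, h)
        f = [fi + alpha * theta for fi in f]  # aggregation: sum of weak learners
    return f


# Hypothetical toy labels: 2 positives out of 10 samples.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
scores = newton_boost(labels)
predicted_rate = sigmoid(scores[0])
print(f"predicted positive rate: {predicted_rate:.4f}")
```

Because the weak learner here is a constant, the model can only learn the base rate of the positive class; with tree-based weak learners the same gradient and Hessian statistics are accumulated per leaf instead of over the whole dataset.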

D. Benchmark Methods and Performance Metrics

We benchmark our proposed method against decision trees and naive Bayes. A decision tree is a tree-structured classifier built by an algorithm that branches on whichever feature is most important [25]. Feature importance is measured with a metric called the Gini index, which is based on the number of labels per feature [26]. Naïve Bayes is a classification method that uses Bayes' theorem [27]. Bayes' theorem determines the posterior probability from the likelihood and the prior distribution [28].

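A symptom of classification on an imbalanced dataset is unequal performance between the majority class and the minority classes. To measure the performance on each class, we use precision and recall, defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{6}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{7}$$

where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives for the class, respectively.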
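For a multi-class problem such as this one, Equations (6) and (7) are evaluated per class in a one-vs-rest fashion. The sketch below shows one way to do that with plain Python; the function name, the toy labels, and the convention of returning 0 when a denominator is zero are assumptions for illustration and are not taken from the paper.

```python
def precision_recall_per_class(y_true, y_pred):
    """Return {label: (precision, recall)} using one-vs-rest counts."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Assumption: report 0.0 when the class is never predicted or never present.
        precision = tp / (tp + fp) if (tp + fp) else 0.0  # Eq. (6)
        recall = tp / (tp + fn) if (tp + fn) else 0.0     # Eq. (7)
        scores[label] = (precision, recall)
    return scores


# Hypothetical toy labels, not drawn from WSN-DS.
y_true = ["Normal", "Normal", "Normal", "Normal", "Flooding", "Flooding", "Grayhole"]
y_pred = ["Normal", "Normal", "Normal", "Flooding", "Flooding", "Normal", "Grayhole"]

SCORES = precision_recall_per_class(y_true, y_pred)
for label, (precision, recall) in SCORES.items():
    print(f"{label:<9} precision={precision:.2f} recall={recall:.2f}")
```

Reporting the two values per class keeps a rare class from being hidden behind the majority class, which is what a single overall accuracy figure would do.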

Performance measurement on an imbalanced dataset must use a scale-invariant metric. Otherwise, the majority label data will affect the minority label measurement. Here we use the ROC curve and AUC. The ROC curve describes the relationship between the true positive rate (TPR) and the false positive rate (FPR) of a classifier that outputs a probability for a decision rather than an absolute decision [29]. AUC is the area under the ROC curve, a metric that summarizes the ROC curve objectively as a single value. AUC can be calculated with the trapezoidal integral of the ROC curve, as follows:

$$\mathrm{AUC} = \sum_{n=1}^{N-1} \frac{TPR_n + TPR_{n+1}}{2} \times (FPR_{n+1} - FPR_n) \tag{8}$$

where $N$ is the number of TPR and FPR measurements.
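The trapezoidal rule in Equation (8) translates directly into code. The following is a minimal sketch under the assumption that the ROC curve is given as two equal-length lists of FPR and TPR values; the function name and the sample points are hypothetical.

```python
def trapezoidal_auc(fpr, tpr):
    """Area under an ROC curve given as paired FPR/TPR measurements (Eq. 8)."""
    if len(fpr) != len(tpr) or len(fpr) < 2:
        raise ValueError("need at least two paired (FPR, TPR) measurements")
    points = sorted(zip(fpr, tpr))  # integrate from left to right along the FPR axis
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (y0 + y1) / 2.0 * (x1 - x0)
    return area


# Hypothetical ROC points for illustration.
example_auc = trapezoidal_auc([0.0, 0.1, 0.4, 1.0], [0.0, 0.6, 0.9, 1.0])
print(f"AUC = {example_auc:.3f}")
```

A classifier that separates the classes perfectly passes through the point (0, 1) and yields an area of 1, while the diagonal of a random classifier yields 0.5.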

TABLE IV: Imbalance test results.

No  Class       Type      IR     ID     Degree
1   Normal      Majority  −      −      −
2   Grayhole    Minority  4.29%  3.90%  Moderate
3   Blackhole   Minority  2.96%  2.68%  Moderate
4   Scheduling  Minority  1.95%  1.77%  Moderate
5   Flooding    Minority  0.97%  0.88%  Extreme

IV. RESULTS AND DISCUSSION

A. Results

In the WSN-DS dataset, the numbers of samples in the normal, grayhole, blackhole, scheduling, and flooding classes are 340,066, 14,596, 10,049, 6,638, and 3,312, respectively. Table IV shows the results of the IR and ID tests. The majority class is the normal class. The minority class with the highest IR is grayhole, and the one with the lowest is flooding. Three classes have moderate imbalance: grayhole, blackhole, and scheduling. Flooding is the class with extreme imbalance.

We optimized XGBoost before comparing it with the benchmark methods. The disadvantage of XGBoost compared to decision trees is the algorithm's complexity, which grows as more weak learners are used [30]. We therefore look for a trade-off between the minimum number of weak learners and the best precision and recall. Fig. 3 shows the test results. The test shows that XGBoost has linear time complexity with respect to the number of weak learners, or O(M). We take the optimum M as 60.
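The paper does not state the exact rule used to pick M. One plausible rule, sketched below purely as an assumption, is to take the smallest number of weak learners whose score is within a small tolerance of the best score observed in the sweep. The function name, the tolerance, and the sweep values are hypothetical placeholders and are not the measurements behind Fig. 3.

```python
def smallest_good_m(results, tolerance=0.001):
    """Pick the smallest M whose score is within `tolerance` of the best score.

    `results` is a list of (M, score) pairs from a sweep over the number of
    weak learners; `score` could be any quality metric, e.g. mean recall.
    """
    best = max(score for _, score in results)
    candidates = [m for m, score in results if score >= best - tolerance]
    return min(candidates)


# Hypothetical sweep values for illustration only; NOT the paper's results.
sweep = [(10, 0.90), (20, 0.95), (40, 0.98), (60, 0.995), (80, 0.9955), (100, 0.9956)]
CHOSEN_M = smallest_good_m(sweep)
print(f"chosen number of weak learners: {CHOSEN_M}")
```

Since training time grows linearly with M, preferring the smallest adequate M trades a negligible loss in score for a proportional saving in execution time.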

We compare the optimized XGBoost with the benchmark models: decision tree and naive Bayes. Because the dataset is imbalanced, it is important to measure the precision and recall of each class. Fig. 4 compares the precision of each class for the three detection methods implemented. XGBoost has the highest precision in each class, while in the flooding class the decision tree has a precision of 0, meaning that none of the samples it detects as flooding actually belong to the flooding class.

Fig. 3: The influence of the number of weak learners on execution time, precision, and recall.
