See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/220063716 On decision making for dynamic configuration adaptation problem in cognitive radio equipments: a multi-armed bandit based approach Article · March 2010 CITATION READS 1 63 3 authors: Wassim Jouini Christophe Moy École Supérieure d'Electricité Université de Rennes 1 30 PUBLICATIONS 261 CITATIONS 236 PUBLICATIONS 1,172 CITATIONS SEE PROFILE Jacques Palicot École Supérieure d'Electricité 369 PUBLICATIONS 2,490 CITATIONS SEE PROFILE Some of the authors of this publication are also working on these related projects: ESL Application-level Modeling View project ANR MOPCOM View project All content following this page was uploaded by Christophe Moy on 26 June 2014. The user has requested enhancement of the downloaded file. SEE PROFILE On decision making for dynamic configuration adaptation problem in cognitive radio equipments: a multi-armed bandit based approach. Wassim Jouini, Christophe Moy, Jacques Palicot, SUPELEC/IETR, France {wassim.jouini, christophe.moy, jacques.palicot}@supelec.fr set of parameters to operate with great flexibility and efficiency (e.g., change the bandwidth of the devices, switch from one communication protocol to another, minimize the energy consumption of a device, etc.). Soon after the emergence of the SDR field, several scientists have studied ways to control at best these parameters leading to the emergence of a new research field, named Cognitive Radio [1]. The concept of Cognitive Radio (CR) presents itself as the technology that will have the autonomy and the cognitive abilities to become aware of its environment as well as of its own operational abilities. The purpose of this new concept is to meet the user’s expectations, i.e., maximizing his profit without compromising the efficiency of the network. Thus, it presupposes the capacity to collect information from its surrounding environment (perception), to digest it (learning, decision making and predicting problems) and to act in the best possible way by considering several constraints and the available information. Therefore, it is a new paradigm of wireless communication whose purpose is to combine Software Defined Radio technologies and Cognitive Abilities in order to achieve Cognitive Radio equipments. Sensing [2][3] and reconfiguration [4][5] have been quite intensively investigated in the community and are out of the scope of this paper. However, on the decision making side, only a few methods were suggested by the community and most of them are still in their infancy. Eventually, the promises of this new technology are as high as the challenges it sets. The purpose of this paper is twofold: On the one hand, we aim at presenting a quick survey on the several decision making challenges the CR community Abstract— We introduce in this paper the notion of “design space” as a conceptual object that defines a set of cognitive radio decision making problems by their constraints rather than by their degrees of freedom. We identified, in our analysis work, three dimensions of constrains: the environment’s, the equipment’s and the user’s related constrains. Moreover , we define and use the notion of a priori knowledge, to show that the tackled challenges by the radio community to solve configuration adaptation decision making problems have often the same design space, however they differ by the a priori knowledge they assume available. Consequently, we suggest in this paper, the “a priori knowledge” as a classification criteria to discriminate the main proposed techniques in the literature to solve configuration adaptation decision making problems. In the rest of the paper we propose to further study a particular decision making framework where no a priori (or limited) information is provided to the cognitive radio equipment. An approach based on tools borrowed from the multi-armed bandit community is discussed. Finally, our simulation results highlight that by customizing algorithms developed for solving the multi-armed bandit, efficient engineering solutions to some problems met in cognitive radio can indeed be built. Index Terms— Cognitive radio, decision making problems, dynamic configuration adaptation, multiarmed bandit, Upper Confidence Bound, design space, a priori knowledge, survey. I. I NTRODUCTION Recent hardware advances have offered the possibility to design software solutions to problems which were requiring in the past hardwired signal processing devices. With this added software layer, equipments based on this technology, referred to as software defined radios (SDR), are able to control a large 1 has been dealing with during this last 10 years, as well as the main solutions and tools suggested by the CR literature to deal with these challenges. This survey focuses on CR equipments’ based decision making and learning challenges. On the other hand, we complete this survey by tackling a particular online decision making issue where the CR equipment operates in an unknown environment [6]. The outline of the rest of this paper is the following: we start by introducing and defining a conceptual object referred to as design space in Section II. The main purpose of this object is to suggest that the cognitive radio design problem is defined by a set of constrains rather than by its degrees of freedom. We identified, in our analysis work, three dimensions of constrains: the environment’s, the equipment’s and the user’s related constrains. Moreover, in Section III, we define and use the notion of a priori knowledge, to show that the tackled challenges by the radio community to solve configuration adaptation decision making problems have often the same design space, however they differ by the a priori knowledge they assume available. Consequently, in section III, we suggest the “a priori knowledge” as a classification criteria to discriminate the main proposed techniques in the literature to solve configuration adaptation decision making problems. In Section IV, we further detail one particular decision making tool borrowed from the machine learning community. We suggest to use it in a cognitive radio context when dealing with environments where almost no a priori knowledge is available and where the performance evaluation is uncertain. Section V presents several simulations to validate our approach on an academic dynamic configuration adaptation problem. Finally, Section VI concludes. Fig. 1. Cognitive radio decision making context. When designing such CR equipments the main challenge is to find an appropriate way to correctly dimension its cognitive abilities according to its environment as well as to its purpose (i.e., providing a certain service to the user). Several papers in the literature have already been concerned by this matter however their description of the problem usually remained fuzzy (e.g., [1][8][9]). We summerize their analysis by defining three “constraints” on which the design of a CR equipment will depend: First, the constraints imposed by the surrounding environment, then the constraints related to the user’s expectations and finally, the constraints inherent to the equipment. These constraints help dimensioning the CR decision making engine. Consequently, an a priori formulation of these elements helps the designer to implement the right tools in order to obtain a flexible and adequate cognitive radio. 1) The environment constraints: since a cognitive radio is a wireless device that operates in a surrounding communicating environment, it shall respect its rules (e.g., allocated frequency bands, tolerated interference,etc.). Thus the behavior of cognitive radio equipments is highly coordinated by the constraints imposed by the environment. As a matter of fact, if the environment allows no degrees of freedom to the equipments, this latter has no choice but to obey and thus looses all cognitive behavior. On the other side, if no constraints are imposed by the environment, the cognitive radio will still be constrained by its own operational abilities and the expectations of the user. 2) User’s expectations: when using his wireless device for a particular application (voice communication, data, streaming and so on), the user is expecting a certain quality of service. Depending on the awaited quality of service, the cognitive radio can identify several criteria to optimize, such as, minimizing the bit error rate, minimizing energy consump- II. COGNITIVE RADIO DESIGN SPACE A. Cognitive radio design related constraints A Cognitive Radio (CR) equipment can be defined as a communication system aware of its environment as well as of its operational abilities and capable of using them intelligently. Consequently it is assumed that the device has the ability to collect information through its sensors and that it can use that information to adapt itself to its surrounding environment as described in Figure 1. That presupposes cognitive abilities enabling CR equipments to deal with all the collected information in order to make appropriate decisions [1][7]. 2 In Figure 2, we represent two sub-spaces referred to as actual design space and virtual design space. On the one hand, the virtual design space refers to the upper bound support of the design space where every dimension is considered independently from the others. Its volume can be interpreted as the largest space of decision problems one could define from the three dimensions. On the other hand, the actual design space is included in the virtual design space. It results from the reduction of the design space when taking into account the correlation between the different constraints imposed by every dimension of the design space. For instance, some constraints on the environment such as, “imposed fixed waveform” might disable some objectives such as “find a waveform that maximizes the spectral efficiency”. Fig. 2. Cognitive radio decision making design space. tion, maximizing spectral efficiency, etc. If the user is too greedy and imposes too many objectives, the designing problem to solve might become intractable because of the constraints imposed by the surrounding environment and the platform of the cognitive radio. However if the user is expecting nothing, then again there is no need for a flexible cognitive radio. Usually it is assumed that the user is reasonable in a sense that he will accept the best he could get with a minimum cost as long as the quality of service provided is above a certain level. 3) Equipment’s operational abilities: These limitations are perhaps the most obvious since one cannot ask the cognitive radio equipment to adapt itself more than what it can perform (sense and/or act). It is usually assumed in the cognitive radio literature that the equipment is an ideal software defined radio, and thus, that it has all the needed flexibility for the designed framework. On a real application the efficiency of cognitive radio equipments depends of course on the degrees of freedom (or equivalently the constraints) inherent to the wireless platform used to communicate. As examples of commonly analyzed degrees of freedom one can find: modulation, pulse shape, symbol rate, transmit power, etc. C. Dynamic Configuration Adaptation-DCA As an illustrative exemple that we will use for the rest of the paper, we define the design space of the so called dynamic configuration adaptation (DCA) problem. Within this framework, we assume that the environment constrains the cognitive radio by allowing only K possible configurations to use. This condition characterizes the environment and the equipment. Moreover we assume that there exist M ≥ 1 objectives that evaluates how well the equipment performs to meet the users expectations. To conclude, we usually observe in the literature that these characterizations are implicitly made, then final assumptions are done to define the decision making framework. These assumptions concern what we refer to as the “a priori model knowledge”. In the next section, we introduce and explain the notion of a priori knowledge and we present a brief state of the art on decision making for cognitive radio configuration adaptation using the particular DCA design space. We show that although the design space is the same, depending on the a priori model knowledge, different approaches are suggested by the community to tackle the defined decision making problems. B. Design space We denote by cognitive radio design space an abstract three dimensional space that characterizes the CR decision making engine as shown in Figure 2. It is indeed abstract since it does not have any rigorous mathematical meaning but it is only used to visually and conceptually illustrate the dependencies of the CR decision making engine to the ”design dimensions”: environment, parameters (usually referred to as knobs) and objectives (or criteria defined from the user’s expectations). III. DYNAMIC CONFIGURATION ADAPTATION PROBLEM : CHALLENGES AND SUGGESTED APPROACHES The a priori knowledge is a set of assumptions made by the designer on the amount and representation of the available information to the decision making engine when it first deals with the environment. As a matter of fact, “knowledge” is defined by the Oxford English Dictionary as: (i) expertise, and skills 3 approach had a large success especially due to the XG project (neXt Generation) supported by the DARPA (e.g. [10] and for spectrum sharing: [11]). As a matter of fact, if the knowledge is well represented and provided to the equipment as a set of rules, the decision making process becomes very simple. However this approach has a few drawbacks: • The behavior of the designed system is not adapted to a particular user but to all users and to a set of probable environments. Moreover in order to acquaint the CR decision making engine with valuable and large knowledge, an important amount of effort is needed from the designer. • Expert knowledge is mainly based on models. Thus the system might behave in a poor way when it is facing unexpected dynamic in the environment. The techniques based on expert systems can, however be supported by several other tools to help them acquire new knowledge on the environment or help them avoid conflicts between different configuration adaptation rules. acquired by a person through experience or education; the theoretical or practical understanding of a subject, (ii) what is known in a particular field or in total; facts and information or (iii) awareness or familiarity gained by experience of a fact or situation. Consequently, within the cognitive radio framework, we can define the a priori knowledge as the set of theoretical or practical assumptions provided by the designer to the CR decision making engine. These assumptions, if they are accurate, provide the CR with valuable information on the problem to deal with. These remarks lead us to suggest that the decision making problems the cognitive radio will have to deal with are defined by the set {design space, a priori knowledge}. The more accurate the a priori knowledge is the more efficient the cognitive radio can be. In the next subsections we briefly describe the different approaches provided by the community depending on the a priori knowledge assumed relevant to tackle the environment the CR might face during its life time. In Figure 3 we see a suggestion to classify these techniques depending on the a prioi knowledge provided to the cognitive decision making engine. A. Expert approach B. Exploration based decision making: Genetic Algorithms The expert approach relies on the important amount of knowledge collected by telecommunication engineers and researchers. This knowledge is based on theoretical consideration and practical measures on the environment and radio communication parameters. It was first suggested by Mitola in his Ph.D. dissertation on cognitive radio [1]. Through intensive off-line simulations, expert systems are provided with a set of inference rules. These rules are then used on-line to adapt the equipment depending on the context faced by cognitive radio equipments. Thus the more available knowledge the better the equipment can adapt itself to its surrounding dynamic environment. However, this knowledge is usefully as long as if the cognitive radio can represent its knowledge in a way that enables to exploit it and to react to the environment by adequate adaptations of its operating configuration. For that purpose, Mitola suggested representing the knowledge of cognitive radio equipments using a new dedicated language radio communication: “Radio Knowledge Representation Language” (RKRL)[1]. This representation of knowledge uses web semantic such as XML (eXtensible Markup Language), EDF (Resource Description Framework) and OWL (Web Ontology Language). The expert knowledge based In some contexts, one can consider that there is a priori knowledge available on the complex relationships existing between, the metrics observed, the parameters to adapt and the criteria to satisfy as described in Figure 4. In this case the problem appears to be a multi-criteria optimization problem. Within this framework, the CR decision making engine aims at finding the best parameters to meet the users expectations by solving a set of equations as shown in Table II, Figure 4). This problem is known to be complex for several reasons: • there exist no universal definition of optimality in this case. Thus the solution of this problem are satisfactory (or not) with respect to a certain function, usually named fitness that evaluates how well the criteria were satisfied. • Thus usually a large space of possible “good” configurations can be available. • The criteria are correlated and can be in conflict (e.g., Figure 4). If we assume that the previously mentioned off-line expert rule extraction phase has not been (or partially) accomplished an exploration of the space of possible configurations is needed. This defined cognitive radio decision making framework was first analyzed by Christian James 4 Fig. 3. Suggested decision making techniques depending on the assumed a priori knowledge. Fig. 4. Multi-criteria optimization problem [12]. learning tools aim at representing the functional relationship between the environment (through the sensed metrics), the systems parameters and the criteria to satisfy, they need a direct interaction with the environment in order to build a posteriori knowledge on their environment. In this paper we sub classify these methods depending on the way they learn and exploit their rules. On the on hand (i), we find a set of techniques that separates exploration and exploitation phases. On the other hand (ii), we find other techniques more flexible that combine both processes. In the first mentioned case (i) we find several tools such as Artificial Neural Networks or statistical learning already used and exploited in other domain requiring some cognitive abilities (robotics, video games, etc.). These methods have two phases: a phase of pure “exploration” where the CR decision making engine learns and infers to find (explicitly or implicitly) decision making rules, then uses in a second phase this a posteriori knowledge to make decision. Since these learning techniques rely on a first learning phase, a large amount of data and computational power is needed in order to extract reliable knowledge. This difficulty is already known concerning ANN for instance. It is still true for statistical learning. As noticed by Weingart in his paper [15], the provided techniques are still computationally prohibitive, and not ready yet to be used in a real equipment. However if the first phase is well achieved the second phase is usually very simple and doesn’t require much time or Rieser and Thomas W. Rondeau. They suggested the use of Genetic Algorithms (GA) to tackle this framework [8][12]. Genetic algorithms were first designed to mimic Darwin’s evolutionary theory and are well known for their capacity to adapt themselves to a changing environment. Their work showed that under this design space and with the described a priori knowledge, the genetic algorithms provide cognitive radios with an efficient and flexible decision making engine. C. Learning approaches: exploration and exploitation As we argued in the previous subsections and as several other authors [13][9] notices, “Many CR proposals, such as [12][13][14], rely on a priori characterization of these performance metrics which are often derived from analytical models. Unfortunately, [. . . ], this approach is not always practical due to e.g., limiting modeling assumption, non-ideal behaviors in real-life scenarios, and poor scalability” [13]. To avoid these limitations and in order to tackle more realistic scenarios, many methods based on learning techniques were suggested: Artificial Neuronal Networks (ANN), Evolving connectionist systems (ECS), statistical learning, regression models and so on. All of these approaches have their cons and pros, however they all have in common that they mainly rely on the real environment to try and infer from it decision making rules for CR equipments. Since this 5 energy [13]. in the second case (ii), we find promising techniques recently introduced to the community and still need to be further investigated [9][6]. These techniques try to provide the CR with a flexible and incremental learning decision making engine. In the case of ECS based decision making engine, Colson suggested the use of an evolving neural network [16][17]. Unlike the usual ANN, the ECS-NN can change its structure without “forgetting” already learned knowledge. Thus new rules can be learned by adding new neurons to the neural structure. In order to be efficient the architecture proposed in [9] needs some expert advice (a priori knowledge) on the several available configurations. These added information ranks the different configurations based on some criteria (robustness, spectral efficiency, etc.) but without knowing a priori which one is more adequate when facing a certain environment. The suggested tools in [6] however assumes that no a priori knowledge is provided and that the performance of the equipment can only be estimated when trying a specific configuration. These tools are based on the so-called Multi-Armed Bandit (MAB) framework and will be further detailed in Section IV. Fig. 5. Slot representation for a radio equipment controlled by a cognitive decision making engine. A slot is divided into 4 periods. During the first period, the cognitive decision making engine senses the environment and chooses the next configuration. If the new configuration is different from the current one, a reconfiguration is carried out during the second period before communicating. If a reconfiguration is not needed, the CR equipment keeps the current configuration to communicate. At the end of every slot, the cognitive decision making engine computes a reward that evaluates its performance during the communication process. It is assumed here that τ1 + τ2 + τ4 are small with respect to τ3 . IV. DYNAMIC CONFIGURATION ADAPTATION PROBLEM A. General Framework The general framework tackled in this section is described in Figure 6. A particular case of this problem has been introduced to the CR community in a previous paper [6]. In this section we extend the framework to a more realistic scenario. Within this framework (as for the previously analyzed one in [6]) the problem appears as a particular instance of the well know multi-armed bandit problem. A multi-armed bandit is a simple machine learning problem based on an analogy with the traditional slot machine (one armed bandit) but with more than one lever. When pulled at a time t = 0, 1, 2, ..., each lever (or machine) k ∈ {k = 1, ..., K} provides a reward rt drawn from a distribution θk associated to that specific lever. The objective of the gambler is to maximize the collected reward sum through iterative pulls. It is classically assumed that the gambler has no initial knowledge about the levers. However it is important to understand that many CR applications may provide some information that shall be used to design better policies. For the sake of generality, and in order to cope with the worst situations, we ignore on purpose some of that information. The crucial tradeoff the gambler faces at each trial is between “exploitation” of the lever that has the highest expected payoff and “exploration” to get more information about the expected payoffs of the other levers. In this paper we assume that the different payoffs drawn from a machine are independent and identically distributed (i.i.d.) and that the independence of the rewards holds between the machines. However the different machines reward distributions {θ1 , θ2 , ..., θK } are not supposed to be the same. We invite the reader to refer to the previ- To conclude on this first part of the paper, we would like to enhance the fact that the proposed classification in this paper shows that a CR equipment cannot depend on only one core decision making tool but on a pool of techniques. Everytime it faces an environment, the equipment needs to have an estimation of its a priori knowledge and on its reliability. To tackle a particular context, the general process can be summarized through three questions: What can’t I do (design space)? What do I already know (a priori knowledge)? And what technique should I select to solve the decision making problem? In the next section we further detail a particular case of partial monitoring under uncertainty known as multi-armed bandit framework. Within this framework we assume that we only have very limited a priori knowledge on the environment and on the CR itself, which makes senses within a CR framework. The purpose of the method suggested in this section in to offer a balance between exploration and exploitation phase without interrupting the communication process, i.e., while providing a certain service to the user. 6 Thus there are two possible solutions: on one the hand, we could use statistical learning to try and infer a relationship between the performance of one configuration in one context and the performance of the same configuration in a different context. These methods can be efficient; however it is often at the cost of a large overhead in terms of computation time and memory. On the other hand we could assume, if possible, that for two “close” contexts (e.g., SN R1 = 9 dB and SN R2 = 9.05 dB ) the performance of a configuration doesn’t change much. Then we can group several contexts and form a cluster. That would enable us to divide the context into several clusters (e.g., We can represent a large interval of SNRs by several clusters : [0 20]=[0 1]∪[1 2]. . . [19 20]). And finally address locally the learning problem in every cluster as one MAB problem on a fixed context. Consequently, we can duplicate the tools already used in the case of one MAB problem to deal with the case where we have several MAB problems, one in every cluster. Within this framework, every cluster shall have its own learning algorithm to estimate, on average depending on the cluster size, the best configuration. In this paper we prefer the latter approach that is very intuitive and doesn’t cost a lot of the already limited computational resources in a CR equipment. The learning tools used are the same already presented in [6]. In order to make this paper as self-sufficient as possible, we present the main ideas of the so-called Upper Confidence Bound (UCB) indexes in the next paragraph. At every instant t, an upper confidence bound index is computed for every machine k. This upper confidence bound index, denoted by Bk,t,Tk (t) , is computed from the gathered information it until the slot number t and gives an optimistic estimation of the expected reward of machine k. Let Bk,t,Tk (t) denote the index of the policies we are dealing with: 1-CR equipment: • K possible configurations Ck , k ∈ {k = 1, ..., K}, verifying the operational constraints but with unknown performances. • A cognitive decision making engine: can learn and make decisions to help the CR equipment to improve its behavior. 2-Time representation: • Time divided into slots t = 0, 1, 2, ... (Figure 5) • At the beginning of every slot t, the cognitive decision making engine decides to reconfigure or not the CR equipment. 3-Environment and performance evaluation: • Typical observations: SNR, BER, network load, throughput, spectrum bands, etc. • A numerical signal is computed at the end of every slot t and informs the cognitive decision making engine of the performance of the CR equipment. The numerical signal obtained when using configuration Ck is a function of the observations and the configurations. • The numerical results computed with a configuration Ck are assumed to be i.i.d. and drawn from an unknown stochastic distribution θk . Fig. 6. Description of the Dynamic Configuration Adaptation problem. ously mentioned paper [6] for more details about the equivalence between this CR decision making problem and the MAB framework. In the case of CR problems, these distributions θk depend on external parameters that the environment reveals at the beginning of every slot (for instance the SNR in Section V). Thus the dynamic of the problem is the following: first, the equipment senses the context of the environment (e.g., the current SNR), then depending on the outcome of this sensing, chooses a configuration to try. At the end of the transmitted slot, the CR can compute a signal that evaluates its performances during that specific slot. Finally, the CR decision making engine takes into account the new collected information to update its configuration selection policy. Bk,t,Tk (t) = X k,Tk (t) + Ak,t,Tk (t) (1) where X k,Tk (t) is the sample mean of the machine k after been played Tk (t) times at the step t, and Ak,t,Tk (t) is an upper confidence bias added to the sample mean. A policy π computes from it these indexes from which it deduces an action at as follows: B. Suggested approach The tackled framework in [6] corresponds to the herein described problem with a fixed context (e.g., fixed SNR value for all slots). However when the context changes we cannot assume a priori that the acquired knowledge is still valid in the new context. at = π(it ) = arg max(Bk,t,Tk (t) ) k 7 (2) Parameters: K, exploration coefficient α Input: it Output: at Algorithm: If: t ≤ K return at = t + 1 Else: Pt−1 • Tk (t) ← 1 ,1 ∀k m=0 q {Im =k} α. ln(t) • Ak,t,Tk (t) ← Tk (t) , ∀k Pt−1 • • rm .1 Bk,t,Tk (t) ← m=0 Tk (t){Im =k} + Ak,t,Tk (t) , ∀k return at = arg max(Bk,t,Tk (t) ) k Fig. 7. A tabular version of the U CB1 algorithm for selecting the next configuration at . Fig. 8. Performance of the different configurations depending on the SNR. We describe hereafter two specific upper confidence biases Ak,t,Tk (t) that will be used in our simulations. Assuming that the rewards are upper bounded by a positive real b > 0 we find: 1) U CB1 : is defined by Ak,t,Tk (t) such that [18]: s b2 .α. ln(t) Ak,t,Tk (t) = (3) Tk (t) play always this best candidate. However since we usually do not have this optimal division of the context space, we suffer a second loss due to the gap existing between the optimal division of the context space and the actual division of the context space. For our application in a cognitive radio context, we adapt the expression of the regret and suggest the form in Equation (5). Let Im denote the selected configuration at the slot number m then: 2) U CBV : is defined by Ak,t,Tk (t) such that [19]: s 2ξ.Vk (t). ln(t) 3.c.b.ξ. ln(t) + (4) Ak,t,Tk (t) = Tk (t) Tk (t) The regret of a policy π ∈ Π at time t (after t decisions) is defined as follows: E[Rtπ ] = t−1 X (µ∗ (SN Rm ) − µIm (SN Rm )) (5) m=0 where Vk refers to the empirical variance of the configuration k in the particular cluster considered. Finally both of these indexes are very simple to computes as we can see it in Figure 7 where a tabular version of the UCB algorithm is proposes in the case of the U CB1 index. The case of the U CBV index is strictly equivalent. where µk (SN Rm ) is the expected performance of the configuration k, at the slot number m under a context SN Rm . And µ∗ (SN Rm ) = max{µk (SN Rm )} k In the next section, we exploit the herein described algorithms within the DCA problem, implement the proposed approach and discuss the parameters. Then through several simulations, we show that the general implemented has empirically a logarithmic regret. C. Performance Evaluation. To evaluate the performance of these policies, it is convenient to use the notion of “regret”. The general idea behind the “regret” can be summarized as follows: if the gambler knew a priori which one was the best arm, he would only pull that one, and hence, maximize the expectancy of the collected rewards. However, since he lacks that essential information he will suffer unavoidable loss due to suboptimal pulls. In a similar way, if it is possible to find an adequate clustering of the environment’s context such that for every cluster there is one and only one “best candidate”, then the gambler could V. S IMULATIONS A. Experimental protocol For the simulations, we used 5 different configurations denoted by {C1 , C2 , . . . , C5 }. The curves that appear in Figure 8 represent the throughputs TCk of the different configurations Ck , k ∈ {1, 2, . . . , 5}, as a function of the SNR. Their expressions are inspired from real radio communication problems however, for the sake of generality; we only use them as tools for the simulations. 8 Usually, the radio equipments are dimensioned to provide a service within a certain interval of SNRs. This leads to a worst case analysis. Thus if we expect the designed system to provide the user with the highest throughput for a low SNR (around 6 dB in this case), then C1 would be the chosen configuration. However, in a Cognitive Radio context, we aim at finding a way to “jump” from a curve to another depending on the SNR, in order to stay on the curve that maximizes the performance of the equipment for all SN Rs. 454 lets define j = [454 940 454 2 2 940] where j(1) is a parameter associated to C1 , j(2) to the configuration C2 and so on. As for j, let M = [4 8 8 16 16]. And Let np = 20 then the performance criterion used (i.e., throughput in our case) has the following form: Fig. 9. Average cumulated regret when using UCB algorithms to tackle DCA problems.. B. Results j(k).log2 (M ) TCk (SN R) = [1− j(k) + np s 1 3.SN R.log2 (M (k)) (1 − p ).erf c( )] 2.(M (k) − 1) (M (k)) Figure 9 shows the evolution of the average cumulated regret for the different UCB policies. For the two policies, the cumulated regret first increases rather rapidly with the slot number and then more and more slowly. This shows that the CR decision making engines based on UCB policies are able to process the past information in an appropriate way even though such that configurations leading to high rewards are favored with time. Moreover by choosing a cluster size small enough to have a good local approximation on the configuration performances, yet not too small (otherwise the algorithm would spend most of its time exploring), we see that the proposed algorithms have logarithmic regrets which is known to be order optimal for the classic MAB problem. Figure 10 show that although the designed system is facing a randomly changing environment with a high variance, it manages to learn the different optimal configurations depending on the context. In both figures, U CBV has a more satisfactory behavior than U CB1 . It is probably due to the fact that U CBV takes advantage of the variance to orient its learning behavior. However, several questions still remain regarding the dependency of the designed algorithm to its parameters such as the clusters size. As a matter of fact, the system needs to adapt its clusters to optimize its behavior. Moreover, it doesn’t exploit the sparse information on the environment by communicating with the other clusters. All these questions and several others are currently under investigation. The results of these investigations will be suggested to the community in a future work. During the rest of this paper, we consider that the estimations of the throughput received by the CR decision making engine are drawn from Bernoulli distributions θk such that for all SNR we verify TCk (SN R) = E[θk (SN R)] (6) Moreover we consider that the DCA problem exist only for a bounded SN R ∈ [SN Rmin SN Rmax ] and that the SNR follows is a random variable. In this case the variable SN RdB = 10.log10 (SN R) is assumed to follow a Gaussian distribution with mean 10 dB and standard deviation of 4 dB, SN RdBmin = 6 dB and SN RdBmax = 14 dB. The SN R interval was divided into 24 equal clusters in order to have a good learning resolution. The parameters used for the UCB algorithms were chosen to make sure that these algorithm have logarithmic regret. As a matter of fact, theoretical analysis are provided in [18][19] where it is shown that parameters α ≤ 1 (case of U CB1 ) and {ξ ≤ 1 and/or c ≤ 1/3}(case of U CBV ) are risky and could lead to a bad learning behavior of the algorithm. Thus we implemented our work using the critical values α = 1 and {ξ =1 and/or c = 1/3}. Moreover we chose b = 4 as an upper bound of the possible rewards (cf. Figure 8). As a matter of fact in this case b is larger than any possible outcome of the transmission process. 9 [3] C. Moy, A. Bisiaux, and S. Paquelet. An ultra-wide band umbilical cord for cognitive radio systems. PIMRC’05, Berlin, Septembre 2005. [4] A. Kountouris and C. Moy. Reconfiguration in software radio systems. Second Karlsruhe Workshop on Software Radios, Karlshruhe, Germany, 20-21, March 2002. [5] J.P. Delahaye, P. Leray, C. Moy, and J. Palicot. Anaging Dynamic Partial Reconfiguration on Heterogeneous SDR Platforms. SDR Forum Technical Conference05, Anaheim (USA), November 2005. [6] W. Jouini, D. Ernst, C. Moy, and J. Palicot. Multi-armed bandit based policies for cognitive radio’s decision making issues. In Proceedings of the 3rd international conference on Signals, Circuits and Systems (SCS), November 2009. [7] S. Haykin. Cognitive radio: brain-empowered wireless communications. IEEE Journal on Selected Areas in Communications, 23, no. 2:201–220, Feb 2005. [8] C.J. Rieser. Biologically Inspired Cognitive Radio Engine Model Utilizing Distributed Genetic Algorithms for Secure and Robust Wireless Communications and Networking. PhD thesis, Virginia Tech, 2004. [9] N. Colson, A. Kountouris, A. Wautier, and L. Husson. Cognitive decision making process supervising the radio dynamic reconfiguration. In Proceedings of Cognitive Radio Oriented Wireless Networks and Communications, page 7, 2008. [10] DARPA XG Working Group. The XG vision. request for comments. BBN Technologies, Cambridge MA, USA, Tech. Rep. Version 2.0, January 2004. [11] L. Berlemann, S. Mangold, and B. H. Walke. Policy-based reasoning for spectrum sharing in radio networks. In Proceedings of IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks (DySPAN), Baltimore, MD, USA, November 2005. [12] T. W. Rondeau, D. Maldonado, D. Scaperoth, and C.W. Bostian. Cognitive radio formulation and implementation. IEEE Proceedings CROWNCOM, Mykonos, Greece, 2006. [13] N. Baldo and M. Zorzi. Fuzzy logic for cross-layer optimization in cognitive radio networks. IEEE Consumer Communications and Networking Conference, January 2007. [14] Charles Clancy, Joe Hecker, and Erich Stuntebeck. Applications of machine learning to cognitive radio networks. IEEE Wireless Communications Magazine, 14, 2007. [15] T. Weingart, D. Sicker, and D. Grunwald. A statistical method for reconfiguration of cognitive radios. IEEE Wireless Commun. Mag.,vol. 14, no. 4, pp. 3440, August 2007. [16] N. Kasabov. ECOS : Evolving connectionist systems and the eco learning paradigm. International Conference on Neural Information Processing, Kitakyushu, Japan, Oct. 1998. [17] N. Kasabov. Evolving connectionist systems. the knowledge engineering approach. 2nd ed. New York : Springer, 2007. [18] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of multi-armed bandit problems. Machine learning, 47(2/3):235–256, 2002. [19] J.-Y. Audibert, R. Munos, and C. Szepesvri. Tuning bandit algorithms in stochastic environments. In Proceedings of the 18th international conference on Algorithmic Learning Theory, 2007. Fig. 10. Percentage of times a UCB-based policy selects the optimal configuration. VI. C ONCLUSIONS In this paper, we presented a quick yet original state of the art on the different configuration adaptation challenges faced by the cognitive radio decision making community. We suggested that most of these challenges have the same constraints however they differ by the a priori knowledge they assume available. Consequently, we suggested the “a priori knowledge” as a classification criteria to discriminate the main proposed techniques in the literature to solve configuration adaptation decision making problems. Moreover we tackled the configuration adaptation decision making problem when no a priori (or very limited) information is provided to the CR equipment. We argued that this problem is a particular instance of the well known multi-armed bandit paradigm and can be efficiently addressed through UCB algorithms. Our simulation results have highlighted that by customizing algorithms developed for solving the multiarmed bandit, efficient engineering solutions to some problems met in cognitive radio can indeed be built. ACKNOWLEDGMENTS This work was supported by the European Commission in the framework of the FP7 Network of Excellence in Wireless COMmunications NEWCOM++ (contract n. 216715). R EFERENCES [1] J. Mitola. Cognitive radio: An integrated agent architecture for software defined radio. PhD Thesis, Royal Inst. of Technology (KTH), 2000. [2] R. Hachemani, J. Palicot, and C. Moy. A new standard recognition sensor for cognitive radio terminal. EUSIPCO’07, Poznan, Pologne, 3-7 septembre 2007. 10 View publication stats