Bartosz Krawczyk is an assistant professor in the Department of Computer Science, Virginia Commonwealth University, Richmond VA, USA, where he heads the Machine Learning and Stream Mining Lab. He obtained his MSc and PhD degrees from Wroclaw University of Science and Technology, Wroclaw, Poland, in 2012 and 2015 respectively.
His current research interests include machine learning, data streams, class imbalance, continual learning, and explainable artificial intelligence. He has authored more than 60 journal articles and more than 100 contributions to conferences. He has co-authored the book "Learning from Imbalanced Datasets" (Springer, 2018).
Dr. Krawczyk is a Program Committee member for high-ranked conferences, such as KDD (Senior PC member), AAAI, IJCAI, ECML-PKDD, IEEE BigData, and IJCNN. He was a recipient of prestigious awards for his scientific achievements such as the IEEE Richard Merwin Scholarship, the IEEE Outstanding Leadership Award, and the Amazon Machine Learning Award, among others. He served as a Guest Editor for four journal special issues and as the Chair for 20 special session and workshops. He is the member of the editorial board for Applied Soft Computing (Elsevier).
Industry Expertise (3)
Areas of Expertise (3)
Machine learning: ensembles, imbalanced data, one-class classification, kernel methods.
Data stream mining: concept drift, active learning, online classification and regression.
Big data: mining massive datasets, efficient and scalable learning algorithms
Top 2% of most cited researchers in AI field by Stanford University ranking (professional)
Amazon Machine Learning Award (professional)
IEEE Outsanding Leadership Award (professional)
Best paper award at 9th Computer Recognition Systems Conference CORES (professional)
IEEE Richard E. Merwin Scholarship (professional)
IEEE Travel Award for distinctive paper at World Congress on Computational Intelligence (professional)
Wroclaw University of Science and Technology: Ph.D., Computer Science 2015
Wroclaw University of Science and Technology: M.S., Computer Science 2012
Wroclaw University of Science and Technology: B.S., Computer Science 2011
Event Appearances (6)
4th Workshop on Deep Learning Practice and Theory for High-Dimensional Sparse and Imbalanced Data of 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining KDD 2022 Washington DC, USA
3rd Workshop on Learning with Imbalanced Domains of European Conference on Machine Learning and Principles of Data Mining and Knowledge Discovery ECML-PKDD 2021 Bilbao, Spain
19th International Conference on Artificial Intelligence and Soft Computing ICAISC 2020 Zakopane, Poland
1st Workshop on Active Learning of European Conference on Machine Learning and Principles of Data Mining and Knowledge Discovery ECML-PKDD 2017 Skopje, Macedonia
10 International Conference on Computer Recognition Systems CORES 2017 Polanica-Zdroj, Poland
12th International Conference on Hybrid Artificial Intelligence Systems HAIS 2017 La Rioja, Spain
Selected Articles (16)
The class imbalance problem in deep learningMachine Learning
Kushankur Ghosh, Colin Bellinger, Roberto Corizzo, Paula Branco, Bartosz Krawczyk, Nathalie Japkowicz
Deep learning has recently unleashed the ability for Machine learning (ML) to make unparalleled strides. It did so by confronting and successfully addressing, at least to a certain extent, the knowledge bottleneck that paralyzed ML and artificial intelligence for decades. The community is currently basking in deep learning’s success, but a question that comes to mind is: have all of the issues previously affecting machine learning systems been solved by deep learning or do some issues remain for which deep learning is not a bulletproof solution? This question in the context of the class imbalance becomes a motivation for this paper. Imbalance problem was first recognized almost three decades ago and has remained a critical challenge at least for traditional learning approaches. Our goal is to investigate whether the tight dependency between class imbalances, concept complexities, dataset size and classifier performance, known to exist in traditional learning systems, is alleviated in any way in deep learning approaches and to what extent, if any, network depth and regularization can help. To answer these questions we conduct a survey of the recent literature focused on deep learning and the class imbalance problem as well as a series of controlled experiments on both artificial and real-world domains. This allows us to formulate lessons learned about the impact of class imbalance on deep learning models, as well as pose open challenges that should be tackled by researchers in this field.
Adversarial concept drift detection under poisoning attacks for robust data stream miningMachine Learning
Lukasz Korycki, Bartosz Krawczyk
Continuous learning from streaming data is among the most challenging topics in the contemporary machine learning. In this domain, learning algorithms must not only be able to handle massive volume of rapidly arriving data, but also adapt themselves to potential emerging changes. The phenomenon of evolving nature of data streams is known as concept drift. While there is a plethora of methods designed for detecting its occurrence, all of them assume that the drift is connected with underlying changes in the source of data. However, one must consider the possibility of a malicious injection of false data that simulates a concept drift. This adversarial setting assumes a poisoning attack that may be conducted in order to damage the underlying classification system by forcing an adaptation to false data. Existing drift detectors are not capable of differentiating between real and adversarial concept drift. In this paper, we propose a framework for robust concept drift detection in the presence of adversarial and poisoning attacks. We introduce the taxonomy for two types of adversarial concept drifts, as well as a robust trainable drift detector. It is based on the augmented restricted Boltzmann machine with improved gradient computation and energy function. We also introduce Relative Loss of Robustness—a novel measure for evaluating the performance of concept drift detectors under poisoning attacks. Extensive computational experiments, conducted on both fully and sparsely labeled data streams, prove the high robustness and efficacy of the proposed drift detection framework in adversarial scenarios.
Instance exploitation for learning temporary concepts from sparsely labeled drifting data streamsPattern Recognition
Lukasz Korycki, Bartosz Krawczyk
Continual learning from streaming data sources becomes more and more popular due to the increasing number of online tools and systems. Dealing with dynamic and everlasting problems poses new challenges for which traditional batch-based offline algorithms turn out to be insufficient in terms of computational time and predictive performance. One of the most crucial limitations is that we cannot assume having an access to a finite and complete data set – we always have to be ready for new data that may complement our model. This poses a critical problem of providing labels for potentially unbounded streams. In real world, we are forced to deal with very strict budget limitations, therefore, we will most likely face the scarcity of annotated instances, which are essential in supervised learning. In our work, we emphasize this problem and propose a novel instance exploitation technique. We show that when: (i) data is characterized by temporary non-stationary concepts, and (ii) there are very few labels spanned across a long time horizon, it is actually better to risk overfitting and adapt models more aggressively by exploiting the only labeled instances we have, instead of sticking to a standard learning mode and suffering from severe underfitting. We present different strategies and configurations for our methods, as well as an ensemble algorithm that attempts to maintain a sweet spot between risky and normal adaptation. Finally, we conduct a complex in-depth comparative analysis of our methods, using state-of-the-art streaming algorithms relevant for the given problem.
DeepSMOTE: Fusing deep learning and SMOTE for imbalanced dataIEEE Transactions on Neural Networks and Learning Systems
Damien Dablain, Bartosz Krawczyk, Nitesh V Chawla
Despite over two decades of progress, imbalanced data is still considered a significant challenge for contemporary machine learning models. Modern advances in deep learning have further magnified the importance of the imbalanced data problem, especially when learning from images. Therefore, there is a need for an oversampling method that is specifically tailored to deep learning models, can work on raw images while preserving their properties, and is capable of generating high-quality, artificial images that can enhance minority classes and balance the training set. We propose Deep synthetic minority oversampling technique (SMOTE), a novel oversampling algorithm for deep learning models that leverages the properties of the successful SMOTE algorithm. It is simple, yet effective in its design. It consists of three major components: 1) an encoder/decoder framework; 2) SMOTE-based oversampling; and 3) a dedicated loss function that is enhanced with a penalty term. An important advantage of DeepSMOTE over generative adversarial network (GAN)-based oversampling is that DeepSMOTE does not require a discriminator, and it generates high-quality artificial images that are both information-rich and suitable for visual inspection. DeepSMOTE code is publicly available at https://github.com/dd1github/DeepSMOTE.
Radial-Based Oversampling for Multiclass Imbalanced Data ClassificationIEEE Transactions on Neural Networks and Learning Systems
Bartosz Krawczyk, Michal Koziarski, Michal Wozniak
Learning from imbalanced data is among the most popular topics in the contemporary machine learning. However, the vast majority of attention in this field is given to binary problems, while their much more difficult multiclass counterparts are relatively unexplored. Handling data sets with multiple skewed classes poses various challenges and calls for a better understanding of the relationship among classes. In this paper, we propose multiclass radial-based oversampling (MC-RBO), a novel data-sampling algorithm dedicated to multiclass problems. The main novelty of our method lies in using potential functions for generating artificial instances. We take into account information coming from all of the classes, contrary to existing multiclass oversampling approaches that use only minority class characteristics. The process of artificial instance generation is guided by exploring areas where the value of the mutual class distribution is very small. This way, we ensure a smart oversampling procedure that can cope with difficult data distributions and alleviate the shortcomings of existing methods. The usefulness of the MC-RBO algorithm is evaluated on the basis of extensive experimental study and backed-up with a thorough statistical analysis. Obtained results show that by taking into account information coming from all of the classes and conducting a smart oversampling, we can significantly improve the process of learning from multiclass imbalanced data.
Kappa Updated Ensemble for drifting data stream miningMachine Learning
Alberto Cano, Bartosz Krawczyk:
Learning from data streams in the presence of concept drift is among the biggest challenges of contemporary machine learning. Algorithms designed for such scenarios must take into an account the potentially unbounded size of data, its constantly changing nature, and the requirement for real-time processing. Ensemble approaches for data stream mining have gained significant popularity, due to their high predictive capabilities and effective mechanisms for alleviating concept drift. In this paper, we propose a new ensemble method named Kappa Updated Ensemble (KUE). It is a combination of online and block-based ensemble approaches that uses Kappa statistic for dynamic weighting and selection of base classifiers. In order to achieve a higher diversity among base learners, each of them is trained using a different subset of features and updated with new instances with given probability following a Poisson distribution. Furthermore, we update the ensemble with new classifiers only when they contribute positively to the improvement of the quality of the ensemble. Finally, each base classifier in KUE is capable of abstaining itself for taking a part in voting, thus increasing the overall robustness of KUE. An extensive experimental study shows that KUE is capable of outperforming state-of-the-art ensembles on standard and imbalanced drifting data streams while having a low computational complexity. Moreover, we analyze the use of Kappa versus accuracy to drive the criterion to select and update the classifiers, the contribution of the abstaining mechanism, the contribution of the diversification of classifiers, and the contribution of the hybrid architecture to update the classifiers in an online manner.
Combined Cleaning and Resampling algorithm for multi-class imbalanced data with label noiseKnowledge-Based Systems
Michal Koziarski, Michal Wozniak, Bartosz Krawczyk
The imbalanced data classification is one of the most crucial tasks facing modern data analysis. Especially when combined with other difficulty factors, such as the presence of noise, overlapping class distributions, and small disjuncts, data imbalance can significantly impact the classification performance. Furthermore, some of the data difficulty factors are known to affect the performance of the existing oversampling strategies, in particular SMOTE and its derivatives. This effect is especially pronounced in the multi-class setting, in which the mutual imbalance relationships between the classes complicate even further. Despite that, most of the contemporary research in the area of data imbalance focuses on the binary classification problems, while their more difficult multi-class counterparts are relatively unexplored. In this paper, we propose a novel oversampling technique, a Multi-Class Combined Cleaning and Resampling (MC-CCR) algorithm. The proposed method utilizes an energy-based approach to modeling the regions suitable for oversampling, less affected by small disjuncts and outliers than SMOTE. It combines it with a simultaneous cleaning operation, the aim of which is to reduce the effect of overlapping class distributions on the performance of the learning algorithms. Finally, by incorporating a dedicated strategy of handling the multi-class problems, MC-CCR is less affected by the loss of information about the inter-class relationships than the traditional multi-class decomposition strategies. Based on the results of experimental research carried out for many multi-class imbalanced benchmark datasets, the high robust of the proposed approach to noise was shown, as well as its high quality compared to the state-of-art methods.
Adaptive Ensemble Active Learning for Drifting Data Stream MiningTwenty-Eighth International Joint Conference on Artificial Intelligence IJCAI2019
Bartosz Krawczyk, Alberto Cano
Learning from data streams is among the most vital contemporary fields in machine learning and data mining. Streams pose new challenges to learning systems, due to their volume and velocity, as well as ever-changing nature caused by concept drift. Vast majority of works for data streams assume a fully supervised learning scenario, having an unrestricted access to class labels. This assumption does not hold in real-world applications, where obtaining ground truth is costly and time-consuming. Therefore, we need to carefully select which instances should be labeled, as usually we are working under a strict label budget. In this paper, we propose a novel active learning approach based on ensemble algorithms that is capable of using multiple base classifiers during the label query process. It is a plug-in solution, capable of working with most of existing streaming ensemble classifiers. We realize this process as a Multi-Armed Bandit problem, obtaining an efficient and adaptive ensemble active learning procedure by selecting the most competent classifier from the pool for each query. In order to better adapt to concept drifts, we guide our instance selection by measuring the generalization capabilities of our classifiers. This adaptive solution leads not only to better instance selection under sparse access to class labels, but also to improved adaptation to various types of concept drift and increasing the diversity of the underlying ensemble classifier.
Evolving rule-based classifiers with genetic programming on GPUs for drifting data streamsPattern Recognition
Alberto Cano, Bartosz Krawczyk
Designing efficient algorithms for mining massive high-speed data streams has become one of the contemporary challenges for the machine learning community. Such models must display highest possible accuracy and ability to swiftly adapt to any kind of changes, while at the same time being characterized by low time and memory complexities. However, little attention has been paid to designing learning systems that will allow us to gain a better understanding of incoming data. There are few proposals on how to design interpretable classifiers for drifting data streams, yet most of them are characterized by a significant trade-off between accuracy and interpretability. In this paper, we show that it is possible to have all of these desirable properties in one model. We introduce ERulesD2S: evolving rule-based classifier for drifting data Streams. By using grammar-guided genetic programming, we are able to obtain accurate sets of rules per class that are able to adapt to changes in the stream without a need for an explicit drift detector. Additionally, we augment our learning model with new proposals for rule propagation and data stream sampling, in order to maintain a balance between learning and forgetting of concepts. To improve efficiency of mining massive and non-stationary data, we implement ERulesD2S parallelized on GPUs. A thorough experimental study on 30 datasets proves that ERulesD2S is able to efficiently adapt to any type of concept drift and outperform state-of-the-art rule-based classifiers, while using small number of rules. At the same time ERulesD2S is highly competitive to other single and ensemble learners in terms of accuracy and computational complexity, while offering fully interpretable classification rules. Additionally, we show that ERulesD2S can scale-up efficiently to high-dimensional data streams, while offering very fast update and classification times. Finally, we present the learning capabilities of ERulesD2S for sparsely labeled data streams.
Multi-Label Punitive kNN with Self-Adjusting Memory for Drifting Data StreamsACM Transactions on Knowledge Discovery from Data
Martha Roseberry, Bartosz Krawczyk, Alberto Cano
In multi-label learning, data may simultaneously belong to more than one class. When multi-label data arrives as a stream, the challenges associated with multi-label learning are joined by those of data stream mining, including the need for algorithms that are fast and flexible, able to match both the speed and evolving nature of the stream. This article presents a punitive k nearest neighbors algorithm with a self-adjusting memory (MLSAMPkNN) for multi-label, drifting data streams. The memory adjusts in size to contain only the current concept and a novel punitive system identifies and penalizes errant data examples early, removing them from the window. By retaining and using only data that are both current and beneficial, MLSAMPkNN is able to adapt quickly and efficiently to changes within the data stream while still maintaining a low computational complexity. Additionally, the punitive removal mechanism offers increased robustness to various data-level difficulties present in data streams, such as class imbalance and noise. The experimental study compares the proposal to 24 algorithms using 30 real-world and 15 artificial multi-label data streams on six multi-label metrics, evaluation time, and memory consumption. The superior performance of the proposed method is validated through non-parametric statistical analysis, proving both high accuracy and low time complexity. MLSAMPkNN is a versatile classifier, capable of returning excellent performance in diverse stream scenarios.
Dynamic ensemble selection for multi-class classification with one-class classifiersPattern Recognition
Bartosz Krawczyk, Mikel Galar, Michal Wozniak, Humberto Bustince, Francisco Herrera
n this paper we deal with the problem of addressing multi-class problems with decomposition strategies. Based on the divide-and-conquer principle, a multi-class problem is divided into a number of easier to solve sub-problems. In order to do so, binary decomposition is considered to be the most popular approach. However, when using this strategy we may deal with the problem of non-competent classifiers. Otherwise, recent studies highlighted the potential usefulness of one-class classifiers for this task. Despite not using all the available knowledge, one-class classifiers have several desirable properties that may benefit the decomposition task. From this perspective, we propose a novel approach for combining one-class classifiers to solve multi class problems based on dynamic ensemble selection, which allows us to discard non-competent classifiers to improve the robustness of the combination phase. We consider the neighborhood of each instance to decide whether a classifier may be competent or not. We further augment this with a threshold option that prevents from the selection of classifiers corresponding to classes with too little examples in this neighborhood. To evaluate the usefulness of our approach an extensive experimental study is carried out, backed-up by a thorough statistical analysis. The results obtained show the high quality of our proposal and that the dynamic selection of one-class classifiers is a useful tool for decomposing multi-class problems.
Synthetic Oversampling with the Majority Class: A New Perspective on Handling Extreme ImbalanceIEEE International Conference on Data Mining ICDM 2018
Shiven Sharma, Colin Bellinger, Bartosz Krawczyk, Osmar R. Zaïane, Nathalie Japkowicz
The class imbalance problem is a pervasive issue in many real-world domains. Oversampling methods that inflate the rare class by generating synthetic data are amongst the most popular techniques for resolving class imbalance. However, they concentrate on the characteristics of the minority class and use them to guide the oversampling process. By completely overlooking the majority class, they lose a global view on the classification problem and, while alleviating the class imbalance, may negatively impact learnability by generating borderline or overlapping instances. This becomes even more critical when facing extreme class imbalance, where the minority class is strongly underrepresented and on its own does not contain enough information to conduct the oversampling process. We propose a novel method for synthetic oversampling that uses the rich information inherent in the majority class to synthesize minority class data. This is done by generating synthetic data that is at the same Mahalanbois distance from the majority class as the known minority instances. We evaluate over 26 benchmark datasets, and show that our method offers a distinct performance improvement over the existing state-of-the-art in oversampling techniques.
Ensemble learning for data stream analysis: A surveyInformation Fusion
Bartosz Krawczyk, Leandro L. Minku, João Gama, Jerzy Stefanowski, Michal Wozniak
2017 In many applications of information systems learning algorithms have to act in dynamic environments where data are collected in the form of transient data streams. Compared to static data mining, processing streams imposes new computational requirements for algorithms to incrementally process incoming examples while using limited memory and time. Furthermore, due to the non-stationary characteristics of streaming data, prediction models are often also required to adapt to concept drifts. Out of several new proposed stream algorithms, ensembles play an important role, in particular for non-stationary environments. This paper surveys research on ensembles for data stream classification as well as regression tasks. Besides presenting a comprehensive spectrum of ensemble approaches for data streams, we also discuss advanced learning concepts such as imbalanced data streams, novelty detection, active and semi-supervised learning, complex data representations and structured outputs. The paper concludes with a discussion of open research problems and lines of future research.
Nearest Neighbor Classification for High-Speed Big Data Streams Using SparkIEEE Transactions on Systems, Man, and Cybernetics: Systems
Sergio Ramírez-Gallego, Bartosz Krawczyk, Salvador García, Michal Wozniak, José Manuel Benítez, Francisco Herrera
2017 Abstract: Mining massive and high-speed data streams among the main contemporary challenges in machine learning. This calls for methods displaying a high computational efficacy, with ability to continuously update their structure and handle ever-arriving big number of instances. In this paper, we present a new incremental and distributed classifier based on the popular nearest neighbor algorithm, adapted to such a demanding scenario. This method, implemented in Apache Spark, includes a distributed metric-space ordering to perform faster searches. Additionally, we propose an efficient incremental instance selection method for massive data streams that continuously update and remove outdated examples from the case-base. This alleviates the high computational requirements of the original classifier, thus making it suitable for the considered problem. Experimental study conducted on a set of real-life massive data streams proves the usefulness of the proposed solution and shows that we are able to provide the first efficient nearest neighbor solution for high-speed big and streaming data.
A survey on Data Preprocessing for Data Stream Mining: Current status and future directionsNeurocomputing
Sergio Ramírez-Gallego, Bartosz Krawczyk, Salvador García, Michal Wozniak, Francisco Herrera
2017 Data preprocessing and reduction have become essential techniques in current knowledge discovery scenarios, dominated by increasingly large datasets. These methods aim at reducing the complexity inherent to real-world datasets, so that they can be easily processed by current data mining solutions. Advantages of such approaches include, among others, a faster and more precise learning process, and more understandable structure of raw data. However, in the context of data preprocessing techniques for data streams have a long road ahead of them, despite online learning is growing in importance thanks to the development of Internet and technologies for massive data collection. Throughout this survey, we summarize, categorize and analyze those contributions on data preprocessing that cope with streaming data. This work also takes into account the existing relationships between the different families of methods (feature and instance selection, and discretization). To enrich our study, we conduct thorough experiments using the most relevant contributions and present an analysis of their predictive performance, reduction rates, computational time, and memory usage. Finally, we offer general advices about existing data stream preprocessing algorithms, as well as discuss emerging future challenges to be faced in the domain of data stream preprocessing.
Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets.Pattern Recognition
José A. Sáez, Bartosz Krawczyk, Michal Wozniak
2016 Canonical machine learning algorithms assume that the number of objects in the considered classes are roughly similar. However, in many real-life situations the distribution of examples is skewed since the examples of some of the classes appear much more frequently. This poses a difficulty to learning algorithms, as they will be biased towards the majority classes. In recent years many solutions have been proposed to tackle imbalanced classification, yet they mainly concentrate on binary scenarios. Multi-class imbalanced problems are far more difficult as the relationships between the classes are no longer straightforward. Additionally, one should analyze not only the imbalance ratio but also the characteristics of the objects within each class. In this paper we present a study on oversampling for multi-class imbalanced datasets that focuses on the analysis of the class characteristics. We detect subsets of specific examples in each class and fix the oversampling for each of them independently. Thus, we are able to use information about the class structure and boost the more difficult and important objects. We carry an extensive experimental analysis, which is backed-up with statistical analysis, in order to check when the preprocessing of some types of examples within a class may improve the indiscriminate preprocessing of all the examples in all the classes. The results obtained show that oversampling concrete types of examples may lead to a significant improvement over standard multi-class preprocessing that do not consider the importance of example types.