Bartosz Krawczyk is an assistant professor in the Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA, where he heads the Machine Learning and Stream Mining Lab. He obtained his MSc and PhD degrees from Wroclaw University of Science and Technology, Wroclaw, Poland, in 2012 and 2015, respectively. His research focuses on machine learning, data streams, ensemble learning, class imbalance, one-class classifiers, and interdisciplinary applications of these methods. He has authored 35+ international journal papers and 80+ conference contributions. Dr. Krawczyk has received prestigious awards for his scientific achievements, including the IEEE Richard E. Merwin Scholarship, the IEEE Outstanding Leadership Award, the START award from the Foundation for Polish Science (twice), the scholarship for excellent research achievements from the Polish Minister of Science and Higher Education (twice), the Czeslaw Rodkiewicz Foundation award for merging technical and medical sciences, and the Hugo Steinhaus award for achievements in computer science, among others. He has served as a Guest Editor for four journal special issues (including Information Fusion and Neurocomputing) and as a chair of ten special sessions and workshops (organized at conferences such as ECML-PKDD and ICCS). He is a member of the Program Committees of over 40 international conferences and a reviewer for 30 journals.
IEEE Outstanding Leadership Award (professional)
Best paper award at 9th Computer Recognition Systems Conference CORES (professional)
IEEE Richard E. Merwin Scholarship (professional)
IEEE Travel Award for distinctive paper at World Congress on Computational Intelligence (professional)
Wroclaw University of Science and Technology: Ph.D., Computer Science 2015
Wroclaw University of Science and Technology: M.S., Computer Science 2012
Wroclaw University of Science and Technology: B.S., Computer Science 2011
Event Appearances (3)
10th International Conference on Computer Recognition Systems CORES 2017 Polanica-Zdroj, Poland
12th International Conference on Hybrid Artificial Intelligence Systems HAIS 2017 La Rioja, Spain
Third International Symposium on Signal Processing and Intelligent Recognition Systems SIRS 2017 Manipal, India
Selected Articles (8)
Mining massive and high-speed data streams is among the main contemporary challenges in machine learning. This calls for methods with high computational efficacy that can continuously update their structure and handle an ever-arriving, large number of instances. In this paper, we present a new incremental and distributed classifier based on the popular nearest neighbor algorithm, adapted to such a demanding scenario. This method, implemented in Apache Spark, includes a distributed metric-space ordering to perform faster searches. Additionally, we propose an efficient incremental instance selection method for massive data streams that continuously updates the case-base and removes outdated examples from it. This alleviates the high computational requirements of the original classifier, thus making it suitable for the considered problem. An experimental study conducted on a set of real-life massive data streams proves the usefulness of the proposed solution and shows that we are able to provide the first efficient nearest neighbor solution for high-speed, big, and streaming data.
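The core idea above can be illustrated with a minimal, single-machine sketch of an incremental nearest neighbor classifier with a bounded, continuously updated case-base. This is only an illustration of the concept: the paper's distributed Apache Spark implementation, metric-space ordering, and instance selection strategy are not reproduced here, and the naive oldest-first eviction used below is an assumption for brevity.

```python
from collections import deque
import math

class SlidingWindowKNN:
    """Minimal sketch: incremental kNN over a stream, keeping only the
    most recent examples so outdated instances are dropped automatically.
    (Not the paper's Spark-based method; a didactic simplification.)"""

    def __init__(self, k=3, window_size=1000):
        self.k = k
        self.window = deque(maxlen=window_size)  # oldest examples fall out

    def learn_one(self, x, y):
        # Incremental update: append the new labeled example to the case-base.
        self.window.append((x, y))

    def predict_one(self, x):
        if not self.window:
            return None
        # Linear scan over the case-base by Euclidean distance.
        neighbors = sorted(self.window, key=lambda xy: math.dist(x, xy[0]))[:self.k]
        votes = {}
        for _, label in neighbors:
            votes[label] = votes.get(label, 0) + 1
        return max(votes, key=votes.get)
```

A real streaming deployment would replace the linear scan with the paper's distributed metric-space ordering to keep per-instance cost low.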
In many applications of information systems, learning algorithms have to act in dynamic environments where data are collected in the form of transient data streams. Compared to static data mining, processing streams imposes new computational requirements: algorithms must incrementally process incoming examples while using limited memory and time. Furthermore, due to the non-stationary characteristics of streaming data, prediction models are often also required to adapt to concept drift. Among the many newly proposed stream algorithms, ensembles play an important role, in particular for non-stationary environments. This paper surveys research on ensembles for data stream classification as well as regression tasks. Besides presenting a comprehensive spectrum of ensemble approaches for data streams, we also discuss advanced learning concepts such as imbalanced data streams, novelty detection, active and semi-supervised learning, complex data representations, and structured outputs. The paper concludes with a discussion of open research problems and lines of future research.
Data preprocessing and reduction have become essential techniques in current knowledge discovery scenarios, dominated by increasingly large datasets. These methods aim at reducing the complexity inherent to real-world datasets so that they can be easily processed by current data mining solutions. Advantages of such approaches include, among others, a faster and more precise learning process and a more understandable structure of raw data. However, data preprocessing techniques for data streams still have a long road ahead of them, even though online learning is growing in importance thanks to the development of the Internet and technologies for massive data collection. Throughout this survey, we summarize, categorize, and analyze those contributions on data preprocessing that cope with streaming data. This work also takes into account the existing relationships between the different families of methods (feature and instance selection, and discretization). To enrich our study, we conduct thorough experiments using the most relevant contributions and present an analysis of their predictive performance, reduction rates, computational time, and memory usage. Finally, we offer general advice about existing data stream preprocessing algorithms, as well as discuss emerging future challenges to be faced in the domain of data stream preprocessing.
Canonical machine learning algorithms assume that the numbers of objects in the considered classes are roughly similar. However, in many real-life situations the distribution of examples is skewed, since examples of some of the classes appear much more frequently. This poses a difficulty for learning algorithms, as they will be biased towards the majority classes. In recent years many solutions have been proposed to tackle imbalanced classification, yet they mainly concentrate on binary scenarios. Multi-class imbalanced problems are far more difficult, as the relationships between the classes are no longer straightforward. Additionally, one should analyze not only the imbalance ratio but also the characteristics of the objects within each class. In this paper we present a study on oversampling for multi-class imbalanced datasets that focuses on the analysis of the class characteristics. We detect subsets of specific examples in each class and tune the oversampling for each of them independently. Thus, we are able to use information about the class structure and boost the more difficult and important objects. We carry out an extensive experimental analysis, backed up with statistical analysis, in order to check when the preprocessing of some types of examples within a class may improve on the indiscriminate preprocessing of all the examples in all the classes. The results obtained show that oversampling concrete types of examples may lead to a significant improvement over standard multi-class preprocessing that does not consider the importance of example types.
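The baseline this paper improves upon can be sketched as plain per-class random oversampling, which balances every class to the majority size without regard to example types. The paper's contribution goes further, detecting subsets of examples within each class and tuning the oversampling for each subset independently; that analysis is not reproduced in this simplified sketch.

```python
import random

def random_oversample(X, y, seed=0):
    """Sketch of indiscriminate multi-class oversampling: duplicate
    randomly chosen minority examples until every class matches the
    majority class size. (The paper's type-aware oversampling refines
    this per example subset; this is the baseline only.)"""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(v) for v in by_class.values())  # majority class size
    X_out, y_out = [], []
    for label, examples in by_class.items():
        X_out.extend(examples)
        y_out.extend([label] * len(examples))
        # Duplicate random examples of this class until it is balanced.
        for _ in range(target - len(examples)):
            X_out.append(rng.choice(examples))
            y_out.append(label)
    return X_out, y_out
```

The paper's finding is precisely that replacing this uniform duplication with oversampling targeted at the difficult example types yields significant gains.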
One-class classification is among the most difficult areas of contemporary machine learning. The main problem lies in selecting the model for the data, as we do not have any access to counterexamples and cannot use standard methods for estimating classifier quality. Therefore ensemble methods, which can use more than one model, are a highly attractive solution. With an ensemble approach, we avoid the risk of choosing the weakest model and usually improve the robustness of our recognition system. However, one cannot assume that all classifiers available in the pool are accurate in general – they may have local competence areas in which they should be employed. In this work, we present a dynamic classifier selection method for constructing efficient one-class ensembles. We propose to calculate the competencies of all classifiers for a given validation example and use them to estimate their competencies over the entire decision space with a Gaussian potential function. We introduce three measures of classifier competence designed specifically for one-class problems. A comprehensive experimental analysis, carried out on a number of benchmark datasets and backed up with a thorough statistical analysis, proves the usefulness of the proposed approach.
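The Gaussian potential idea mentioned above can be sketched in a few lines: competence values measured at validation points are extended over the whole decision space by distance-weighted averaging. The specific one-class competence measures introduced in the paper are not reproduced; the bandwidth `sigma` and the normalized weighting below are assumptions of this sketch.

```python
import math

def potential_competence(x, validation_points, competences, sigma=1.0):
    """Estimate a classifier's competence at point x from pointwise
    competences measured on validation examples, using a Gaussian
    potential function as the distance weighting. (Sketch only; the
    paper's three one-class competence measures are omitted.)"""
    num, den = 0.0, 0.0
    for v, c in zip(validation_points, competences):
        w = math.exp(-math.dist(x, v) ** 2 / (2 * sigma ** 2))
        num += w * c
        den += w
    return num / den if den else 0.0
```

Dynamic selection then amounts to evaluating this estimate for every classifier in the pool at the query point and delegating the decision to the most competent one.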
Multi-class imbalanced classification problems occur in many real-world applications, which suffer from highly skewed distributions of classes. Decomposition strategies are well-known techniques for addressing classification problems involving multiple classes. Among them, binary approaches using one-vs-one and one-vs-all have gained significant attention from the research community. They allow us to divide multi-class problems into several easier-to-solve two-class sub-problems. In this study we develop an exhaustive empirical analysis to explore the possibility of empowering the one-vs-one scheme for multi-class imbalanced classification problems by applying binary ensemble learning approaches. We examine several state-of-the-art ensemble learning methods proposed for addressing imbalance problems to solve the pairwise tasks derived from the multi-class dataset. Then an aggregation strategy is employed to combine the binary ensemble outputs and reconstruct the original multi-class task. We present a detailed experimental study of the proposed approach, supported by statistical analysis. The results indicate the high effectiveness of ensemble learning with the one-vs-one scheme in dealing with multi-class imbalanced classification problems.
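The one-vs-one decomposition and aggregation described above can be sketched as follows. One binary learner is trained per pair of classes and their predictions are combined by majority voting; the toy nearest-centroid base learner (1-D features) below is an assumption for self-containment, standing in for the imbalance-aware binary ensembles the study actually plugs into each pairwise task.

```python
from itertools import combinations

class NearestCentroid:
    """Toy binary base learner on 1-D features (assumption of this sketch)."""
    def fit(self, X, y):
        sums, counts = {}, {}
        for xi, yi in zip(X, y):
            sums[yi] = sums.get(yi, 0.0) + xi
            counts[yi] = counts.get(yi, 0) + 1
        self.centroids = {l: sums[l] / counts[l] for l in sums}
        return self
    def predict(self, x):
        return min(self.centroids, key=lambda l: abs(x - self.centroids[l]))

class OneVsOne:
    """One-vs-one decomposition with majority-vote aggregation."""
    def __init__(self, base_factory):
        self.base_factory = base_factory
        self.models = {}

    def fit(self, X, y):
        for a, b in combinations(sorted(set(y)), 2):
            # Each pairwise task sees only examples of its two classes.
            pair = [(xi, yi) for xi, yi in zip(X, y) if yi in (a, b)]
            Xp = [xi for xi, _ in pair]
            yp = [yi for _, yi in pair]
            self.models[(a, b)] = self.base_factory().fit(Xp, yp)
        return self

    def predict(self, x):
        # Aggregation: each pairwise model casts one vote.
        votes = {}
        for model in self.models.values():
            label = model.predict(x)
            votes[label] = votes.get(label, 0) + 1
        return max(votes, key=votes.get)
```

Swapping `NearestCentroid` for an imbalance-aware ensemble (e.g. one built with undersampling or cost-sensitive boosting) per pair is the configuration the study evaluates.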
One of the crucial problems of the classifier ensemble is the so-called combination rule, which is responsible for establishing a single decision from the pool of predictors. The final decision is made on the basis of the outputs of the individual classifiers. At the same time, some of the individuals do not contribute much to the collective decision and may be discarded. This paper discusses how to design an effective combination rule based on the support functions returned by individual classifiers. We are interested in aggregation methods which do not require training, because in many real-life problems we do not have an abundance of training objects or we are working under time constraints. Additionally, we show how to use the proposed operators for simultaneous classifier combination and ensemble pruning. Our proposed schemes have an embedded classifier selection step based on weight thresholding. The experimental analysis, carried out on a set of benchmark datasets and backed up with statistical analysis, proves the usefulness of the proposed method, especially when the number of class labels is high.
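A minimal sketch of this kind of training-free combination is a weighted average of the support functions with embedded pruning by weight thresholding: classifiers whose weight falls below the threshold are simply dropped from the vote. The specific aggregation operators proposed in the paper are not reproduced; the weighted average below is a stand-in.

```python
def combine_supports(support_vectors, weights, threshold=0.1):
    """Training-free combination rule sketch: weighted average of class
    supports with weight-thresholding pruning (classifiers with weight
    below `threshold` are discarded before aggregation).

    support_vectors[i][c] = support of classifier i for class c."""
    # Embedded selection step: prune low-weight classifiers.
    kept = [(s, w) for s, w in zip(support_vectors, weights) if w >= threshold]
    if not kept:
        return None
    n_classes = len(kept[0][0])
    total_w = sum(w for _, w in kept)
    combined = [0.0] * n_classes
    for supports, w in kept:
        for c, s in enumerate(supports):
            combined[c] += w * s / total_w
    # Final decision: class with the highest aggregated support.
    return max(range(n_classes), key=lambda c: combined[c])
```

Because nothing here is fitted, the rule needs no extra training objects, matching the motivation stated in the abstract.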
In this paper, we propose a complete, fully automatic and efficient clinical decision support system for breast cancer malignancy grading. Estimating the level of cancer malignancy is important to assess the degree of its progress and to elaborate a personalized therapy. Our system makes use of both image processing and machine learning techniques to perform the analysis of biopsy slides. Three different image segmentation methods (fuzzy c-means color segmentation, the level set active contours technique, and a grey-level quantization method) are considered for extracting the features used by the proposed classification system. In this classification problem, the highest malignancy grade is the most important to detect early even though it occurs in the lowest number of cases, and hence malignancy grading is an imbalanced classification problem. To overcome this difficulty, we propose the use of an efficient ensemble classifier named EUSBoost, which combines a boosting scheme with evolutionary undersampling to produce balanced training sets for each of the base classifiers in the final ensemble. The evolutionary approach allows us to select the most significant samples for the classifier learning step (in terms of accuracy and a new diversity term included in the fitness function), thus alleviating the problems produced by the imbalanced scenario in a guided and effective way. Experiments carried out on a large dataset collected by the authors confirm the high efficiency of the proposed system, show that the level set active contours technique leads to the extraction of features with the highest discriminative power, and prove that EUSBoost is able to outperform state-of-the-art ensemble classifiers in a real-life imbalanced medical problem.
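The balanced-training-set idea underlying EUSBoost can be sketched with a heavily simplified ensemble: each base learner is trained on a balanced sample in which the majority class is randomly undersampled to the minority size. This is not EUSBoost itself: the evolutionary selection of majority examples and the boosting weight scheme are both replaced here by random undersampling and a plain majority vote, and the toy mean-threshold base learner on 1-D features is an assumption for self-containment.

```python
import random
from statistics import mean

class MeanStump:
    """Toy binary base learner: thresholds a 1-D feature at the midpoint
    of the two class means (assumption of this sketch only)."""
    def fit(self, X, y):
        pos = [x for x, l in zip(X, y) if l == 1]
        neg = [x for x, l in zip(X, y) if l == 0]
        self.cut = (mean(pos) + mean(neg)) / 2
        return self
    def predict(self, x):
        return 1 if x >= self.cut else 0

class UndersamplingEnsemble:
    """Simplified stand-in for EUSBoost: every base learner sees a
    balanced training set built by random majority-class undersampling;
    predictions are combined by majority vote."""
    def __init__(self, n_estimators=5, seed=0):
        self.n_estimators = n_estimators
        self.rng = random.Random(seed)
        self.models = []

    def fit(self, X, y):
        minority = [x for x, l in zip(X, y) if l == 1]
        majority = [x for x, l in zip(X, y) if l == 0]
        for _ in range(self.n_estimators):
            # Balance the training set: undersample the majority class.
            sampled = self.rng.sample(majority, len(minority))
            Xb = minority + sampled
            yb = [1] * len(minority) + [0] * len(sampled)
            self.models.append(MeanStump().fit(Xb, yb))
        return self

    def predict(self, x):
        votes = sum(m.predict(x) for m in self.models)
        return 1 if votes * 2 > len(self.models) else 0
```

EUSBoost replaces the random sampling step with an evolutionary search guided by accuracy and a diversity term, which is what gives it its edge in the imbalanced medical setting described above.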