Alberto Cano is an Associate Professor with the Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, United States, where he heads the High-Performance Data Mining laboratory. His research is focused on machine learning, big data, data streams, concept drift, continual learning, GPUs and distributed computing. He is also the Faculty Director of the High Performance Research Computing Core Facility at VCU: https://hprc.vcu.edu/
Areas of Expertise (5)
High Performance Computing
Top 2% of most cited researchers in AI field by Stanford University ranking (professional)
Stanford University Scientist Rankings
Amazon Machine Learning Award (professional)
Hate Speech Detection on Amazon Reviews using Data Stream Mining on Spark and AWS
University of Granada, Spain: Ph.D., Computer Science 2014
University of Cordoba, Spain: M.Sc., Intelligent Systems 2013
University of Granada, Spain: M.Sc., Soft Computing and Intelligent Systems 2011
University of Cordoba, Spain: B.Sc., Computer Science 2010
University of Cordoba, Spain: B.Sc., Computer Engineering 2008
Research Grants (4)
MRI: Track 1 Acquisition of NVIDIA DGX H100 GPU system for research and education at VCU
National Science Foundation
SentimentVoice: Integrating emotion AI and VR in Performing Arts
Commonwealth Cyber Initiative
Integrating emotion AI and VR in Performing Arts
HPRC research computing clusters
State Council of Higher Education for Virginia
HPRC research computing clusters
Multi-Objective Optimization of Inlet Nozzle Design using Artificial Intelligence for Single Tank Thermal Energy Storage
VCU Accelerate Fund
CMSC 508 - Databases
CMSC 603 - High Performance Distributed Systems
High Performance Distributed Systems
Selected Articles (9)
A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental frameworkMachine Learning
G. Aguiar, B. Krawczyk, and A. Cano
Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures and benchmarks on how to evaluate these algorithms. This work proposes a standardized, exhaustive, and comprehensive experimental framework to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data streams algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to a large-scale experimental study comparing state-of-the-art classifiers in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and we provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental framework is fully reproducible and easy to extend with new methods. This way, we propose a standardized approach to conducting experiments in imbalanced data streams that can be used by other researchers to create complete, trustworthy, and fair evaluation of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.
ROSE: Robust Online Self-Adjusting Ensemble for Continual Learning on Imbalanced Drifting Data StreamsMachine Learning
A. Cano and B. Krawczyk
Data streams are potentially unbounded sequences of instances arriving over time to a classifier. Designing algorithms that are capable of dealing with massive, rapidly arriving information is one of the most dynamically developing areas of machine learning. Such learners must be able to deal with a phenomenon known as concept drift, where the data stream may be subject to various changes in its characteristics over time. Furthermore, distributions of classes may evolve over time, leading to a highly difficult non-stationary class imbalance. In this work we introduce Robust Online Self-Adjusting Ensemble (ROSE), a novel online ensemble classifier capable of dealing with all of the mentioned challenges. The main features of ROSE are: (1) online training of base classifiers on variable size random subsets of features; (2) online detection of concept drift and creation of a background ensemble for faster adaptation to changes; (3) sliding window per class to create skew-insensitive classifiers regardless of the current imbalance ratio; and (4) self-adjusting bagging to enhance the exposure of difficult instances from minority classes. The interplay among these features leads to an improved performance in various data stream mining benchmarks. An extensive experimental study comparing with 30 ensemble classifiers shows that ROSE is a robust and well-rounded classifier for drifting imbalanced data streams, especially under the presence of noise and class imbalance drift, while maintaining competitive time complexity and memory consumption. Results are supported by a thorough non-parametric statistical analysis.
Kappa Updated Ensemble for Drifting Data Stream MiningMachine Learning
A. Cano and B. Krawczyk
Learning from data streams in the presence of concept drift is among the biggest challenges of contemporary machine learning. Algorithms designed for such scenarios must take into an account the potentially unbounded size of data, its constantly changing nature, and the requirement for real-time processing. Ensemble approaches for data stream mining have gained significant popularity, due to their high predictive capabilities and effective mechanisms for alleviating concept drift. In this paper, we propose a new ensemble method named Kappa Updated Ensemble (KUE). It is a combination of online and block-based ensemble approaches that uses Kappa statistic for dynamic weighting and selection of base classifiers. In order to achieve a higher diversity among base learners, each of them is trained using a different subset of features and updated with new instances with given probability following a Poisson distribution. Furthermore, we update the ensemble with new classifiers only when they contribute positively to the improvement of the quality of the ensemble. Finally, each base classifier in KUE is capable of abstaining itself for taking a part in voting, thus increasing the overall robustness of KUE. An extensive experimental study shows that KUE is capable of outperforming state-of-the-art ensembles on standard and imbalanced drifting data streams while having a low computational complexity. Moreover, we analyze the use of Kappa vs accuracy to drive the criterion to select and update the classifiers, the contribution of the abstaining mechanism, the contribution of the diversification of classifiers, and the contribution of the hybrid architecture to update the classifiers in an online manner.
Multi-label Punitive kNN with Self-Adjusting Memory for Drifting Data StreamsACM Transactions on Knowledge Discovery from Data
M. Roseberry, B. Krawczyk, and A. Cano
In multi-label learning, data may simultaneously belong to more than one class. When multi-label data arrives as a stream, the challenges associated with multi-label learning are joined by those of data stream mining, including the need for algorithms that are fast and flexible, able to match both the speed and evolving nature of the stream. This paper presents a punitive k nearest neighbors algorithm with a self-adjusting memory (MLSAMPkNN) for multi-label, drifting data streams. The memory adjusts in size to contain only the current concept and a novel punitive system identifies and penalizes errant data examples early, removing them from the window. By retaining and using only data that are both current and beneficial, MLSAMPkNN is able to adapt quickly and efficiently to changes within the data stream while still maintaining a low computational complexity. Additionally, the punitive removal mechanism offers increased robustness to various data-level difficulties present in data streams, such as class imbalance and noise. The experimental study compares the proposal to 24 algorithms using 30 real-world and 15 artificial multi-label data streams on six multi-label metrics, evaluation time, and memory consumption. The superior performance of the proposed method is validated through non-parametric statistical analysis, proving both high accuracy and low time complexity. MLSAMPkNN is a versatile classifier, capable of returning excellent performance in diverse stream scenarios.
Evolving Rule-Based Classifiers with Genetic Programming on GPUs for Drifting Data StreamsPattern Recognition
A. Cano and B. Krawczyk
Designing efficient algorithms for mining massive high-speed data streams has become one of the contemporary challenges for the machine learning community. Such models must display highest possible accuracy and ability to swiftly adapt to any kind of changes, while at the same time being characterized by low time and memory complexities. However, little attention has been paid to designing learning systems that will allow us to gain a better understanding of incoming data. There are few proposals on how to design interpretable classifiers for drifting data streams, yet most of them are characterized by a significant trade-off between accuracy and interpretability. In this paper, we show that it is possible to have all of these desirable properties in one model. We introduce ERulesD2S: evolving rule-based classifier for drifting data Streams. By using grammar-guided genetic programming, we are able to obtain accurate sets of rules per class that are able to adapt to changes in the stream without a need for an explicit drift detector. Additionally, we augment our learning model with new proposals for rule propagation and data stream sampling, in order to maintain a balance between learning and forgetting of concepts. To improve efficiency of mining massive and non-stationary data, we implement ERulesD2S parallelized on GPUs. A thorough experimental study on 30 datasets proves that ERulesD2S is able to efficiently adapt to any type of concept drift and outperform state-of-the-art rule-based classifiers, while using small number of rules. At the same time ERulesD2S is highly competitive to other single and ensemble learners in terms of accuracy and computational complexity, while offering fully interpretable classification rules. Additionally, we show that ERulesD2S can scale-up efficiently to high-dimensional data streams, while offering very fast update and classification times. Finally, we present the learning capabilities of ERulesD2S for sparsely labeled data streams.
Interpretable Multi-view Early Warning System adapted to Underrepresented Student PopulationsIEEE Transactions on Learning Technologies
A. Cano and J.D. Leonard
Early warning systems have been progressively implemented in higher education institutions to predict student performance. However, they usually fail at effectively integrating the many information sources available at universities to make more accurate and timely predictions, they often lack decision-making reasoning to motivate the reasons behind the predictions, and they are generally biased toward the general student body, ignoring the idiosyncrasies of underrepresented student populations (determined by socio-demographic factors such as race, gender, residency, or status as a freshmen, transfer, adult, or first-generation students) that traditionally have greater difficulties and performance gaps. This paper presents a multiview early warning system built with comprehensible Genetic Programming classification rules adapted to specifically target underrepresented and underperforming student populations. The system integrates many student information repositories using multiview learning to improve the accuracy and timing of the predictions. Three interfaces have been developed to provide personalized and aggregated comprehensible feedback to students, instructors, and staff to facilitate early intervention and student support. Experimental results, validated with statistical analysis, indicate that this multiview learning approach outperforms traditional classifiers. Learning outcomes will help instructors and policy-makers to deploy strategies to increase retention and improve academics.
A survey on graphic processing unit computing for large-scale data miningWiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
General purpose computation using Graphic Processing Units (GPUs) is a well‐established research area focusing on high‐performance computing solutions for massively parallelizable and time‐consuming problems. Classical methodologies in machine learning and data mining cannot handle processing of massive and high‐speed volumes of information in the context of the big data era. GPUs have successfully improved the scalability of data mining algorithms to address significantly larger dataset sizes in many application areas. The popularization of distributed computing frameworks for big data mining opens up new opportunities for transformative solutions combining GPUs and distributed frameworks. This survey analyzes current trends in the use of GPU computing for large‐scale data mining, discusses GPU architecture advantages for handling volume and velocity of data, identifies limitation factors hampering the scalability of the problems, and discusses open issues and future directions.
Distributed Nearest Neighbor Classification for Large-Scale Multi-label Data on SparkFuture Generation Computer Systems
J. Gonzalez-Lopez, S. Ventura, and A. Cano
Modern data is characterized by its ever-increasing volume and complexity, particularly when data instances belong to many categories simultaneously. This learning paradigm is known as multi-label classification and one of its most renowned methods is the multi-label k nearest neighbor ( Ml-knn). The traditional implementations of this method are not feasible for large-scale multi-label data due to its complexity and memory restrictions. We propose a distributed Ml-knn implementation based on the MapReduce programming model, implemented on Apache Spark. We compare three strategies for distributed nearest neighbor search: 1) iteratively broadcasting instances, 2) using a distributed tree-based index structure, and 3) building hash tables to group instances. The experimental study evaluates the trade-off between the quality of the predictions and runtimes on 22 benchmark datasets, and compares the scalability using different sizes of data. The results indicate that the tree-based index strategy outperforms the other approaches, having a speedup of up to 266x for the largest dataset, while achieving an accuracy equivalent to the exact methods. This strategy enables Ml-knn to scale efficiently with respect to the size of the problem.
MIRSVM: Multi-Instance Support Vector Machine with Bag RepresentativesPattern Recognition
G. Melki, A. Cano, and S. Ventura
Multiple-instance learning (MIL) is a variation of supervised learning, where samples are represented by labeled bags, each containing sets of instances. The individual labels of the instances within a bag are unknown, and labels are assigned based on a multi-instance assumption. One of the major complexities associated with this type of learning is the ambiguous relationship between a bag’s label and the instances it contains. This paper proposes a novel support vector machine (SVM) multiple-instance formulation and presents an algorithm with a bag-representative selector that trains the SVM based on bag-level information, named MIRSVM. The contribution is able to identify instances that highly impact classification, i.e. bag-representatives, for both positive and negative bags, while finding the optimal class separation hyperplane. Unlike other multi-instance SVM methods, this approach eliminates possible class imbalance issues by allowing both positive and negative bags to have at most one representative, which constitute as the most contributing instances to the model. The experimental study evaluates and compares the performance of this proposal against 11 state-of-the-art multi-instance methods over 15 datasets, and the results are validated through non-parametric statistical analysis. The results indicate that bag-based learners outperform the instance-based and wrapper methods, as well as MIRSVM’s overall superior performance against other multi-instance SVM models, having an average accuracy of 82.6%, which is 2.5% better than the best performing state-of-the-art MI classifier.