Alberto Cano, Ph.D. - VCU College of Engineering. Richmond, VA, US

Alberto Cano, Ph.D.

Associate Professor | VCU College of Engineering


Dr. Cano specializes in machine learning, data mining, classification, big data, data streams, and high-performance computing.



Alberto Cano is an Associate Professor with the Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, United States, where he heads the High-Performance Data Mining laboratory. His research is focused on machine learning, big data, data streams, concept drift, continual learning, GPUs and distributed computing. He is also the Faculty Director of the High Performance Research Computing Core Facility at VCU: https://hprc.vcu.edu/

Areas of Expertise (5)

Machine Learning

Data Mining


Big Data

High Performance Computing

Accomplishments (2)

Top 2% of most cited researchers in AI field by Stanford University ranking (professional)


Stanford University Scientist Rankings

Amazon Machine Learning Award (professional)


Hate Speech Detection on Amazon Reviews using Data Stream Mining on Spark and AWS

Education (5)

University of Granada, Spain: Ph.D., Computer Science 2014

University of Cordoba, Spain: M.Sc., Intelligent Systems 2013

University of Granada, Spain: M.Sc., Soft Computing and Intelligent Systems 2011

University of Cordoba, Spain: B.Sc., Computer Science 2010

University of Cordoba, Spain: B.Sc., Computer Engineering 2008

Research Grants (4)

MRI: Track 1 Acquisition of NVIDIA DGX H100 GPU system for research and education at VCU

National Science Foundation 




SentimentVoice: Integrating emotion AI and VR in Performing Arts

Commonwealth Cyber Initiative 


Integrating emotion AI and VR in Performing Arts

HPRC research computing clusters

State Council of Higher Education for Virginia 



Multi-Objective Optimization of Inlet Nozzle Design using Artificial Intelligence for Single Tank Thermal Energy Storage

VCU Accelerate 


VCU Accelerate Fund

Courses (2)

CMSC 508 - Databases

Database Theory

CMSC 603 - High Performance Distributed Systems

High Performance Distributed Systems

Selected Articles (9)

A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

Machine Learning

G. Aguiar, B. Krawczyk, and A. Cano


Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures and benchmarks on how to evaluate these algorithms. This work proposes a standardized, exhaustive, and comprehensive experimental framework to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to a large-scale experimental study comparing state-of-the-art classifiers in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and we provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental framework is fully reproducible and easy to extend with new methods. This way, we propose a standardized approach to conducting experiments in imbalanced data streams that can be used by other researchers to create complete, trustworthy, and fair evaluations of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.


ROSE: Robust Online Self-Adjusting Ensemble for Continual Learning on Imbalanced Drifting Data Streams

Machine Learning

A. Cano and B. Krawczyk


Data streams are potentially unbounded sequences of instances arriving over time to a classifier. Designing algorithms that are capable of dealing with massive, rapidly arriving information is one of the most dynamically developing areas of machine learning. Such learners must be able to deal with a phenomenon known as concept drift, where the data stream may be subject to various changes in its characteristics over time. Furthermore, distributions of classes may evolve over time, leading to a highly difficult non-stationary class imbalance. In this work we introduce Robust Online Self-Adjusting Ensemble (ROSE), a novel online ensemble classifier capable of dealing with all of the mentioned challenges. The main features of ROSE are: (1) online training of base classifiers on variable size random subsets of features; (2) online detection of concept drift and creation of a background ensemble for faster adaptation to changes; (3) sliding window per class to create skew-insensitive classifiers regardless of the current imbalance ratio; and (4) self-adjusting bagging to enhance the exposure of difficult instances from minority classes. The interplay among these features leads to an improved performance in various data stream mining benchmarks. An extensive experimental study comparing with 30 ensemble classifiers shows that ROSE is a robust and well-rounded classifier for drifting imbalanced data streams, especially under the presence of noise and class imbalance drift, while maintaining competitive time complexity and memory consumption. Results are supported by a thorough non-parametric statistical analysis.
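Feature (3) above, a sliding window per class, can be illustrated with a minimal sketch. This is not the authors' ROSE implementation: the class name `PerClassWindow`, its methods, and the `max_per_class` parameter are hypothetical names chosen for illustration. It shows only how keeping a separate bounded buffer for each class prevents a majority class from pushing minority instances out of memory.

```python
from collections import defaultdict, deque


class PerClassWindow:
    """Illustrative per-class sliding window (hypothetical, not the ROSE code).

    Each class label gets its own fixed-size buffer, so the most recent
    instances of a rare class are retained no matter how many majority
    instances arrive in between.
    """

    def __init__(self, max_per_class):
        if max_per_class <= 0:
            raise ValueError("max_per_class must be positive")
        self.max_per_class = max_per_class
        # A deque with maxlen silently drops its oldest item when full.
        self._windows = defaultdict(lambda: deque(maxlen=max_per_class))

    def add(self, x, y):
        """Store instance x in the window that belongs to label y."""
        self._windows[y].append(x)

    def counts(self):
        """Return the number of stored instances per class label."""
        return {label: len(window) for label, window in self._windows.items()}

    def instances(self):
        """Return all stored (instance, label) pairs, grouped by class."""
        return [(x, label)
                for label, window in self._windows.items()
                for x in window]
```

For example, after adding ten instances labeled "maj" and one labeled "min" to `PerClassWindow(3)`, the buffer holds the three most recent "maj" instances and the single "min" instance, so a learner trained on `instances()` sees a far less skewed sample than the raw stream.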


Kappa Updated Ensemble for Drifting Data Stream Mining

Machine Learning

A. Cano and B. Krawczyk


Learning from data streams in the presence of concept drift is among the biggest challenges of contemporary machine learning. Algorithms designed for such scenarios must take into account the potentially unbounded size of data, its constantly changing nature, and the requirement for real-time processing. Ensemble approaches for data stream mining have gained significant popularity, due to their high predictive capabilities and effective mechanisms for alleviating concept drift. In this paper, we propose a new ensemble method named Kappa Updated Ensemble (KUE). It is a combination of online and block-based ensemble approaches that uses the Kappa statistic for dynamic weighting and selection of base classifiers. In order to achieve a higher diversity among base learners, each of them is trained using a different subset of features and updated with new instances with a given probability following a Poisson distribution. Furthermore, we update the ensemble with new classifiers only when they contribute positively to the improvement of the quality of the ensemble. Finally, each base classifier in KUE is capable of abstaining from taking part in voting, thus increasing the overall robustness of KUE. An extensive experimental study shows that KUE is capable of outperforming state-of-the-art ensembles on standard and imbalanced drifting data streams while having a low computational complexity. Moreover, we analyze the use of Kappa vs accuracy to drive the criterion to select and update the classifiers, the contribution of the abstaining mechanism, the contribution of the diversification of classifiers, and the contribution of the hybrid architecture to update the classifiers in an online manner.
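The Kappa statistic mentioned above can be sketched in a few lines. The function below is the standard Cohen's kappa computed over a block of true labels and predictions. It is an illustrative sketch, not code from KUE itself, and the function name `cohen_kappa` and the convention of returning 0.0 when chance agreement is already perfect are assumptions made for this example.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions, corrected
    for the agreement expected by chance.

    Returns 1.0 for perfect agreement and about 0.0 for a classifier that
    does no better than chance given the class frequencies.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement is plain accuracy.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the marginal label and prediction frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Assumed convention: a single class everywhere carries no information.
        return 0.0
    return (observed - expected) / (1.0 - expected)
```

On an imbalanced block with nine instances of class 0 and one of class 1, a classifier that always predicts class 0 reaches 90% accuracy but a kappa of 0.0, which is why kappa is a more informative criterion than accuracy for weighting classifiers on imbalanced streams.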


Multi-label Punitive kNN with Self-Adjusting Memory for Drifting Data Streams

ACM Transactions on Knowledge Discovery from Data

M. Roseberry, B. Krawczyk, and A. Cano


In multi-label learning, data may simultaneously belong to more than one class. When multi-label data arrives as a stream, the challenges associated with multi-label learning are joined by those of data stream mining, including the need for algorithms that are fast and flexible, able to match both the speed and evolving nature of the stream. This paper presents a punitive k nearest neighbors algorithm with a self-adjusting memory (MLSAMPkNN) for multi-label, drifting data streams. The memory adjusts in size to contain only the current concept and a novel punitive system identifies and penalizes errant data examples early, removing them from the window. By retaining and using only data that are both current and beneficial, MLSAMPkNN is able to adapt quickly and efficiently to changes within the data stream while still maintaining a low computational complexity. Additionally, the punitive removal mechanism offers increased robustness to various data-level difficulties present in data streams, such as class imbalance and noise. The experimental study compares the proposal to 24 algorithms using 30 real-world and 15 artificial multi-label data streams on six multi-label metrics, evaluation time, and memory consumption. The superior performance of the proposed method is validated through non-parametric statistical analysis, proving both high accuracy and low time complexity. MLSAMPkNN is a versatile classifier, capable of returning excellent performance in diverse stream scenarios.


Evolving Rule-Based Classifiers with Genetic Programming on GPUs for Drifting Data Streams

Pattern Recognition

A. Cano and B. Krawczyk


Designing efficient algorithms for mining massive high-speed data streams has become one of the contemporary challenges for the machine learning community. Such models must display the highest possible accuracy and the ability to swiftly adapt to any kind of change, while at the same time being characterized by low time and memory complexities. However, little attention has been paid to designing learning systems that will allow us to gain a better understanding of incoming data. There are few proposals on how to design interpretable classifiers for drifting data streams, yet most of them are characterized by a significant trade-off between accuracy and interpretability. In this paper, we show that it is possible to have all of these desirable properties in one model. We introduce ERulesD2S: an evolving rule-based classifier for drifting data streams. By using grammar-guided genetic programming, we are able to obtain accurate sets of rules per class that are able to adapt to changes in the stream without a need for an explicit drift detector. Additionally, we augment our learning model with new proposals for rule propagation and data stream sampling, in order to maintain a balance between learning and forgetting of concepts. To improve the efficiency of mining massive and non-stationary data, we implement ERulesD2S parallelized on GPUs. A thorough experimental study on 30 datasets proves that ERulesD2S is able to efficiently adapt to any type of concept drift and outperform state-of-the-art rule-based classifiers, while using a small number of rules. At the same time, ERulesD2S is highly competitive with other single and ensemble learners in terms of accuracy and computational complexity, while offering fully interpretable classification rules. Additionally, we show that ERulesD2S can scale up efficiently to high-dimensional data streams, while offering very fast update and classification times. Finally, we present the learning capabilities of ERulesD2S for sparsely labeled data streams.


Interpretable Multi-view Early Warning System adapted to Underrepresented Student Populations

IEEE Transactions on Learning Technologies

A. Cano and J.D. Leonard


Early warning systems have been progressively implemented in higher education institutions to predict student performance. However, they usually fail at effectively integrating the many information sources available at universities to make more accurate and timely predictions, they often lack decision-making reasoning to explain the predictions, and they are generally biased toward the general student body, ignoring the idiosyncrasies of underrepresented student populations (determined by socio-demographic factors such as race, gender, residency, or status as freshman, transfer, adult, or first-generation students) that traditionally have greater difficulties and performance gaps. This paper presents a multiview early warning system built with comprehensible Genetic Programming classification rules adapted to specifically target underrepresented and underperforming student populations. The system integrates many student information repositories using multiview learning to improve the accuracy and timing of the predictions. Three interfaces have been developed to provide personalized and aggregated comprehensible feedback to students, instructors, and staff to facilitate early intervention and student support. Experimental results, validated with statistical analysis, indicate that this multiview learning approach outperforms traditional classifiers. Learning outcomes will help instructors and policy-makers to deploy strategies to increase retention and improve academics.


A survey on graphic processing unit computing for large-scale data mining

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery

A. Cano


General purpose computation using Graphic Processing Units (GPUs) is a well‐established research area focusing on high‐performance computing solutions for massively parallelizable and time‐consuming problems. Classical methodologies in machine learning and data mining cannot handle processing of massive and high‐speed volumes of information in the context of the big data era. GPUs have successfully improved the scalability of data mining algorithms to address significantly larger dataset sizes in many application areas. The popularization of distributed computing frameworks for big data mining opens up new opportunities for transformative solutions combining GPUs and distributed frameworks. This survey analyzes current trends in the use of GPU computing for large‐scale data mining, discusses GPU architecture advantages for handling volume and velocity of data, identifies limitation factors hampering the scalability of the problems, and discusses open issues and future directions.


Distributed Nearest Neighbor Classification for Large-Scale Multi-label Data on Spark

Future Generation Computer Systems

J. Gonzalez-Lopez, S. Ventura, and A. Cano


Modern data is characterized by its ever-increasing volume and complexity, particularly when data instances belong to many categories simultaneously. This learning paradigm is known as multi-label classification and one of its most renowned methods is the multi-label k nearest neighbor (Ml-knn). The traditional implementations of this method are not feasible for large-scale multi-label data due to its complexity and memory restrictions. We propose a distributed Ml-knn implementation based on the MapReduce programming model, implemented on Apache Spark. We compare three strategies for distributed nearest neighbor search: 1) iteratively broadcasting instances, 2) using a distributed tree-based index structure, and 3) building hash tables to group instances. The experimental study evaluates the trade-off between the quality of the predictions and runtimes on 22 benchmark datasets, and compares the scalability using different sizes of data. The results indicate that the tree-based index strategy outperforms the other approaches, having a speedup of up to 266x for the largest dataset, while achieving an accuracy equivalent to the exact methods. This strategy enables Ml-knn to scale efficiently with respect to the size of the problem.


MIRSVM: Multi-Instance Support Vector Machine with Bag Representatives

Pattern Recognition

G. Melki, A. Cano, and S. Ventura


Multiple-instance learning (MIL) is a variation of supervised learning, where samples are represented by labeled bags, each containing sets of instances. The individual labels of the instances within a bag are unknown, and labels are assigned based on a multi-instance assumption. One of the major complexities associated with this type of learning is the ambiguous relationship between a bag’s label and the instances it contains. This paper proposes a novel support vector machine (SVM) multiple-instance formulation and presents an algorithm with a bag-representative selector that trains the SVM based on bag-level information, named MIRSVM. The contribution is able to identify instances that highly impact classification, i.e. bag-representatives, for both positive and negative bags, while finding the optimal class separation hyperplane. Unlike other multi-instance SVM methods, this approach eliminates possible class imbalance issues by allowing both positive and negative bags to have at most one representative, which constitute the most contributing instances to the model. The experimental study evaluates and compares the performance of this proposal against 11 state-of-the-art multi-instance methods over 15 datasets, and the results are validated through non-parametric statistical analysis. The results indicate that bag-based learners outperform the instance-based and wrapper methods, and that MIRSVM achieves overall superior performance against other multi-instance SVM models, with an average accuracy of 82.6%, which is 2.5% better than the best performing state-of-the-art MI classifier.
