Alberto Cano is an Assistant Professor in the Department of Computer Science at Virginia Commonwealth University, USA, where he heads the High-Performance Data Mining Lab. He was previously a researcher at the University of Córdoba, Spain, as a member of the Knowledge Discovery and Intelligent Systems research group. His research focuses on soft computing, machine learning, data mining, general-purpose computing on graphics processing units (GPGPU), and parallel computing.
University of Granada, Spain: Ph.D., Computer Science 2014
University of Cordoba, Spain: M.Sc., Computer Science 2013
University of Granada, Spain: M.Sc., Computer Science 2011
Soft Computing and Intelligent Systems
University of Cordoba, Spain: B.Sc., Computer Science 2010
Selected Articles (5)
Association rule mining is one of the most common data mining techniques used to identify and describe interesting relationships between patterns from large datasets, the frequency of an association being defined as the number of transactions that it satisfies. In situations where each transaction includes an undetermined number of instances (for example, customers' shopping habits, where each transaction represents a different customer with a varying number of instances), the problem cannot be described as a traditional association rule mining problem. The aim of this work is to discover robust and useful patterns from multiple-instance datasets, that is, datasets where each transaction may include an undetermined number of instances. We propose a new problem formulation in the data mining framework: multiple-instance association rule mining. The problem definition, an algorithm to tackle the problem, the application fields, and the quality measures for the relations are formally described. Experimental results reveal the scalability of the proposal across different data dimensionalities. Finally, we apply it to two real-world application fields: (1) analysis of financial data gathered from one of the most important banks in Lithuania; (2) study of existing relations between records of unemployed people gathered from the Spanish public employment service.
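The idea of counting frequency per transaction rather than per instance can be illustrated with a minimal Python sketch. It assumes that a transaction (here called a bag) covers a pattern when at least one of its instances contains every item of the pattern; this is a common multiple-instance convention and may differ from the formal definition in the article. The names `bag_support`, `rule_confidence`, and the toy `bags` data are hypothetical.

```python
from typing import Dict, FrozenSet, List

Instance = FrozenSet[str]


def bag_support(bags: Dict[str, List[Instance]], pattern: FrozenSet[str]) -> float:
    """Fraction of bags with at least one instance containing the whole pattern.

    Assumption: one matching instance is enough for the bag to count.
    """
    if not bags:
        return 0.0
    covered = sum(
        1
        for instances in bags.values()
        if any(pattern <= instance for instance in instances)
    )
    return covered / len(bags)


def rule_confidence(
    bags: Dict[str, List[Instance]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> float:
    """Bag-level confidence of the rule antecedent -> consequent."""
    base = bag_support(bags, antecedent)
    if base == 0.0:
        return 0.0
    return bag_support(bags, antecedent | consequent) / base


# Toy data: each customer is one bag holding a varying number of baskets.
bags = {
    "customer_a": [frozenset({"bread", "milk"}), frozenset({"beer"})],
    "customer_b": [frozenset({"bread"})],
    "customer_c": [frozenset({"bread", "milk", "eggs"}), frozenset({"milk"})],
}

bread_support = bag_support(bags, frozenset({"bread"}))
bread_milk_confidence = rule_confidence(
    bags, frozenset({"bread"}), frozenset({"milk"})
)
```

Counting per bag keeps a customer with many baskets from dominating the frequency of a pattern, which is the situation a per-instance count would not handle.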
Multi-label learning is a challenging task in data mining which has attracted growing attention in recent years. Despite the fact that many multi-label datasets have continuous features, general algorithms developed specifically to transform multi-label datasets with continuous attribute values into a finite number of intervals have not been proposed to date. Many classification algorithms require discrete values as input, and studies have shown that supervised discretization may improve classification performance. This paper presents a Label-Attribute Interdependence Maximization (LAIM) discretization method for multi-label data. LAIM is inspired by the discretization heuristic of CAIM for single-label classification. The maximization of the label-attribute interdependence is expected to improve label prediction in data separated through disjoint intervals. The main aim of this paper is to present a discretization method specifically designed to deal with multi-label data and to analyze whether it can improve the performance of multi-label learning methods. To this end, the experimental analysis evaluates the performance of 12 multi-label learning algorithms (transformation, adaptation, and ensemble-based) on a series of 16 multi-label datasets with and without supervised and unsupervised discretization, showing that LAIM discretization improves the performance for many algorithms and measures.
Supervised discretization is one of the basic data preprocessing techniques used in data mining. CAIM (class-attribute interdependence maximization) is a discretization algorithm for data whose classes are known. However, newly arising challenges, such as the presence of unbalanced data sets, call for new algorithms capable of handling them in addition to balanced data. This paper presents a new discretization algorithm named ur-CAIM, which improves on the CAIM algorithm in three important ways. First, it generates more flexible discretization schemes while producing a small number of intervals. Second, the quality of the intervals is improved based on the class distribution of the data, which leads to better classification performance on balanced and, especially, unbalanced data. Third, the runtime of the algorithm is lower than CAIM's. The algorithm has been designed to be parameter-free, and it self-adapts to the problem complexity and the data class distribution. ur-CAIM was compared with 9 well-known discretization methods on 28 balanced and 70 unbalanced data sets. The results obtained were contrasted through non-parametric statistical tests, which show that our proposal outperforms CAIM and many of the other methods on both types of data, but especially on unbalanced data, which is a significant advantage.
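For readers unfamiliar with the baseline, the following Python sketch computes the original CAIM criterion for one candidate set of cut points: the average, over intervals, of the squared count of the dominant class divided by the interval size. It illustrates the criterion that ur-CAIM builds on, not ur-CAIM itself, whose modified criterion is not reproduced here. The function name `caim_score` and the toy data are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from typing import Hashable, List, Sequence


def caim_score(
    values: Sequence[float],
    labels: Sequence[Hashable],
    cut_points: Sequence[float],
) -> float:
    """Original CAIM criterion for one attribute and one discretization scheme.

    Intervals are (-inf, c1], (c1, c2], ..., (ck, +inf) for sorted cuts c1..ck.
    """
    cuts = sorted(cut_points)
    n_intervals = len(cuts) + 1
    # One class histogram per interval (a column of the quanta matrix).
    histograms: List[Counter] = [Counter() for _ in range(n_intervals)]
    for value, label in zip(values, labels):
        histograms[bisect_left(cuts, value)][label] += 1

    total = 0.0
    for histogram in histograms:
        size = sum(histogram.values())
        if size:  # empty intervals contribute nothing
            total += max(histogram.values()) ** 2 / size
    return total / n_intervals


values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]

clean_split = caim_score(values, labels, [3.5])  # separates the two classes
no_split = caim_score(values, labels, [])        # single interval
poor_split = caim_score(values, labels, [1.5])   # mixes classes on the right
```

A higher score indicates intervals that are purer with respect to the class, while the division by the number of intervals discourages schemes with many small intervals.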
The growing interest in data storage has caused data size to grow exponentially, hampering the process of knowledge discovery from these large volumes of high-dimensional and heterogeneous data. In recent years, many efficient algorithms for mining data associations have been proposed to address time and main memory requirements. Nevertheless, this mining process can still become hard when the number of items and records is extremely high. In this paper, the goal is not to propose new efficient algorithms but a new data structure that can be used by a variety of existing algorithms without modifying their original schema. Thus, our aim is to speed up the association rule mining process regardless of the algorithm used, enabling the performance of efficient implementations to be enhanced. The structure simplifies, reorganizes, and speeds up data access by sorting data by means of a shuffling strategy based on the Hamming distance, which places similar values closer together, and by combining an inverted index mapping with run-length encoding compression. In the experimental study, we explore the bounds of the algorithms' performance by using a large number of data sets that comprise either thousands or millions of both items and records. The results demonstrate the utility of the proposed data structure in improving the algorithms' runtime by orders of magnitude and substantially reducing both the auxiliary and the main memory requirements.
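The three ingredients named in this abstract (Hamming-based reordering, an inverted index, and run-length encoding) can be combined in a small Python sketch. It is only an illustration under stated assumptions: the greedy nearest-neighbour ordering stands in for the article's shuffling strategy, which is not detailed here, and all names (`hamming`, `greedy_order`, `inverted_rle_index`, `support`) are hypothetical. Each record is a bitmask in which bit i set means item i is present.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Run = Tuple[int, int]  # (first row position, number of consecutive rows)


def hamming(a: int, b: int) -> int:
    """Number of items on which two records differ."""
    return bin(a ^ b).count("1")


def greedy_order(rows: List[int]) -> List[int]:
    """Assumed ordering: follow each row with its closest remaining row."""
    if not rows:
        return []
    remaining = list(rows)
    ordered = [remaining.pop(0)]
    while remaining:
        last = ordered[-1]
        nearest = min(
            range(len(remaining)), key=lambda i: hamming(last, remaining[i])
        )
        ordered.append(remaining.pop(nearest))
    return ordered


def inverted_rle_index(rows: List[int], n_items: int) -> Dict[int, List[Run]]:
    """Map each item to run-length-encoded positions of the rows containing it."""
    index: Dict[int, List[Run]] = {item: [] for item in range(n_items)}
    for pos, row in enumerate(rows):
        for item in range(n_items):
            if (row >> item) & 1:
                runs = index[item]
                if runs and runs[-1][0] + runs[-1][1] == pos:
                    runs[-1] = (runs[-1][0], runs[-1][1] + 1)  # extend last run
                else:
                    runs.append((pos, 1))  # start a new run
    return index


def support(index: Dict[int, List[Run]], itemset: Iterable[int]) -> int:
    """Count rows containing every item (0 for an empty itemset in this sketch)."""
    covered: Optional[Set[int]] = None
    for item in itemset:
        positions = {
            pos for start, length in index[item] for pos in range(start, start + length)
        }
        covered = positions if covered is None else covered & positions
    return len(covered) if covered is not None else 0


records = [0b011, 0b100, 0b011, 0b111, 0b100]
raw_index = inverted_rle_index(records, n_items=3)
sorted_index = inverted_rle_index(greedy_order(records), n_items=3)
```

On this toy data, reordering merges the positions of each item into a single run, so the compressed index is shorter while the support of any itemset is unchanged.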
Early prediction of school dropout is a serious problem in education, and it is not an easy one to resolve. On the one hand, there are many factors that can influence student retention. On the other hand, the traditional classification approach used to solve this problem normally has to be applied at the end of the course, so that maximum information has been gathered and the highest accuracy can be achieved. In this paper, we propose a methodology and a specific classification algorithm to discover comprehensible prediction models of student dropout as early as possible. We used data gathered from 419 high school students in Mexico. We carried out several experiments to predict dropout at different steps of the course, to select the best indicators of dropout, and to compare our proposed algorithm against several well-known classical and imbalanced-data classification algorithms. Results show that our algorithm was capable of predicting student dropout within the first 4–6 weeks of the course and was trustworthy enough to be used in an early warning system.