Areas of Expertise (6)
Computer Security and Intrusion Detection Systems
Big Data Analytics
Biomedical and Health Informatics
Virginia Polytechnic Institute and State University: Ph.D.
Selected Media Appearances (3)
Artificial intelligence could be 'game changer' in detecting, managing Alzheimer's disease
"Machine learning has an inherent capacity to reveal meaningful patterns and insights from a large, complex inter-dependent array of clinical determinants and the ability to continue to 'learn' from ongoing utility of practical predictive models," said Taghi Khoshgoftaar, Ph.D., co-author and Motorola Professor in FAU's Department of Computer and Electrical Engineering and Computer Science. "Seamless use and real-time interpretation will enhance case management and patient care through innovative technology and practical and readily usable integrated clinical applications that could be developed into a hand-held device and app."...
Scientists teach machines to predict recovery time from sports-related concussions
"We have introduced a cutting-edge approach and new clinical tool to manage sports-related concussions, which will measurably improve with more and more inclusive data," said Taghi Khoshgoftaar, Ph.D., co-author and Motorola professor in FAU's Department of Computer and Electrical Engineering and Computer Science, who collaborated with lead author Michael F. Bergeron, Ph.D., senior vice president of development and applications at SIVOTEC Analytics, and Sara Landset, co-author and a Ph.D. student at FAU. "Our supervised machine learning method has demonstrated efficacy and warrants further exploration."...
Artificial Intelligence Holds Promise in Detecting Home Health Medicare Fraud
Home Health Care News
The team applied algorithms to detect patterns of fraud in the Centers for Medicare & Medicaid Services (CMS) data because “patterns in the data are hidden from us” as humans, said Taghi Khoshgoftaar, Florida Atlantic University director of Data Mining and Machine Learning Lab in the Department of Computer and Electrical Engineering and Computer Science...
Selected Articles (5)
T Hasanin, TM Khoshgoftaar, JL Leevy, N Seliya
High class imbalance between majority and minority classes in datasets can skew the performance of Machine Learning algorithms and bias predictions in favor of the majority (negative) class. This bias, for cases where the minority (positive) class is of greater interest and the occurrence of false negatives is costlier than false positives, may result in adverse consequences. Our paper presents two case studies, each utilizing a unique, combined approach of Random Undersampling and Feature Selection to investigate the effect of class imbalance on big data analytics. Random Undersampling is used to generate six class distributions ranging from balanced to moderately imbalanced, and Feature Importance is used as our Feature Selection method. Classification performance was reported for the Random Forest, Gradient-Boosted Trees, and Logistic Regression learners, as implemented within the Apache Spark framework. The first case study utilized a training dataset and a test dataset from the ECBDL’14 bioinformatics competition. The training and test datasets contain about 32 million instances and 2.9 million instances, respectively. For the first case study, Gradient-Boosted Trees obtained the best results, with either a feature set of 60 or the full set, and a negative-to-positive ratio of either 45:55 or 40:60. The second case study, unlike the first, included training data from one source (POST dataset) and test data from a separate source (Slowloris dataset), where POST and Slowloris are two types of Denial of Service attacks. The POST dataset contains about 1.7 million instances, while the Slowloris dataset contains about 0.2 million instances. For the second case study, Logistic Regression obtained the best results, with a feature set of 5 and any of the following negative-to-positive ratios: 40:60, 45:55, 50:50, 65:35, and 75:25.
We conclude that combining Feature Selection with Random Undersampling improves the classification performance of learners with imbalanced big data from different application domains.
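As a rough illustration of the two steps the abstract combines, the sketch below undersamples the negative class to a chosen negative-to-positive ratio and keeps the top-ranked features by importance. It is a minimal single-machine sketch, not the authors' Apache Spark pipeline: the function names, the toy labels, and the importance scores are all hypothetical.

```python
import random

def random_undersample(labels, neg_to_pos=(50, 50), seed=0):
    """Keep every positive (1) and a random subset of negatives (0).

    neg_to_pos is the desired negative:positive ratio, e.g. (65, 35).
    Returns the sorted indices of the retained instances.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    neg_share, pos_share = neg_to_pos
    # Number of negatives needed to reach the requested ratio.
    target_neg = min(len(neg), round(len(pos) * neg_share / pos_share))
    return sorted(pos + rng.sample(neg, target_neg))

def top_k_features(importances, k):
    """Return the k feature names with the highest importance score."""
    return sorted(importances, key=importances.get, reverse=True)[:k]

# Toy data (hypothetical): 20 positives among 1,000 instances.
labels = [1] * 20 + [0] * 980
idx = random_undersample(labels, neg_to_pos=(65, 35))
n_pos = sum(labels[i] for i in idx)
n_neg = len(idx) - n_pos

# Hypothetical importance scores, e.g. as reported by a tree ensemble.
importances = {"f1": 0.40, "f2": 0.05, "f3": 0.30, "f4": 0.25}
selected = top_k_features(importances, 2)
```

Only the majority class is sampled here, so no minority instance is discarded; in practice the importance scores would come from a fitted learner rather than a hand-written dictionary.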
G Castaneda, P Morris, TM Khoshgoftaar
This study investigates the effectiveness of multiple maxout activation function variants on 18 datasets using Convolutional Neural Networks. A network with maxout activation has a higher number of trainable parameters compared to networks with traditional activation functions. However, it is not clear whether the activation function itself or the increase in the number of trainable parameters is responsible for yielding the best performance for different entity recognition tasks. This paper investigates whether an increase in the number of convolutional filters on traditional activation functions performs equal to or better than maxout networks. Our experiments compare the Rectified Linear Unit, Leaky Rectified Linear Unit, Scaled Exponential Linear Unit, and Hyperbolic Tangent activations to four maxout function variants. We observe that maxout networks train more slowly than networks with traditional activation functions, e.g., Rectified Linear Unit. In addition, we found that on average, across all datasets, the Rectified Linear Unit activation function performs better than any maxout activation when the number of convolutional filters is increased. Furthermore, adding more filters enhances the classification accuracy of the Rectified Linear Unit networks, without adversely affecting their advantage over maxout activations with respect to network-training speed.
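To make the parameter-count point concrete, the sketch below shows a single maxout unit in its basic dense form: the unit takes the maximum over k affine pieces, so it carries k times the weights of a one-piece unit followed by ReLU. This is an illustrative sketch only; the paper studies convolutional maxout variants, and the helper names and toy numbers here are assumptions.

```python
def relu_unit(inputs, weights, bias):
    """One affine piece followed by ReLU: max(0, w.x + b)."""
    return max(0.0, sum(w * x for w, x in zip(weights, inputs)) + bias)

def maxout_unit(inputs, weights, biases):
    """Maximum over k affine pieces: max_j (w_j.x + b_j).

    weights is a list of k weight vectors, biases a list of k scalars.
    """
    return max(
        sum(w * x for w, x in zip(w_j, inputs)) + b_j
        for w_j, b_j in zip(weights, biases)
    )

def parameter_count(n_inputs, pieces=1):
    """Trainable parameters of one unit: (weights + bias) per piece."""
    return pieces * (n_inputs + 1)

x = [2.0, -1.0]
w = [0.5, 0.25]

# A two-piece maxout whose second piece is identically zero behaves like ReLU.
as_relu = relu_unit(x, w, 0.0)
as_maxout = maxout_unit(x, [w, [0.0, 0.0]], [0.0, 0.0])

relu_params = parameter_count(len(x))              # one affine piece
maxout_params = parameter_count(len(x), pieces=2)  # two affine pieces
```

The equivalence in the example is why the abstract's question is a fair one: a maxout unit can represent ReLU, but it pays for the extra flexibility with additional parameters per unit.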
CL Calvert, TM Khoshgoftaar
The integrity of modern network communications is constantly being challenged by more sophisticated intrusion techniques. Attackers are consistently shifting to stealthier and more complex forms of attacks in an attempt to bypass known mitigation strategies. In recent years, attackers have begun to focus their attack efforts on the application layer, allowing them to produce attacks that can exploit known issues within specific application protocols. Slow HTTP Denial of Service attacks are one such attack variant, which targets the HTTP protocol and can imitate legitimate user traffic in order to deny resources from a service. Successful mitigation of this attack type requires network analysts to evaluate large quantities of network traffic to identify and block intrusive traffic. The issue is that the number of legitimate traffic instances can far outnumber the number of attack instances, making detection problematic. Machine learning techniques can be used to aid in detection, but the high level of imbalance between normal (majority) and attack (minority) instances can lead to inaccurate detection results. In this work, we evaluate the use of data sampling to produce varying class distributions in order to counteract the effects of severely imbalanced Slow HTTP DoS big datasets. We also detail our process for collecting real-world representative Slow HTTP DoS attack traffic from a live network environment to create our datasets. Five class distributions are generated to evaluate the Slow HTTP DoS detection performance of eight machine learning techniques. Our results show that the optimal learner and class distribution combination is that of Random Forest with a 65:35 distribution ratio, obtaining an AUC value of 0.99904. Further, we determine through the use of significance testing that the use of sampling techniques can significantly increase learner performance when detecting Slow HTTP DoS attack traffic.
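The AUC figure quoted above can be read as the probability that a randomly chosen attack instance receives a higher score than a randomly chosen normal instance. The sketch below computes AUC directly from that definition on a hypothetical handful of scores; it is a pairwise illustration, not the evaluation code used in the study, and it would be too slow for big datasets.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive instance has the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores; label 1 = attack, 0 = normal traffic.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
value = auc(scores, labels)
```

Because AUC depends only on how positives rank against negatives, it is a common choice for imbalanced data, where plain accuracy can look high even if every attack is missed.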
Journal of Big Data
In this paper, we comprehensively explain how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis. The algorithm was previously unavailable on that platform. Random Forest’s learning process is based on the principle of recursive partitioning and, although recursion per se is not allowed in ECL (HPCC’s programming language), we were able to implement the recursive partitioning algorithm as an iterative split/partition process. In addition, we analyze the flaws found in our initial implementation and thoroughly describe all the modifications required to overcome the bottleneck within the iterative split/partition process, i.e., the optimization of the data gathering of selected independent variables which are used for the node’s best-split analysis. Essentially, we describe how our initial Random Forest implementation has been optimized and has become an efficient distributed machine learning implementation for Big Data. By taking full advantage of the HPCC Systems Platform’s Big Data processing and analytics capabilities, we succeed in enhancing the data gathering method from an inefficient Pass them All and Filter approach into an effective and completely parallelized Fetching on Demand approach. Finally, based upon the results of our learning process runtime comparison between these two approaches, we confirm the speedup of our optimized Random Forest implementation.
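The idea of replacing recursion with an iterative split/partition process can be sketched with an explicit work queue: each pending node is taken from the queue, split if possible, and its two partitions are queued in turn. The Python below is an illustrative single-tree sketch under that assumption. It is not the authors' ECL implementation, it omits the bootstrap sampling and random feature selection that turn trees into a Random Forest, and all names and the toy data are hypothetical.

```python
from collections import deque

def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows):
    """Return (feature_index, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini([y for _, y in rows])
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree_iteratively(rows, max_depth=3):
    """Grow a decision tree with a queue of pending nodes, no recursion."""
    tree = {}  # node_id -> ("split", feature, threshold) or ("leaf", label)
    queue = deque([(0, rows, 0)])
    while queue:
        node_id, node_rows, depth = queue.popleft()
        split = best_split(node_rows) if depth < max_depth else None
        if split is None:
            labels = [y for _, y in node_rows]
            tree[node_id] = ("leaf", int(2 * sum(labels) >= len(labels)))
            continue
        f, t = split
        tree[node_id] = ("split", f, t)
        # Partition the node's rows and queue both children.
        queue.append((2 * node_id + 1, [r for r in node_rows if r[0][f] <= t], depth + 1))
        queue.append((2 * node_id + 2, [r for r in node_rows if r[0][f] > t], depth + 1))
    return tree

def predict(tree, x):
    """Walk from the root to a leaf, again without recursion."""
    node_id = 0
    while tree[node_id][0] == "split":
        _, f, t = tree[node_id]
        node_id = 2 * node_id + 1 if x[f] <= t else 2 * node_id + 2
    return tree[node_id][1]

# Toy one-feature dataset: (features, label).
rows = [((1.0,), 0), ((2.0,), 0), ((3.0,), 1), ((4.0,), 1)]
tree = build_tree_iteratively(rows)
```

Processing the queue in first-in, first-out order grows the tree level by level, which is one way a split/partition step can be applied to many nodes per pass; whether the HPCC implementation proceeds in exactly this order is not stated in the abstract.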
X Su, TM Khoshgoftaar
As one of the most successful approaches to building recommender systems, collaborative filtering (CF) uses the known preferences of a group of users to make recommendations or predictions of the unknown preferences for other users. In this paper, we first introduce CF tasks and their main challenges, such as data sparsity, scalability, synonymy, gray sheep, shilling attacks, privacy protection, etc., and their possible solutions. We then present three main categories of CF techniques: memory-based, model-based, and hybrid CF algorithms (that combine CF with other recommendation techniques), with examples for representative algorithms of each category, and analysis of their predictive performance and their ability to address the challenges. From basic techniques to the state-of-the-art, we attempt to present a comprehensive survey of CF techniques, which can serve as a roadmap for research and practice in this area.
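For readers unfamiliar with the memory-based category, the sketch below shows one common form: user-based CF that predicts an unknown rating as a similarity-weighted average of other users' ratings. The cosine similarity over co-rated items, the function names, and the toy ratings are assumptions chosen for illustration, not a method taken from the survey.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two users over the items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item.

    Returns None when no other user with a nonzero similarity rated it.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else None

# Toy ratings (hypothetical): user -> {item: rating on a 1-5 scale}.
ratings = {
    "ann": {"a": 5, "b": 1},
    "bob": {"a": 5, "b": 1, "c": 4},
    "cat": {"a": 1, "b": 5, "c": 2},
}
predicted = predict_rating(ratings, "ann", "c")
```

In this toy example "ann" agrees with "bob" far more than with "cat", so the prediction for item "c" is pulled toward bob's rating. The sparsity and scalability challenges listed in the abstract show up directly in this approach, since every prediction scans all other users and depends on co-rated items existing.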