Dr. Jaroslaw Szlichta
Assistant Professor, Faculty of Science | University of Ontario Institute of Technology
Oshawa, ON, CA
Award-winning big data analytics expert examines big data cleaning to improve accuracy of predictors and trends
With vast amounts of data moving across global networks, big data analytics is expanding quickly, not only as a means to process and understand abundant information but also as a key method for predicting trends in social and economic behaviour. While data availability continues to grow, the challenge lies in ensuring its accuracy.
Human error produces ‘dirty data’, which triggers incorrect analytics and leads to inaccurate business decisions. Dr. Jaroslaw Szlichta, an Assistant Professor in the Faculty of Science at the University of Ontario Institute of Technology (UOIT), focuses on data analytics, business intelligence and big data cleaning. His latest research aims to improve the rate of clean data, which would significantly improve data accuracy and lead to more precise data analytics predictions and trends.
Awarded a post-doctoral fellowship by Mitacs Elevate in 2014, Dr. Szlichta’s research focused on big data integration and continuous data cleaning. He developed an algorithm to automatically integrate and clean all data before any analytics were performed to ensure more accurate outcomes. In 2013, Dr. Szlichta was appointed post-doctoral fellow in the Department of Computer Science at the University of Toronto, before joining UOIT in July 2014. He brings award-winning, big data analytics expertise to the university and has developed an undergraduate course on the subject.
Applying his interest in math to computer science, Dr. Szlichta earned his Master of Science in Engineering from the Faculty of Electronics and Information Science at the Warsaw University of Technology in Warsaw, Poland, in 2009, and received his Doctorate in Computer Science from the Department of Computer Science and Engineering at York University in Toronto, Ontario, in 2013. During his doctoral studies, he was appointed to a three-year research fellowship at the IBM Centre for Advanced Studies (CAS) in Markham, Ontario, and in 2012 he received the IBM CAS Research Student of the Year Award.
A former software developer for Comarch Research & Development in Warsaw, he developed the WYSIWYG reporting system OCEAN GenRap, a novel data analytics reporting solution. Recognized for his collaborative work, Dr. Szlichta received the prestigious CeBIT Business Award. He is also a member of the Big Data Benchmark Community, a global community group aimed at developing a data set that may be used as a benchmark for evaluating research.
Industry Expertise (3)
- Information Technology and Services
Areas of Expertise (13)
Mitacs Elevate Post-doctoral Fellowship Program (professional)
Awarded $57,000 over one year to support his research, Dr. Szlichta focused on big data integration and continuous data cleaning.
Post-doctoral Fellow, Department of Computer Science, University of Toronto (professional)
Appointed post-doctoral fellow to continue his research into big data analytics and data cleaning.
IBM CAS Research Student of the Year Award (professional)
Awarded to a student who has shown outstanding insight and perspective that has contributed to IBM in a matter of great importance. During his research fellowship, Dr. Szlichta worked closely with IBM on order dependencies in databases, and proved to be a key resource in developing a prototype to exploit order optimization in DB2, and to optimize date predicates using generated subqueries.
IBM CAS Research Fellowship (professional)
During his doctoral studies, Dr. Szlichta was appointed to a three-year research fellowship and awarded $102,000 within this highly competitive worldwide program, which honours exceptional doctoral students who have an interest in solving problems that are important in practice (and to IBM) and fundamental to innovation in many academic disciplines and areas of study.
2007 CeBIT Business Award (professional)
Awarded to Dr. Szlichta for his collaborative work in designing and implementing the OCEAN GenRap system, an innovative data analytics reporting solution. CeBIT is the world's largest international computer exhibition.
York University: PhD, Computer Science 2013
Warsaw University of Technology: MSE, Computer Science 2009
Event Appearances (7)
Expressiveness and Complexity of Order Dependencies
University of Waterloo, Database Research Group Meeting Waterloo, Ontario
Fundamentals of Order Optimization
Invited Talk, University of Waterloo Waterloo, Ontario
Fundamentals and Applications of Order Dependencies
University of Toronto DB Seminar Toronto, Ontario
Chasing Order Dependencies
Invited Talk, Carleton University Ottawa, Ontario
Applications for Order Dependencies in IBM DB2
Invited Talk, IBM Research Almaden San Jose, California
Optimizing Business-Intelligence Queries in DB2 with Order Dependencies
Invited Talk, AT&T Labs New Jersey, United States
Queries With Dates
Invited Talk, Warsaw University of Technology Warsaw, Poland
Research Grants (1)
Big Data Cleaning
NSERC Discovery Grant $90,000
Dr. Szlichta is the primary investigator on this five-year, international research project, which focuses on big data cleaning in partnership with the University of Waterloo, Ontario, AT&T in New York, and IBM CAS in Markham, Ontario.
Computers and Media
1200U, 1st Year, Undergraduate Course (Elective)
Software Design and Analysis
2040U, 2nd Year, Undergraduate Course
Database Systems and Concepts
3030U, 3rd Year, Undergraduate Course
Big Data Analytics
4030U, 4th Year, Undergraduate Course
Advanced Topics in Information Science
CSCI 6720G, Graduate Course
SIGMOD '14, Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
This research demonstrates MeanKS, a new system for meaningful keyword search over relational databases. The system first captures the user's interest by determining the roles of the keywords. Then, it uses schema-based ranking to rank join trees that cover the keyword roles. The ranking is based on the relevance of relations and foreign-key relationships in the schema over the information content of the database.
2014 IEEE 30th International Conference on Data Engineering (ICDE)
In declarative data cleaning, data semantics are encoded as constraints and errors arise when the data violates the constraints. This research introduces a continuous data cleaning framework that can be applied to dynamic data and constraint environments. The approach permits both the data and its semantics to evolve and suggests repairs based on the accumulated evidence to date. Importantly, the approach uses not only the data and constraints as evidence, but also considers the past repairs chosen and applied by a user (user repair preferences).
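The first idea in this abstract, that semantics are encoded as constraints and errors are data that violate them, can be illustrated with a minimal sketch. It assumes the constraint is a functional dependency (the abstract does not name a constraint type), and the function name `fd_violations` and the sample rows are made up for illustration. It covers only violation detection, not the repair suggestion or the use of past user repairs described in the paper.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that violate the functional dependency lhs -> rhs.

    A group violates the dependency when rows agree on every lhs attribute
    but disagree on at least one rhs attribute.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)

    violations = {}
    for key, members in groups.items():
        rhs_values = {tuple(r[a] for a in rhs) for r in members}
        if len(rhs_values) > 1:  # same lhs, conflicting rhs: a constraint error
            violations[key] = members
    return violations


# Hypothetical sample data: the second row contains a typo in "city".
rows = [
    {"postal_code": "A1A 1A1", "city": "Oshawa"},
    {"postal_code": "A1A 1A1", "city": "Oshwa"},
    {"postal_code": "B2B 2B2", "city": "Toronto"},
]
dirty = fd_violations(rows, ["postal_code"], ["city"])
```

Here `dirty` holds the two conflicting rows for the first postal code. A cleaning framework would then have to decide whether to repair the data or revise the constraint.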
Journal Proceedings of the VLDB Endowment
Dependencies play an important role in databases. This work studies order dependencies (ODs) and unidirectional order dependencies (UODs), a proper sub-class of ODs, which describe the relationships among lexicographical orderings of sets of tuples. Lexicographical ordering, as produced by the order-by operator in SQL, is considered because it is the notion of order used in SQL and within query optimization. The main goal is to investigate the inference problem for ODs, both in theory and in practice. We show the usefulness of ODs in query optimization.
Journal Proceedings of the VLDB Endowment
Dependencies have played a significant role in database design for many years. They have also been shown to be useful in query optimization. This paper discusses dependencies between lexicographically ordered sets of tuples. It formally introduces the concept of order dependency and presents a set of axioms (inference rules) for them. Additionally, it shows how query rewrites based on these axioms can be used for query optimization.
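As a rough illustration of what an order dependency states, the sketch below checks one against a small table. It assumes the usual reading that a list of attributes X orders a list Y when, for every pair of tuples s and t, s preceding or equalling t lexicographically on X implies the same on Y. The function name `holds_order_dependency` and the sample rows are hypothetical. The pairwise check is quadratic and meant only to mirror the definition, not the inference procedures studied in the paper.

```python
from itertools import permutations


def holds_order_dependency(rows, xs, ys):
    """Check whether the attribute list xs orders the attribute list ys.

    Python compares tuples lexicographically, which matches the order
    produced by SQL's ORDER BY on a list of columns.
    """
    def key(row, attrs):
        return tuple(row[a] for a in attrs)

    return all(
        key(s, ys) <= key(t, ys)
        for s, t in permutations(rows, 2)
        if key(s, xs) <= key(t, xs)
    )


# Hypothetical sample rows; ISO date strings sort in chronological order.
rows = [
    {"date": "2013-12-30", "year": 2013, "month": 12},
    {"date": "2014-01-02", "year": 2014, "month": 1},
    {"date": "2014-01-20", "year": 2014, "month": 1},
]
date_orders_year_month = holds_order_dependency(rows, ["date"], ["year", "month"])
month_orders_year = holds_order_dependency(rows, ["month"], ["year"])
```

On this data, `date` orders `[year, month]`, so a result already sorted by `date` would not need a further sort by `year, month`. By contrast, `month` does not order `year`.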
Proceedings of the 14th International Conference on Extending Database Technology
Data warehouses are repositories of electronically stored data which are designed to support reporting and analysis. The analysis of historical data often involves aggregation over time. Thus, time is critical in the design of a data warehouse. This research describes novel techniques for storing date information and optimization of queries that reference the date dimension. It shows how to embed intelligence into the date key and how to exploit monotonic dependencies. This research presents the value of these techniques for the improvement of performance when combined with partitioning and indexes.
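One way to picture "embedding intelligence into the date key" is a surrogate key whose numeric order follows calendar order. The sketch below assumes a YYYYMMDD integer encoding, which the abstract does not specify, and the names `date_key` and `key_range` are hypothetical. Because the key is monotonic in the date, a date-range predicate can be turned into a key-range predicate on the fact table without consulting the date dimension.

```python
from datetime import date


def date_key(d):
    """Encode a date as an integer key of the form YYYYMMDD (assumed encoding)."""
    return d.year * 10000 + d.month * 100 + d.day


def key_range(start, end):
    """Translate an inclusive date range into an inclusive key range.

    This is valid because date_key is strictly increasing in the date:
    a later date always produces a larger key.
    """
    return date_key(start), date_key(end)


low, high = key_range(date(2013, 7, 1), date(2013, 9, 30))
```

With such a key, a filter on dates between 1 July and 30 September 2013 becomes a filter on keys between `low` and `high`, which can use an index or partitioning on the key column.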