Project: Improving Data Quality and Data Mining Using Noisy Micro-Outsourcing (NSF IIS-1115417)

PI: Victor S. Sheng (ssheng@uca.edu), University of Central Arkansas

Abstract

Machine learning currently offers one of the most cost-effective approaches to building predictive models (e.g., classifiers for categorizing the millions of messages, news articles, and blogs that are generated every day). However, the effective use of machine learning methods in such settings is limited by the availability of a training corpus (i.e., a representative set of instances that have been labeled with the corresponding categories). In domains where labeled data are scarce or expensive to acquire, there is an urgent need for cost-effective approaches to selectively acquiring labels for the data samples used to train predictive models.

This project explores novel techniques that take advantage of the low cost of micro-outsourcing, using systems such as Amazon's Mechanical Turk, to engage a large number of workers from around the world in labeling the instances used to construct the training corpus. There is currently little understanding of how to utilize the multiple noisy labels obtained through micro-outsourcing, and advanced techniques are needed to exploit its low cost in order to improve data quality and the quality of models built from the available data. The project explores novel approaches for utilizing the multiple labels given to an instance by different labelers. It also extends active learning techniques, which actively select the samples to be labeled, to take into account the multisets of labels that have already been obtained from a pool of labelers.

Advances in techniques for active selection of data instances to be labeled in a micro-outsourcing setting can significantly improve the quality of data used to build predictive models in a broad range of applications, including gene annotation, image annotation, text classification, sentiment analysis, and recommender systems, where unlabeled data are plentiful yet labeled data are sparse. The project will provide research opportunities for students at the University of Central Arkansas, a primarily undergraduate institution, and help expand the STEM pipeline.

Publications

  • Sheng, V.S., Provost, F., Simple Multiple Noisy Label Utilization Strategies, Proceedings of the 2011 IEEE International Conference on Data Mining, December 11-14, Vancouver, Canada. To appear. (Regular paper acceptance rate: 12.3%).
  • Sheng, V.S., Studying Active Learning in the Cost-sensitive Framework, Proceedings of the 45th Hawaii International Conference on System Sciences (HICSS-45), January 4-7, 2012, Grand Wailea, Maui, Hawaii, USA. To appear.
  • Sheng, V.S., Example Labeling Difficulty within Repeated Labeling, Proceedings of the 7th International Conference on Data Mining (DMIN11), 301-307, August 18-21, Las Vegas, Nevada, USA. (Acceptance rate: 24%).
  • Sheng, V.S., Tada, R., Atla, A., An Empirical Study of Noise Impacts on Supervised Learning Algorithms and Measures, Proceedings of the 7th International Conference on Data Mining (DMIN11), 266-272, August 18-21, Las Vegas, Nevada, USA. (Acceptance rate: 24%).
  • Sheng, V.S., Fast Data Acquisition in Cost-Sensitive Learning, Proceedings of the 11th Industrial Conference on Data Mining (ICDM11), 66-77, New York. (Best Paper Award). (Acceptance rate: less than 24%).
  • Sheng, V.S., Tada, R., Boosting Inspired Process for Improving AUC, Proceedings of the 7th International Conference on Machine Learning and Data Mining (MLDM11), 199-209, New York. (Acceptance rate: less than 26%).

Join the Project

Please send your resume, transcripts, and a personal statement, along with your schedule, to Dr. Sheng at ssheng@uca.edu. You are also welcome to drop by his office, MSCT313, to discuss this project. We will train you first and pay you based on your contributions.