Project:
Improving Data Quality and Data Mining Using Noisy Micro-Outsourcing (NSF IIS-1115417)
PI: Victor S. Sheng (ssheng@uca.edu), University of Central Arkansas
This project explores novel techniques that take advantage of the low cost of micro-outsourcing using systems such as Amazon's mechanical Turk, to engage a large number of workers from around the world for acquiring the labels of instances to be used to construct the training corpus. There is currently little understanding of how to utilize the multiple noisy labels obtained using micro-outsourcing. There is a need for advanced techniques for taking advantage of the low cost of micro-outsourcing in order to improve data quality and the quality of models built from the available data. It explores novel approaches for utilizing multiple labels given to an instance by different labelers. It also extends active learning techniques for active selection of samples to be labeled to take into account the multi-sets of labels that have been already obtained from a pool of labelers.
Advances in techniques for active selection of data instances to be labeled in a micro-outsourcing setting can significantly improve the quality of data used to build predictive models in a broad range of applications, including gene annotation, image annotation, text classification, sentiment analysis, and recommender systems, where unlabeled data are plentiful yet labeled data are sparse. The project will provide research opportunities for students at University of Central Arkansas, a primarily undergraduate institution and help expand the STEM pipeline.