2011.03.24 Academic Report: Active Labeling, Cleansing, and Concept Summarization for Data Streams

Date: Mar 23, 2011
   Speaker: Xingquan Zhu, Professor, University of Technology, Sydney
   Time: 9:30 am, March 24, 2011
   Venue: Conference Room of Optical Image Analysis and Learning Center (OPTIMAL)

  Speaker Profile: 
   Xingquan Zhu is a full professor at the Centre for Quantum Computation and Intelligent Systems, University of Technology, Sydney, and a recipient of the Australian ARC Future Fellowship (Level 2). He received his B.S. and M.S. degrees in Communication Engineering from Xidian University, and his Ph.D. degree in Computer Science from Fudan University, Shanghai, China, in 2001. He was a Postdoctoral Associate with the Department of Computer Science, Purdue University, West Lafayette, USA, from 2001 to 2002; a Research Assistant Professor with the Department of Computer Science, University of Vermont, USA, from 2002 to 2006; a tenure-track Assistant Professor with the Department of Computer Science and Engineering, Florida Atlantic University, USA, from 2006 to 2009; and an Associate Professor at the Faculty of Engineering and Information Technology, University of Technology, Sydney, from 2009 to 2010. Since 2000, he has published more than 110 refereed journal and conference papers. His research focuses on data mining, machine learning, multimedia systems, and bioinformatics. Dr. Zhu has been an Associate Editor of the IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE) since 2009, and is a Program Committee Co-chair for the 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2011) and the 9th International Conference on Machine Learning and Applications (ICMLA 2010). He also serves (or has served) as program vice chair, finance chair, publicity co-chair, and program committee member for many international conferences, including KDD, ICDM, and CIKM.

  Summary: 
   This talk will summarize three stream data mining problems that we have addressed in recent years: active labeling, cleansing, and concept summarization. The eventual goal is to build accurate classification and summarization models from large-volume stream data with limited labeling effort. For active labeling, since labeling all stream data is expensive and impractical, our objective is to label a small portion of the stream from which an accurate prediction model can be derived to predict future instances. For incorrectly labeled training samples in data streams, we propose a Maximum Variance Margin principle to identify and remove mislabeled data, so that prediction models built from the cleansed stream are more accurate than those trained from the raw data. For vague labeling in data streams, we allow users to label instance groups, instead of single instances, as positive samples for learning, and our essential goal is to recover the concepts behind the labeling process. Experimental results on synthetic and real-world data demonstrate the performance of the proposed approaches in comparison with simple baseline methods.


Download: