Academic Talk by Dr. Victor S. Sheng (盛胜利), University of Central Arkansas, USA

Source: School of Information Science and Technology  Author: Chen Fan  Date: 2018-07-27

Speaker: Victor S. Sheng (盛胜利), University of Central Arkansas, USA
Time: Monday, July 30, 2018, 9:30 a.m.
Venue: Room X322, Building 9, Xipu Campus
Host: Chen Fan
 
Title:  Data Quality and Data Mining with Crowdsourcing
 
Abstract: Crowdsourcing systems provide convenient platforms for collecting human intelligence on a variety of tasks (e.g., labeling objects) from a vast pool of independent workers (a crowd). Compared with traditional expert labeling, crowdsourcing is more efficient and cost-effective, but the quality of any single labeler cannot be guaranteed. To take advantage of the low cost of crowdsourcing, it is common to obtain multiple labels per object (i.e., repeated labeling) from the crowd. In this talk, we outline our research on crowdsourcing from three aspects: (1) crowdsourcing mechanisms, specifically repeated-labeling strategies; (2) ground truth inference, specifically noise correction after inference and the biased wisdom of the crowd; and (3) learning from crowdsourced data.
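
As a rough illustration of repeated labeling, the sketch below integrates several noisy labels per object by simple majority voting. It is a minimal example under stated assumptions, not the speaker's method: the object names, the labels, and the `majority_vote` helper are all hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among the repeated labels of one object."""
    counts = Counter(labels)
    # most_common(1) returns [(label, count)]; on a tie, the label seen first wins.
    return counts.most_common(1)[0][0]

# Hypothetical crowd labels: each object was labeled by several workers.
crowd_labels = {
    "img_1": ["cat", "cat", "dog"],
    "img_2": ["dog", "dog", "dog"],
    "img_3": ["cat", "dog", "dog", "dog", "cat"],
}

# One integrated label per object.
integrated = {obj: majority_vote(lbls) for obj, lbls in crowd_labels.items()}
print(integrated)
```

Majority voting treats all workers as equally reliable, which is the kind of simplification that more elaborate ground truth inference methods try to relax.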

We first present repeated-labeling strategies of increasing complexity for obtaining multiple labels. Repeatedly labeling a carefully chosen set of points is generally preferable, and we recommend a robust technique that combines different notions of uncertainty to select the data points that should receive more labels. Recent research on crowdsourcing focuses on deriving an integrated label from multiple noisy labels via expectation-maximization-based (EM-based) ground truth inference. We present a novel framework that introduces noise correction techniques to further improve the quality of the integrated labels obtained after ground truth inference. We further show that biased labeling is a systematic tendency. State-of-the-art ground truth inference algorithms do not handle biased labeling well, whereas our simple consensus algorithm performs much better. Finally, we present pairwise solutions that maximize the utility of multiple noisy labels for learning. Pairwise solutions completely avoid the potential bias introduced by ground truth inference. Because they consider both sides (the potentially correct and the potentially incorrect information), they perform very well whether few or many labels are available.
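
To make the idea of uncertainty-driven selection concrete, here is a minimal sketch that ranks objects by the entropy of the labels they have received so far and picks the most uncertain ones for additional labeling. Entropy is used here only as a simple stand-in for one notion of uncertainty; it is an assumption of this sketch, not the combined technique described in the talk. The function names and the example data are hypothetical.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the observed label distribution for one object."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def pick_next(crowd_labels, k=1):
    """Return the k objects whose current labels disagree the most."""
    ranked = sorted(
        crowd_labels,
        key=lambda obj: label_entropy(crowd_labels[obj]),
        reverse=True,  # highest entropy (most disagreement) first
    )
    return ranked[:k]

# Hypothetical labels collected so far.
example = {
    "a": ["pos", "pos", "pos"],   # full agreement, entropy 0
    "b": ["pos", "neg"],          # even split, entropy 1 bit
    "c": ["pos", "pos", "neg"],   # partial disagreement
}
print(pick_next(example, k=2))
```

A score based only on label disagreement ignores how many labels an object already has and what a learned model predicts for it, which is one reason to combine several notions of uncertainty.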

 
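About the speaker: Victor S. Sheng (盛胜利) is currently a tenured associate professor in the Department of Computer Science at the University of Central Arkansas, USA, and director of its Data Analytics Laboratory. His main research areas are data mining and machine learning, artificial intelligence, data security, and decision support, along with their applications in business, bioinformatics, medical informatics, software engineering, and other fields. His research in these areas has solved practical problems arising in applications, opened up several new research directions, and stands at the international forefront of the field. He has served as principal or co-principal investigator on 12 grants, including grants from the U.S. National Science Foundation and the Natural Sciences and Engineering Research Council of Canada. He has published more than 130 research papers in top data mining and machine learning journals such as TPAMI, TKDE, JMLR, TMM, TNNLS, and DMKD, and at top conferences such as IJCAI, KDD, ICML, AAAI, ECML, ICDM, DASFAA, ACM MM, ICMR, ICME, and CIKM. More than 30 of these appeared in journals and conferences ranked Class A by the CCF. According to Google Scholar, his most-cited paper has more than 820 citations, and his work has been cited more than 3,890 times in total. His honors include the Best Paper Award and the Outstanding Contribution Award at ICCCS 2018; Best Student Paper Award finalist at WISE 2015; the Best Paper Award at ICDM 2011; Best Paper Award runner-up at KDD 2008; the Google Student Award at a machine learning symposium in 2008; and the Best Poster Award at the 2006 IEEE Kitchener-Waterloo joint workshop on knowledge and data mining. His research results have been applied to software development, medical diagnosis, and online sentiment analysis. He has also collaborated with world-class companies on technology commercialization and joint research, for example working with Sun Microsystems to turn research results into software products and receiving funding from IBM for research on data quality.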