Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences

2011.08.17 Academic report：Human Action Recognition

Date：16-08-2011

Speaker： Mubarak Shah (IEEE Fellow, Agere Chair Professor at the University of Central Florida, USA)

Time： 9:30 am, August 17, 2011 (Wednesday)

Venue: Conference Room of Optical Image Analysis and Learning Center (OPTIMAL)

Speaker Profile：

　　　　Dr. Mubarak Shah, Agere Chair Professor of Computer Science, is the founding director of the Computer Vision Lab at UCF. He is a co-author of three books, all published by Springer: Motion-Based Recognition (1997), Video Registration (2003), and Automated Multi-Camera Surveillance: Algorithms and Practice (2008). He has published extensively on topics including visual surveillance, tracking, human activity and action recognition, object detection and categorization, shape from shading, geo-registration, photorealistic synthesis, visual crowd analysis, and biomedical imaging. Dr. Shah is a fellow of IEEE, IAPR, AAAS and SPIE. In 2006, he received the Pegasus Professor award, the highest award at UCF, given to a faculty member who has made a significant impact on the university, has made an extraordinary contribution to the university community, and has demonstrated excellence in teaching, research and service. He is an ACM Distinguished Speaker. He was an IEEE Distinguished Visitor speaker from 1997 to 2000 and received the IEEE Outstanding Engineering Educator Award in 1997. He also received the Harris Corporation's Engineering Achievement Award in 1999; TOKTEN awards from UNDP in 1995, 1997, and 2000; Teaching Incentive Program awards in 1995 and 2003; Research Incentive Awards in 2003 and 2009; Millionaires' Club awards in 2005, 2006, and 2009; the University Distinguished Researcher award in 2007; the SANA award in 2007; and an honorable mention for the ICCV 2005 Where Am I? Challenge Problem. He was nominated for the best paper award at the ACM Multimedia Conference in 2005. He is an editor of an international book series on Video Computing, editor-in-chief of the journal Machine Vision and Applications, and an associate editor of ACM Computing Surveys. He was an associate editor of the IEEE Transactions on PAMI and a guest editor of the special issue of the International Journal of Computer Vision on Video Computing. He was the program co-chair of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

Summary：

　　　　In this talk, I will give an overview of three recent approaches we have proposed for recognizing human actions. First, I will present an approach that employs concepts from chaotic systems theory to model and analyze the nonlinear dynamics of human actions. Trajectories of human joints serve as a representation of the nonlinear dynamical system that generates the action of interest. Each trajectory is then used to reconstruct a phase space by employing a delay-embedding scheme. The properties of the reconstructed phase space are captured in terms of chaotic invariants: Lyapunov exponent, correlation integral and correlation dimension. Finally, a feature vector that combines these invariants over all joint trajectories represents the action.
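As an illustration only, and not taken from the talk, the delay-embedding step and two of the named invariants can be sketched in a few lines of Python. The embedding dimension `dim`, the delay `tau`, the radii, and the sine-wave "trajectory" are all assumed values chosen for the example; the two-radius slope is a crude stand-in for a proper correlation-dimension fit, and the Lyapunov exponent is omitted.

```python
import math


def delay_embed(series, dim, tau):
    """Reconstruct phase-space points from a 1-D series by time-delay embedding.

    Point i is (x[i], x[i + tau], ..., x[i + (dim - 1) * tau]).
    """
    n_points = len(series) - (dim - 1) * tau
    if n_points <= 0:
        raise ValueError("series too short for this dim and tau")
    return [tuple(series[i + k * tau] for k in range(dim)) for i in range(n_points)]


def correlation_integral(points, radius):
    """Fraction of distinct point pairs whose Euclidean distance is below radius."""
    n = len(points)
    close = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < radius:
                close += 1
    return 2.0 * close / (n * (n - 1))


def correlation_dimension(points, r_small, r_large):
    """Crude estimate: slope of log C(r) against log r between two radii."""
    c_small = correlation_integral(points, r_small)
    c_large = correlation_integral(points, r_large)
    return (math.log(c_large) - math.log(c_small)) / (
        math.log(r_large) - math.log(r_small)
    )


# Hypothetical stand-in for one coordinate of one joint trajectory.
trajectory = [math.sin(0.3 * t) for t in range(200)]

points = delay_embed(trajectory, dim=3, tau=5)   # assumed dim and tau
c_small = correlation_integral(points, 0.2)      # assumed radii
c_large = correlation_integral(points, 0.8)
dim_estimate = correlation_dimension(points, 0.2, 0.8)

# One possible per-trajectory feature; a full descriptor would concatenate
# such values over all joint trajectories.
feature = (c_small, c_large, dim_estimate)
```

In practice the slope would be fitted over a range of radii rather than two, and `dim` and `tau` would be chosen from the data rather than fixed by hand.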

　　　　Several approaches to action recognition, including the above approach, require object detection followed by frame-to-frame tracking. Recently, the computer vision community has shown considerable interest in the bag-of-video-words approach (mid-level features), which bypasses the object detection and tracking steps. Second, I will present an approach for learning a visual vocabulary for human action recognition using diffusion maps embedding. Our approach is inspired by the conjecture that mid-level features, in the hierarchy of appearance features, produced by similar sources lie on a certain manifold. The goal then is to embed these mid-level features into a lower-dimensional, semantically meaningful space, while retaining the geometric structure in terms of similarity between features. Finally, clustering in this low-dimensional space forms high-level feature groups.
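The embed-then-cluster idea can be sketched as follows. This is a minimal example under stated assumptions, not the method presented in the talk: the Gaussian kernel, its width `sigma`, the toy 2-D "mid-level features", and the use of k-means with farthest-point initialisation are all choices made here for illustration.

```python
import numpy as np


def diffusion_map(features, sigma, n_components, t=1):
    """Embed rows of `features` using the leading non-trivial diffusion coordinates."""
    # Pairwise squared Euclidean distances and Gaussian affinities (assumed kernel).
    sq = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-sq / (2.0 * sigma ** 2))
    degree = kernel.sum(axis=1)
    # Symmetric normalisation; it shares eigenvalues with the Markov matrix D^-1 K.
    sym = kernel / np.sqrt(np.outer(degree, degree))
    eigvals, eigvecs = np.linalg.eigh(sym)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Convert to right eigenvectors of the Markov matrix.
    psi = eigvecs / np.sqrt(degree)[:, None]
    # Drop the trivial constant eigenvector (eigenvalue 1).
    return psi[:, 1:n_components + 1] * eigvals[1:n_components + 1] ** t


def kmeans(points, k, n_iter=20):
    """Plain Lloyd iterations with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return labels


# Hypothetical toy features: two small groups standing in for mid-level features.
offsets = np.array([[dx, dy] for dx in (-0.2, 0.0, 0.2) for dy in (-0.2, 0.0, 0.2)])
features = np.vstack([offsets, offsets + np.array([3.0, 0.0])])

embedding = diffusion_map(features, sigma=1.0, n_components=2)
labels = kmeans(embedding, k=2)  # each cluster plays the role of a high-level group
```

In this toy case the first diffusion coordinate separates the two groups, so clustering in the embedded space recovers them; real video-word features would need a data-driven choice of `sigma` and of the number of clusters.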

　　　　Action recognition approaches, including the above two approaches, suffer from many drawbacks in practice, which include (1) the inability to cope with incremental recognition problems; (2) the requirement of an intensive training stage to obtain good performance; (3) the inability to recognize simultaneous multiple actions; and (4) difficulty in performing recognition frame by frame. Third, to overcome these drawbacks using a single approach, I will present a novel framework involving the feature-tree to index large-scale motion features using the Sphere/Rectangle-tree (SR-tree). The recognition consists of the following two steps: (1) recognizing the local features by non-parametric nearest neighbor (NN), and (2) using a simple voting strategy to label the action.
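The two recognition steps can be sketched as below. This is only an illustration under assumptions: a brute-force linear scan stands in for the SR-tree index, and the 2-D feature vectors and the labels "walk" and "wave" are made up for the example.

```python
import math
from collections import Counter


def nearest_label(query, indexed_features):
    """Step 1: label a local feature by its nearest indexed neighbour.

    A linear scan is used here in place of an SR-tree lookup.
    """
    _, label = min(indexed_features, key=lambda item: math.dist(item[0], query))
    return label


def recognize(query_features, indexed_features):
    """Step 2: each local feature casts one vote; the majority label wins."""
    votes = Counter(nearest_label(q, indexed_features) for q in query_features)
    return votes.most_common(1)[0][0], votes


# Hypothetical labelled motion features (feature vector, action label).
indexed = [
    ((0.0, 0.1), "walk"),
    ((0.2, 0.0), "walk"),
    ((1.0, 1.1), "wave"),
    ((0.9, 1.0), "wave"),
]

# Hypothetical local features extracted from a query clip.
queries = [(0.1, 0.1), (0.0, 0.2), (1.0, 0.9)]

action, votes = recognize(queries, indexed)
# Two features vote "walk" and one votes "wave", so the clip is labelled "walk".
```

Because no model is trained, new labelled features can simply be appended to `indexed`, and votes can be accumulated over any window of frames, which is one way to read the incremental and frame-by-frame points above.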