I have been a PhD student in the Perceptual User Interfaces Group since October 2013.
education
I received my Master's degree from Beihang University (2013) and my Bachelor's degree from the Honors Program at China Agricultural University, Beijing, China (2010). From 2011 to 2013, I was a research intern at the Institute of Automation, Chinese Academy of Sciences, under the supervision of Prof. Stan Z. Li.
research interests
I am interested in human-computer interaction based on computer vision, in particular appearance-based gaze estimation and gaze-based attentive user interfaces.
awards
- 2017, Best Paper Honorable Mention Award, ACM Symposium on User Interface Software and Technology (UIST 2017)
- 2016, Best Paper Honorable Mention Award, ACM Symposium on User Interface Software and Technology (UIST 2016)
- 10/2013–5/2015, International Max Planck Research School for Computer Science (IMPRS-CS) Scholarship, Germany
- 2011, First-class Graduate Scholarship, Beihang University, China
- 2008, Second-class Undergraduate Grant, China Agricultural University, China
external activities
Reviewer for Journals: TPAMI 2017
Reviewer for Conferences: ETRA 2016, UIST 2016, Augmented Human 2017, IWBF 2017, UIST 2017, Multimedia 2017
publications
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling. "MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation." IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41(1), pp. 162–175, 2019. doi: 10.1109/TPAMI.2017.2778103. PDF: https://perceptual.mpi-inf.mpg.de/files/2018/04/zhang18_pami.pdf

Abstract: Learning-based methods are believed to work well for unconstrained gaze estimation, i.e. gaze estimation from a monocular RGB camera without assumptions regarding user, environment, or camera. However, current gaze datasets were collected under laboratory conditions and methods were not evaluated across multiple datasets. Our work makes three contributions towards addressing these limitations. First, we present the MPIIGaze dataset, which contains 213,659 full face images and corresponding ground-truth gaze positions collected from 15 users during everyday laptop use over several months. An experience sampling approach ensured continuous gaze and head poses and realistic variation in eye appearance and illumination. To facilitate cross-dataset evaluations, 37,667 images were manually annotated with eye corners, mouth corners, and pupil centres. Second, we present an extensive evaluation of state-of-the-art gaze estimation methods on three current datasets, including MPIIGaze. We study key challenges including target gaze range, illumination conditions, and facial appearance variation. We show that image resolution and the use of both eyes affect gaze estimation performance, while head pose and pupil centre information are less informative. Finally, we propose GazeNet, the first deep appearance-based gaze estimation method. GazeNet improves on the state of the art by 22% (from a mean error of 13.9 degrees to 10.8 degrees) for the most challenging cross-dataset evaluation.
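For readers less familiar with appearance-based gaze estimation, the minimal PyTorch sketch below shows the general shape of such a regressor: a small convolutional network that maps a normalized grey-scale eye patch plus head-pose angles to a 2D gaze direction. It is only an illustration under assumed layer sizes and a 36x60 input; it is not the paper's GazeNet architecture.

```python
import torch
import torch.nn as nn

class GazeRegressorSketch(nn.Module):
    """Toy appearance-based gaze regressor: grey-scale 36x60 eye patch plus
    2D head pose in, 2D gaze angles (yaw, pitch) out. Layer sizes are
    illustrative assumptions, not the paper's GazeNet architecture."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc1 = nn.Linear(50 * 6 * 12, 500)   # 36x60 input -> 6x12 feature maps
        self.fc2 = nn.Linear(500 + 2, 2)         # appended head pose -> gaze angles

    def forward(self, eye_image, head_pose):
        # eye_image: (B, 1, 36, 60); head_pose: (B, 2) yaw/pitch in radians
        x = self.conv(eye_image).flatten(1)
        x = torch.relu(self.fc1(x))
        x = torch.cat([x, head_pose], dim=1)
        return self.fc2(x)

# usage sketch:
# model = GazeRegressorSketch()
# gaze = model(torch.zeros(8, 1, 36, 60), torch.zeros(8, 2))  # -> (8, 2)
```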
Xucong Zhang. "Gaze Estimation and Interaction in Real-World Environments." PhD thesis, 2018. PDF: https://wp.mpi-inf.mpg.de/perceptual/files/2018/09/Dissertation_Xucong-Zhang.pdf

Abstract: Human eye gaze has been widely used in human-computer interaction, as it is a promising modality for natural, fast, pervasive, and non-verbal interaction between humans and computers. As the foundation of gaze-related interactions, gaze estimation has been a hot research topic in recent decades. In this thesis, we focus on developing appearance-based gaze estimation methods and corresponding attentive user interfaces with a single webcam for challenging real-world environments. First, we collect a large-scale gaze estimation dataset, MPIIGaze, the first of its kind, outside of controlled laboratory conditions. Second, we propose an appearance-based method that, in stark contrast to a long-standing tradition in gaze estimation, only takes the full face image as input. Third, we study data normalisation for the first time in a principled way, and propose a modification that yields significant performance improvements. Fourth, we contribute an unsupervised detector for human-human and human-object eye contact. Finally, we study personal gaze estimation with multiple personal devices, such as mobile phones, tablets, and laptops.
Xucong Zhang, Yusuke Sugano, Andreas Bulling. "Revisiting Data Normalization for Appearance-Based Gaze Estimation." Proc. International Symposium on Eye Tracking Research and Applications (ETRA), pp. 12:1–12:9, 2018. doi: 10.1145/3204493.3204548. PDF: https://perceptual.mpi-inf.mpg.de/files/2018/04/zhang18_etra.pdf

Abstract: Appearance-based gaze estimation is promising for unconstrained real-world settings, but the significant variability in head pose and user-camera distance poses significant challenges for training generic gaze estimators. Data normalization was proposed to cancel out this geometric variability by mapping input images and gaze labels to a normalized space. Although used successfully in prior works, the role and importance of data normalization remains unclear. To fill this gap, we study data normalization for the first time using principled evaluations on both simulated and real data. We propose a modification to the current data normalization formulation by removing the scaling factor and show that our new formulation performs significantly better (between 9.5% and 32.7%) in the different evaluation settings. Using images synthesized from a 3D face model, we demonstrate the benefit of data normalization for the efficiency of the model training. Experiments on real-world images confirm the advantages of data normalization in terms of gaze estimation performance.
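To make the normalization step concrete, here is a minimal sketch of the usual image-side procedure, assuming a 3D eye centre in camera coordinates and an OpenCV-style camera matrix. The focal length, canonical distance, and patch size are illustrative, and the comment about rotating gaze labels by R only reflects my reading of the paper's modification.

```python
import cv2
import numpy as np

def normalize_eye_patch(image, eye_center, camera_matrix,
                        focal_norm=960.0, dist_norm=600.0, size=(60, 36)):
    """Hedged sketch of gaze data normalization: virtually rotate the camera to
    face the 3D eye centre and scale it to a fixed distance, then warp the
    image accordingly. Parameter values are illustrative, not the paper's."""
    eye_center = np.asarray(eye_center, dtype=float)
    dist = np.linalg.norm(eye_center)
    z = eye_center / dist                              # new camera z-axis: towards the eye
    x = np.cross(np.array([0.0, 1.0, 0.0]), z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                            # rotation into the normalized frame
    S = np.diag([1.0, 1.0, dist_norm / dist])          # scale to the canonical distance
    C_norm = np.array([[focal_norm, 0.0, size[0] / 2],
                       [0.0, focal_norm, size[1] / 2],
                       [0.0, 0.0, 1.0]])
    W = C_norm @ S @ R @ np.linalg.inv(camera_matrix)  # perspective warp for the image
    patch = cv2.warpPerspective(image, W, size)
    # Per the paper's modified formulation (as I understand it), gaze labels are
    # rotated by R only, without the scaling matrix S.
    return patch, R
```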
Philipp Müller, Michael Xuelin Huang, Xucong Zhang, Andreas Bulling. "Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour." Proc. International Symposium on Eye Tracking Research and Applications (ETRA), pp. 31:1–31:10, 2018. doi: 10.1145/3204493.3204549. PDF: https://perceptual.mpi-inf.mpg.de/files/2018/04/mueller18_etra.pdf

Abstract: Eye contact is one of the most important non-verbal social cues and fundamental to human interactions. However, detecting eye contact without specialized eye tracking equipment poses significant challenges, particularly for multiple people in real-world settings. We present a novel method to robustly detect eye contact in natural three- and four-person interactions using off-the-shelf ambient cameras. Our method exploits that, during conversations, people tend to look at the person who is currently speaking. Harnessing the correlation between people's gaze and speaking behaviour therefore allows our method to automatically acquire training data during deployment and adaptively train eye contact detectors for each target user. We empirically evaluate the performance of our method on a recent dataset of natural group interactions and demonstrate that it achieves a relative improvement over the state-of-the-art method of more than 60%, and also improves over a head pose based baseline.
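The core weak-labelling idea ("listeners tend to look at the speaker") can be sketched as follows; the feature representation, classifier choice, and function names are my assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_eye_contact_detector(gaze_features, speaker_ids, target_id):
    """Hedged sketch of the weak-labelling idea only: frames in which the
    target person speaks are treated as positive examples of the observer
    looking at that person, frames in which someone else speaks as negatives.
    Features and classifier are illustrative, not the paper's pipeline."""
    labels = (np.asarray(speaker_ids) == target_id).astype(int)
    detector = LogisticRegression(max_iter=1000)
    detector.fit(np.asarray(gaze_features), labels)  # gaze_features: (N, D) per frame
    return detector
```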
Seonwook Park, Xucong Zhang, Andreas Bulling, Otmar Hilliges. "Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings." Proc. International Symposium on Eye Tracking Research and Applications (ETRA), pp. 21:1–21:10, 2018 (best presentation award). doi: 10.1145/3204493.3204545. PDF: https://perceptual.mpi-inf.mpg.de/files/2018/04/park18_etra.pdf

Abstract: Conventional feature-based and model-based gaze estimation methods have proven to perform well in settings with controlled illumination and specialized cameras. In unconstrained real-world settings, however, such methods are surpassed by recent appearance-based methods due to difficulties in modeling factors such as illumination changes and other visual artifacts. We present a novel learning-based method for eye region landmark localization that enables conventional methods to be competitive with the latest appearance-based methods. Despite having been trained exclusively on synthetic data, our method exceeds the state of the art for iris localization and eye shape registration on real-world imagery. We then use the detected landmarks as input to iterative model-fitting and lightweight learning-based gaze estimation methods. Our approach outperforms existing model-fitting and appearance-based methods in the context of person-independent and personalized gaze estimation.
Xucong Zhang, Michael Xuelin Huang, Yusuke Sugano, Andreas Bulling. "Training Person-Specific Gaze Estimators from Interactions with Multiple Devices." Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 624:1–624:12, 2018. doi: 10.1145/3173574.3174198. PDF: https://perceptual.mpi-inf.mpg.de/files/2018/02/zhang18_chi.pdf

Abstract: Learning-based gaze estimation has significant potential to enable attentive user interfaces and gaze-based interaction on the billions of camera-equipped handheld devices and ambient displays. While training accurate person- and device-independent gaze estimators remains challenging, person-specific training is feasible but requires tedious data collection for each target device. To address these limitations, we present the first method to train person-specific gaze estimators across multiple devices. At the core of our method is a single convolutional neural network with shared feature extraction layers and device-specific branches that we train from face images and corresponding on-screen gaze locations. Detailed evaluations on a new dataset of interactions with five common devices (mobile phone, tablet, laptop, desktop computer, smart TV) and three common applications (mobile game, text editing, media center) demonstrate the significant potential of cross-device training. We further explore training with gaze locations derived from natural interactions, such as mouse or touch input.
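The shared-trunk, device-specific-branch idea can be sketched as a small PyTorch module; the device names, layer sizes, and trunk design here are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiDeviceGazeSketch(nn.Module):
    """Sketch of the cross-device idea: one shared face-feature trunk plus one
    small regression branch per device that maps features to 2D on-screen
    gaze. Layer sizes and device names are illustrative."""
    def __init__(self, devices=("phone", "tablet", "laptop")):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.branches = nn.ModuleDict(
            {d: nn.Linear(64, 2) for d in devices}  # per-device gaze regressors
        )

    def forward(self, face_image, device):
        return self.branches[device](self.trunk(face_image))

# usage sketch: MultiDeviceGazeSketch()(torch.zeros(4, 3, 96, 96), "phone") -> (4, 2)
```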
Xucong Zhang, Yusuke Sugano, Andreas Bulling. "Everyday Eye Contact Detection Using Unsupervised Gaze Target Discovery." Proc. of the ACM Symposium on User Interface Software and Technology (UIST), pp. 193–203, 2017 (best paper honourable mention award). doi: 10.1145/3126594.3126614. PDF: https://perceptual.mpi-inf.mpg.de/files/2017/05/zhang17_uist.pdf

Abstract: Eye contact is an important non-verbal cue in social signal processing and promising as a measure of overt attention in human-object interactions and attentive user interfaces. However, robust detection of eye contact across different users, gaze targets, camera positions, and illumination conditions is notoriously challenging. We present a novel method for eye contact detection that combines a state-of-the-art appearance-based gaze estimator with a novel approach for unsupervised gaze target discovery, i.e. without the need for tedious and time-consuming manual data annotation. We evaluate our method in two real-world scenarios: detecting eye contact at the workplace, including on the main work display, from cameras mounted to target objects, as well as during everyday social interactions with the wearer of a head-mounted egocentric camera. We empirically evaluate the performance of our method in both scenarios and demonstrate its effectiveness for detecting eye contact independent of target object type and size, camera position, and user and recording environment.
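A rough sketch of the gaze target discovery step, under the simplifying assumption that gaze estimates aimed at the target object form the densest cluster; the clustering method and its parameters are illustrative, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def discover_eye_contact_frames(gaze_directions, eps=0.05, min_samples=50):
    """Hedged sketch of unsupervised gaze target discovery: gaze estimates from
    a camera mounted on or near the target object cluster tightly around it,
    so the densest cluster is taken as the eye-contact samples."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(gaze_directions)
    clusters, counts = np.unique(labels[labels >= 0], return_counts=True)
    if len(clusters) == 0:
        return np.zeros(len(gaze_directions), dtype=bool)
    target = clusters[np.argmax(counts)]
    return labels == target            # boolean mask of likely eye-contact frames
```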
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling. "It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation." Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2299–2308, 2017. doi: 10.1109/CVPRW.2017.284. PDF: https://wp.mpi-inf.mpg.de/perceptual/files/2017/11/zhang_cvprw2017-6.pdf

Abstract: Eye gaze is an important non-verbal cue for human affect analysis. Recent gaze estimation work indicated that information from the full face region can benefit performance. Pushing this idea further, we propose an appearance-based method that, in contrast to a long-standing line of work in computer vision, only takes the full face image as input. Our method encodes the face image using a convolutional neural network with spatial weights applied on the feature maps to flexibly suppress or enhance information in different facial regions. Through extensive evaluation, we show that our full-face method significantly outperforms the state of the art for both 2D and 3D gaze estimation, achieving improvements of up to 14.3% on MPIIGaze and 27.7% on EYEDIAP for person-independent 3D gaze estimation. We further show that this improvement is consistent across different illumination conditions and gaze directions and particularly pronounced for the most challenging extreme head poses.
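The spatial-weights mechanism can be sketched as a small module that predicts a single-channel weight map from the feature maps and multiplies it back in; the exact layer configuration here is an assumption, not the paper's.

```python
import torch
import torch.nn as nn

class SpatialWeightsSketch(nn.Module):
    """Sketch of the spatial-weights idea: predict a (B, 1, H, W) weight map
    from the feature maps and multiply it back in, so the network can
    emphasise or suppress facial regions. Layer sizes are illustrative."""
    def __init__(self, channels):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1), nn.ReLU(),
            nn.Conv2d(channels // 2, 1, kernel_size=1),
        )

    def forward(self, feats):                    # feats: (B, C, H, W)
        w = torch.relu(self.weight_net(feats))   # non-negative spatial weight map
        return feats * w                         # broadcast over the channel dim
```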
Yusuke Sugano, Xucong Zhang, Andreas Bulling. "AggreGaze: Collective Estimation of Audience Attention on Public Displays." Proc. of the ACM Symposium on User Interface Software and Technology (UIST), pp. 821–831, 2016 (best paper honourable mention award). doi: 10.1145/2984511.2984536. PDF: https://perceptual.mpi-inf.mpg.de/files/2016/09/sugano16_uist.pdf

Abstract: Gaze is frequently explored in public display research given its importance for monitoring and analysing audience attention. However, current gaze-enabled public display interfaces require either special-purpose eye tracking equipment or explicit personal calibration for each individual user. We present AggreGaze, a novel method for estimating spatio-temporal audience attention on public displays. Our method requires only a single off-the-shelf camera attached to the display, does not require any personal calibration, and provides visual attention estimates across the full display. We achieve this by 1) compensating for errors of state-of-the-art appearance-based gaze estimation methods through on-site training data collection, and by 2) aggregating uncalibrated and thus inaccurate gaze estimates of multiple users into joint attention estimates. We propose different visual stimuli for this compensation: a standard 9-point calibration, moving targets, text and visual stimuli embedded into the display content, as well as normal video content. Based on a two-week deployment in a public space, we demonstrate the effectiveness of our method for estimating attention maps that closely resemble ground-truth audience gaze distributions.
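The aggregation step can be sketched as accumulating many noisy on-screen gaze estimates into a smoothed 2D histogram; the display size, bin counts, and smoothing below are illustrative parameters, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_map(gaze_points, display_wh=(1920, 1080), bins=(64, 36), sigma=1.5):
    """Hedged sketch of the aggregation idea: accumulate many uncalibrated
    on-screen gaze estimates from different viewers into a 2D histogram and
    smooth it into a normalised attention map."""
    xs, ys = np.asarray(gaze_points, dtype=float).T   # gaze_points: (N, 2) in pixels
    hist, _, _ = np.histogram2d(
        xs, ys, bins=bins, range=[[0, display_wh[0]], [0, display_wh[1]]]
    )
    heat = gaussian_filter(hist, sigma=sigma)
    return heat / max(heat.sum(), 1e-9)               # sums to 1 over the display
```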
Daniel Pohl, Xucong Zhang, Andreas Bulling. "Combining Eye Tracking with Optimizations for Lens Astigmatism in Modern Wide-Angle HMDs." Proc. of the IEEE Conference on Virtual Reality (VR), pp. 269–270, 2016. doi: 10.1109/VR.2016.7504757. PDF: https://perceptual.mpi-inf.mpg.de/files/2016/01/Pohl16_VR.pdf

Abstract: Virtual Reality has hit the consumer market with affordable head-mounted displays. When using these, it quickly becomes apparent that the resolution of the built-in display panels still needs to be increased substantially. To overcome the resulting higher performance demands, eye tracking can be used for foveated rendering. However, as there are lens distortions in HMDs, there are more possibilities to increase the performance with smarter rendering approaches. We present a new system using optimizations for rendering considering lens astigmatism and combining this with foveated rendering through eye tracking. Depending on the current eye gaze, this delivers a rendering speed-up of up to 20%.
Daniel Pohl, Xucong Zhang, Andreas Bulling, Oliver Grau. "Concept for Using Eye Tracking in a Head-mounted Display to Adapt Rendering to the User's Current Visual Field." Proc. of the 22nd ACM Conference on Virtual Reality Software and Technology (VRST), pp. 323–324, 2016. doi: 10.1145/2993369.2996300. PDF: https://perceptual.mpi-inf.mpg.de/wp-content/blogs.dir/12/files/2016/11/pohl2016_vrst.pdf

Abstract: With increasing spatial and temporal resolution in head-mounted displays (HMDs), using eye trackers to adapt rendering to the user is becoming important for handling the rendering workload. Besides using methods like foveated rendering, we propose to use the current visual field for rendering, depending on the eye gaze. We use two effects for performance optimizations. First, we noticed a lens defect in HMDs, where depending on the distance of the eye gaze to the center, certain parts of the screen towards the edges are not visible anymore. Second, if the user looks up, they cannot see the lower parts of the screen anymore. For the invisible areas, we propose to skip rendering and to reuse the pixel colors from the previous frame. We provide a calibration routine to measure these two effects. We apply the current visual field to a renderer and get up to 2x speed-ups.
Marc Tonsen, Xucong Zhang, Yusuke Sugano, Andreas Bulling. "Labelled Pupils in the Wild: A Dataset for Studying Pupil Detection in Unconstrained Environments." Proc. of the 9th ACM International Symposium on Eye Tracking Research & Applications (ETRA 2016), pp. 139–142, 2016. doi: 10.1145/2857491.2857520. PDF: https://perceptual.mpi-inf.mpg.de/files/2016/01/tonsen16_etra.pdf

Abstract: We present labelled pupils in the wild (LPW), a novel dataset of 66 high-quality, high-speed eye region videos for the development and evaluation of pupil detection algorithms. The videos in our dataset were recorded from 22 participants in everyday locations at about 95 FPS using a state-of-the-art dark-pupil head-mounted eye tracker. They cover people of different ethnicities and a diverse set of everyday indoor and outdoor illumination environments, as well as natural gaze direction distributions. The dataset also includes participants wearing glasses, contact lenses, and make-up. We benchmark five state-of-the-art pupil detection algorithms on our dataset with respect to robustness and accuracy. We further study the influence of image resolution and vision aids as well as recording location (indoor, outdoor) on pupil detection performance. Our evaluations provide valuable insights into the general pupil detection problem and allow us to identify key challenges for robust pupil detection on head-mounted eye trackers.
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling. "Appearance-Based Gaze Estimation in the Wild." Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 4511–4520, 2015. doi: 10.1109/CVPR.2015.7299081. PDF: https://perceptual.mpi-inf.mpg.de/files/2015/04/zhang_CVPR15.pdf

Abstract: Appearance-based gaze estimation is believed to work well in real-world settings, but existing datasets were collected under controlled laboratory conditions and methods were not evaluated across multiple datasets. In this work we study appearance-based gaze estimation in the wild. We present the MPIIGaze dataset that contains 213,659 images we collected from 15 participants during natural everyday laptop use over more than three months. Our dataset is significantly more variable than existing datasets with respect to appearance and illumination. We also present a method for in-the-wild appearance-based gaze estimation using multimodal convolutional neural networks, which significantly outperforms state-of-the-art methods in the most challenging cross-dataset evaluation setting. We present an extensive evaluation of several state-of-the-art image-based gaze estimation algorithms on three current datasets, including our own. This evaluation provides clear insights and allows us to identify key research challenges of gaze estimation in the wild.
Erroll Wood, Tadas Baltrusaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, Andreas Bulling. "Rendering of Eyes for Eye-Shape Registration and Gaze Estimation." Proc. of the IEEE International Conference on Computer Vision (ICCV 2015), pp. 3756–3764, 2015. doi: 10.1109/ICCV.2015.428. PDF: https://perceptual.mpi-inf.mpg.de/wp-content/blogs.dir/12/files/2016/06/wood2015_iccv.pdf

Abstract: Images of the eye are key in several computer vision problems, such as shape registration and gaze estimation. Recent large-scale supervised methods for these problems require time-consuming data collection and manual annotation, which can be unreliable. We propose synthesizing perfectly labelled photo-realistic training data in a fraction of the time. We used computer graphics techniques to build a collection of dynamic eye-region models from head scan geometry. These were randomly posed to synthesize close-up eye images for a wide range of head poses, gaze directions, and illumination conditions. We used our model's controllability to verify the importance of realistic illumination and shape variations in eye-region training data. Finally, we demonstrate the benefits of our synthesized training data (SynthesEyes) by out-performing state-of-the-art methods for eye-shape registration as well as cross-dataset appearance-based gaze estimation in the wild.
Junjie Yan, Xucong Zhang, Zhen Lei, Shengcai Liao, Stan Z. Li. "Robust Multi-Resolution Pedestrian Detection in Traffic Scenes." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3033–3040, 2013. PDF: https://perceptual.mpi-inf.mpg.de/files/2015/03/Yan_CVPR13.pdf

Abstract: The serious performance decline with decreasing resolution is the major bottleneck for current pedestrian detection techniques. In this paper, we take pedestrian detection in different resolutions as different but related problems, and propose a Multi-Task model to jointly consider their commonness and differences. The model contains resolution aware transformations to map pedestrians in different resolutions to a common space, where a shared detector is constructed to distinguish pedestrians from background. For model learning, we present a coordinate descent procedure to learn the resolution aware transformations and deformable part model (DPM) based detector iteratively. In traffic scenes, there are many false positives located around vehicles; therefore, we further build a context model to suppress them according to the pedestrian-vehicle relationship. The context model can be learned automatically even when the vehicle annotations are not available. Our method reduces the mean miss rate to 60% for pedestrians taller than 30 pixels on the Caltech Pedestrian Benchmark, which noticeably outperforms the previous state of the art (71%).
Junjie Yan, Xucong Zhang, Zhen Lei, Stan Z. Li. "Structural Face Detection." Proc. of the 10th IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2013. PDF: https://perceptual.mpi-inf.mpg.de/files/2015/03/yan_FG2013.pdf
Junjie Yan, Xucong Zhang, Zhen Lei, Stan Z. Li. "Real-Time High Performance Deformable Model for Face Detection in the Wild." International Conference on Biometrics (ICB), pp. 1–6, IEEE, 2013. PDF: https://perceptual.mpi-inf.mpg.de/files/2015/03/Yan_Vis13.pdf

Abstract: We present an effective deformable part model for face detection in the wild. Compared with previous face detection systems, there are mainly three contributions. The first is an efficient method for calculating histograms of oriented gradients using pre-calculated lookup tables, which only requires read and write memory operations, so the feature pyramid can be calculated in real time. The second is a Sparse Constrained Latent Bilinear Model that simultaneously learns the discriminative deformable part model and reduces the feature dimension through sparse transformations for efficient inference. The third contribution is a deformable part based cascade, where every stage is a deformable part in the discriminatively learned model. By integrating the three techniques, we demonstrate noticeable improvements over the previous state of the art on FDDB at real-time speed, in extensive comparisons with both academic and commercial detectors.
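The lookup-table idea for fast HOG computation can be sketched as pre-computing the orientation bin for every possible 8-bit gradient pair, so per-pixel binning becomes a single table read; the bin count and table layout here are assumptions, not the paper's implementation.

```python
import numpy as np

def orientation_lut(num_bins=9):
    """Hedged sketch: pre-compute the unsigned-gradient orientation bin for
    every (dx, dy) pair in [-255, 255], so binning at run time is one lookup:
    bin = lut[dx + 255, dy + 255]."""
    dx, dy = np.meshgrid(np.arange(-255, 256), np.arange(-255, 256), indexing="ij")
    angles = np.mod(np.arctan2(dy, dx), np.pi)          # fold into [0, pi)
    bins = (angles / np.pi * num_bins).astype(int)
    return np.minimum(bins, num_bins - 1)               # guard the pi boundary
```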
Xucong Zhang, Xiaoyun Wang, Yingmin Jia. "The Visual Internet of Things System Based on Depth Camera." Proceedings of the Chinese Intelligent Automation Conference, pp. 447–455, Springer, 2013. PDF: https://perceptual.mpi-inf.mpg.de/files/2015/03/Zhang_ICB13.pdf

Abstract: The Visual Internet of Things is an important part of information technology: it strengthens an Internet of Things system with atomic visual labels by using a camera as the sensor. Unfortunately, traditional color cameras are strongly affected by illumination conditions and suffer from low detection accuracy. To solve this problem, we build a new Visual Internet of Things system with a depth camera. The new system takes advantage of the illumination invariance of depth information and the rich texture of color information to label objects in the scene. We use a Kinect as the sensor to capture the color and depth information of the scene, adapt traditional computer vision techniques to this combined information to label target objects, and return the results to the user interface. We set up the hardware platform, and a real application validates the robustness and high precision of the system.
Xucong Zhang, Junjie Yan, Shikun Feng, Zhen Lei, Dong Yi, Stan Z. Li. "Water Filling: Unsupervised People Counting via Vertical Kinect Sensor." IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance (AVSS), pp. 215–220, 2012. PDF: https://perceptual.mpi-inf.mpg.de/files/2015/03/WaterFilling.pdf

Abstract: People counting is one of the key components in video surveillance applications; however, due to occlusion, illumination, color, and texture variation, the problem is far from solved. In contrast to traditional visible-camera-based systems, we construct a novel system that uses a vertically mounted Kinect sensor for people counting, where depth information is used to remove the effect of appearance variation. Since the head is always closer to the Kinect sensor than other parts of the body, the people counting task amounts to finding suitable local minimum regions. Exploiting the particular structure of the depth map, we propose a novel unsupervised water filling method that finds these regions with the properties of robustness, locality, and scale invariance. Experimental comparisons with mean shift and random forest on two databases validate the superiority of our water filling algorithm for people counting.
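A simple stand-in for the head-finding step (not the actual water-filling procedure): treat distinct local minima of the top-down depth map as head candidates. The window size and depth threshold are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def count_heads(depth, head_span=30, max_head_depth=2000):
    """Hedged sketch of the head-finding idea behind people counting with a
    vertical depth sensor: heads are the points closest to the camera, so look
    for distinct local minima in the depth map. This is a simplified stand-in
    for the paper's water-filling method; units are assumed to be mm and px."""
    local_min = depth == ndimage.minimum_filter(depth, size=head_span)
    candidates = local_min & (depth > 0) & (depth < max_head_depth)
    _, num_regions = ndimage.label(candidates)   # connected head candidates
    return num_regions
```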