Biography
The vision of Kris Kitani's lab is to realize robust autonomous systems built for real-world perception and interactive decision-making. The key focus areas of his lab are perception, decision-making, and interaction. His focus is broad because he believes that innovating at the system level requires expertise and integration at the component level.
Kitani's lab innovates across the full spectrum of perception, including vision-based human pose estimation, action recognition, object detection/tracking/forecasting, and 3D scene understanding. They formulate models for decision-making, including approaches such as reinforcement learning, inverse reinforcement learning, imitation learning, and game-theoretic modeling. They develop real-world systems to enable cyber-physical interaction, including wearable camera systems, multi-modal sensors, portable navigational aids, and assistive mobile robots.
Education
The University of Tokyo
Ph.D.
Information and Communication Engineering
The University of Tokyo
M.S.
Information and Communication Engineering
University of Southern California
B.S.
Electrical Engineering
Languages
- Japanese
- English
Patents
Method And System For Generating Pedestrian-Vehicle Interaction Data For Training An Autonomous Vehicle
18829746
2026-03-12
A method and system for generating virtual pedestrian-vehicle interaction data includes generating a virtual reality environment in a virtual reality device, generating a scenario in the virtual reality environment, the scenario comprising virtual vehicle movements, displaying the scenario in a virtual reality device, storing virtual reality movements relative to the scenario, the virtual reality movements comprising at least a yaw movement, communicating the virtual vehicle movements to a simulator controller, associating the virtual reality movements, the virtual vehicle movements and the scenario to form pedestrian-vehicle data, and training an autonomous vehicle system using the pedestrian-vehicle data.
Method for diverse sequential point cloud forecasting
19383503
2026-03-05
A method for sequential point cloud forecasting is described. The method includes training a vector-quantized conditional variational autoencoder (VQ-CVAE) framework to map an output to a closest vector in a discrete latent space to obtain a future latent space. The method also includes outputting, by a trained VQ-CVAE, a categorical distribution of a probability of V vectors in a discrete latent space in response to an input previously sampled latent space and past point cloud sequences. The method further includes sampling an inferred future latent space from the categorical distribution of the probability of the V vectors in the discrete latent space. The method also includes predicting a future point cloud sequence according to the inferred future latent space and the past point cloud sequences. The method further includes denoising, by a denoising diffusion probabilistic model (DDPM), the predicted future point cloud sequences according to an added noise.
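As a rough illustration of two steps named in the abstract above (mapping an output to its closest vector in a discrete latent space, and sampling from a categorical distribution over the V vectors), a minimal pure-Python sketch might look like the following. The codebook values, the probabilities, and the function names are all hypothetical and chosen only for illustration; this is not the patented implementation, and it omits the encoder, the decoder, and the DDPM denoising step entirely.

```python
import random


def nearest_code_index(z, codebook):
    """Return the index of the codebook vector closest to z (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))


def sample_code_index(probs, rng):
    """Draw one codebook index from a categorical distribution over the V vectors."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical codebook with V = 3 two-dimensional vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Quantization: a (made-up) encoder output is snapped to its nearest codebook vector.
z = [0.9, 1.2]
idx = nearest_code_index(z, codebook)
quantized = codebook[idx]

# Sampling: probs stands in for the categorical distribution a trained model would output.
probs = [0.1, 0.7, 0.2]
rng = random.Random(0)  # seeded so repeated runs draw the same index
sampled = codebook[sample_code_index(probs, rng)]
```

Drawing repeatedly from the categorical distribution, rather than always taking its most probable entry, is what would let a model of this kind produce several different plausible futures from the same past sequence.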
Articles
Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors
arXiv preprint arXiv:2603.16233
Ryosuke Hori, Jyun-Ting Song, Zhengyi Luo, Jinkun Cao, Soyong Shin, Hideo Saito, Kris Kitani
2026-03-27
We propose Ground Reaction Inertial Poser (GRIP), a method that reconstructs physically plausible human motion using four wearable devices. Unlike conventional IMU-only approaches, GRIP combines IMU signals with foot pressure data to capture both body dynamics and ground interactions. Furthermore, rather than relying solely on kinematic estimation, GRIP uses a digital twin of a person, in the form of a synthetic humanoid in a physics simulator, to reconstruct realistic and physically plausible motion. At its core, GRIP consists of two modules: KinematicsNet, which estimates body poses and velocities from sensor data, and DynamicsNet, which controls the humanoid in the simulator using the residual between the KinematicsNet prediction and the simulated humanoid state. To enable robust training and fair evaluation, we introduce a large-scale dataset, Pressure and Inertial Sensing for Human Motion and Interaction (PRISM), that captures diverse human motions with synchronized IMUs and insole pressure sensors. Experimental results show that GRIP outperforms existing IMU-only and IMU-pressure fusion methods across all evaluated datasets, achieving higher global pose accuracy and improved physical consistency.
DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction
arXiv preprint arXiv:2603.03265
Yufu Wang, Evonne Ng, Soyong Shin, Rawal Khirodkar, Yuan Dong, Zhaoen Su, Jinhyung Park, Kris Kitani, Alexander Richard, Fabian Prada, Michael Zollhofer
2026-03-03
We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly and bypassing parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space reconstruction error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space error. Project page: https://yufu-wang.github.io/duomo/


