Kris Kitani

Associate Research Professor, Carnegie Mellon University

  • Pittsburgh, PA

Biography

Kris M. Kitani is an associate research professor in the Robotics Institute at Carnegie Mellon University. He received his BS from the University of Southern California and his MS and PhD from the University of Tokyo. His research spans computer vision, machine learning and human-computer interaction. In particular, his research interests lie at the intersection of first-person vision, human activity modeling and inverse reinforcement learning. His work has received the Marr Prize honorable mention at ICCV 2017, best paper honorable mention at CHI 2017 and CHI 2020, best paper at W4A 2017 and 2019, best application paper at ACCV 2014 and best paper honorable mention at ECCV 2012.

The vision of Kris Kitani's lab is to realize robust autonomous systems built for real-world perception and interactive decision-making. The key focus areas of his lab are perception, decision-making, and interaction. His focus is broad because he believes that innovating at the system level requires expertise and integration at the component level.

Kitani's lab innovates across the full spectrum of perception, including vision-based human pose estimation, action recognition, object detection/tracking/forecasting, and 3D scene understanding. The lab formulates models for decision-making, including approaches such as reinforcement learning, inverse reinforcement learning, imitation learning, and game-theoretic modeling. It also develops real-world systems to enable cyber-physical interaction, including wearable camera systems, multi-modal sensors, portable navigational aids, and assistive mobile robots.

Areas of Expertise

Sensing & Perception
Sustainability
Digital Twins
Computer Vision
Artificial Intelligence

Media Appearances

Beep Beep! Suitcase Helps Visually Impaired People Navigate Airports

WESA (radio)

2019-05-14

“It’s basic physics,” said Kris Kitani, a head researcher at CMU’s Cognitive Assistance Laboratory and one of BBeep’s creators. “A person is a point on the ground plan. And they’re moving at a certain velocity. And then we can predict, based on that velocity, where they will be at in a few seconds.”
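The prediction described in the quote can be illustrated with a minimal constant-velocity sketch. This is not BBeep's actual code; the `Pedestrian` class, the `predict_position` function, the units, and the sample values are all hypothetical and chosen only to show the idea of extrapolating a ground-plane point along its current velocity.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Pedestrian:
    # Position of the person as a point on the ground (assumed units: meters).
    x: float
    y: float
    # Current velocity (assumed units: meters per second).
    vx: float
    vy: float


def predict_position(p: Pedestrian, seconds_ahead: float) -> Tuple[float, float]:
    """Extrapolate the position, assuming the velocity stays constant."""
    return (p.x + p.vx * seconds_ahead, p.y + p.vy * seconds_ahead)


# Made-up example: a person at (0, 2) walking at (1.5, -0.5) m/s.
walker = Pedestrian(x=0.0, y=2.0, vx=1.5, vy=-0.5)
future = predict_position(walker, seconds_ahead=2.0)
print(future)  # (3.0, 1.0)
```

A real system would presumably also have to estimate the velocity from noisy sensor data, which this sketch leaves out.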

Local Students Create Camera Inside Of A Football

CBS News Pittsburgh (online)

2013-02-27

"We stick this camera into a little hole in the football," says researcher Kris Kitani as he held a tiny camera in his hand. "And then, with that, we'd record it, and we'd throw the football. Once you put the camera in there, it's going to be rotating really quickly as the ball is spinning and flying in the sky."

Education

The University of Tokyo

Ph.D.

Information and Communication Engineering

The University of Tokyo

M.S.

Information and Communication Engineering

University of Southern California

B.S.

Electrical Engineering

Languages

  • Japanese
  • English

Patents

Method And System For Generating Pedestrian-Vehicle Interaction Data For Training An Autonomous Vehicle

18829746

2026-03-12

A method and system for generating virtual pedestrian-vehicle interaction data includes generating a virtual reality environment in a virtual reality device, generating a scenario in the virtual reality environment, the scenario comprising virtual vehicle movements, displaying the scenario in a virtual reality device, storing virtual reality movements relative to the scenario, the virtual reality movements comprising at least a yaw movement, communicating the virtual vehicle movements to a simulator controller, associating the virtual reality movements, the virtual vehicle movements and the scenario to form pedestrian-vehicle data, and training an autonomous vehicle system using the pedestrian-vehicle data.

Method for diverse sequential point cloud forecasting

19383503

2026-03-05

A method for sequential point cloud forecasting is described. The method includes training a vector-quantized conditional variational autoencoder (VQ-CVAE) framework to map an output to a closest vector in a discrete latent space to obtain a future latent space. The method also includes outputting, by a trained VQ-CVAE, a categorical distribution of a probability of V vectors in a discrete latent space in response to an input previously sampled latent space and past point cloud sequences. The method further includes sampling an inferred future latent space from the categorical distribution of the probability of the V vectors in the discrete latent space. The method also includes predicting a future point cloud sequence according to the inferred future latent space and the past point cloud sequences. The method further includes denoising, by a denoising diffusion probabilistic model (DDPM), the predicted future point cloud sequences according to an added noise.
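Two of the steps named in this abstract, mapping an output to the closest vector in a discrete latent space and sampling from a categorical distribution over the V vectors, can be illustrated with a toy sketch. This is not the patented method and omits the encoder, decoder, and DDPM denoising entirely; the function names, the codebook values, and the probabilities below are hypothetical.

```python
import random
from typing import List, Sequence


def nearest_code_index(z: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to z (squared Euclidean distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))


def sample_code_index(probs: Sequence[float], rng: random.Random) -> int:
    """Draw one index from a categorical distribution over the V codebook vectors."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy codebook with V = 3 vectors in a 2-D latent space (made-up values).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]

# Quantization: a continuous output snaps to its nearest codebook entry.
idx = nearest_code_index((0.9, 1.2), codebook)
print(idx)  # 1

# Sampling: pick a codebook entry according to made-up probabilities.
sampled = sample_code_index([0.1, 0.7, 0.2], random.Random(0))
print(codebook[sampled])
```

Sampling from the categorical distribution, rather than always taking the most likely vector, is what would allow different futures to be drawn from the same past sequence.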

Articles

EgoMDM: Diffusion-Based Human Motion Synthesis from Sparse Egocentric Sensors

IEEE Xplore

2026-05-27

Accurate three-dimensional (3D) human motion tracking is essential for immersive augmented reality (AR) and virtual reality (VR) applications, allowing users to engage with virtual environments through realistic full-body avatars. Achieving this level of detail, however, is challenging when the driving signals are sparse, typically coming only from upper-body sensors, such as head-mounted devices and hand controllers. To address this challenge, we propose EgoMDM (Egocentric Motion Diffusion Model), an end-to-end diffusion-based framework designed to reconstruct full-body motion from sparse tracking signals. EgoMDM models human motion in a conditional autoregressive manner using a unidirectional recurrent neural network, making it well-suited for real-time applications.

Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors

arXiv preprint

2026-03-27

We propose Ground Reaction Inertial Poser (GRIP), a method that reconstructs physically plausible human motion using four wearable devices. Unlike conventional IMU-only approaches, GRIP combines IMU signals with foot pressure data to capture both body dynamics and ground interactions.

BodyContact4D: A Multi-view Video Dataset for Understanding Human and Environment Interactions

IEEE Xplore

2026-03-20

To improve vision-based methods for understanding how people interact with their physical environment, we introduce a multi-view video and body-contact sensing dataset designed to capture dynamic human activities that involve interactions with the physical environment. The dataset includes activities such as parkour, physical training, and gym exercises, characterized by frequent body-environment contact. The proposed dataset includes 780K images across 120K pose sequences from 7 subjects.
