Zhu, Deyao 朱德尧

Welcome! I am a PhD candidate at KAUST, where I work on multimodal large language models, prediction models, and reinforcement learning, advised by Mohamed Elhoseiny. I am Hokkien Chinese from Quanzhou.

Email  /  CV  /  Google Scholar  /  GitHub  /  LinkedIn

Research

My research interests lie in AGI. In particular, I am interested in designing multimodal language models that can make decisions. This includes Reinforcement Learning, Video & Language Understanding, Planning via Large Language Models, Motion Forecasting, and other related deep learning topics.

MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
Deyao Zhu*, Jun Chen*, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
Preprint
arXiv / code / model / dataset / website / demo / video

MiniGPT-4 shows that the secret behind GPT-4's next-level vision-language ability may simply be a more powerful LLM. By aligning an open-source vision model with an advanced language model, MiniGPT-4 reproduces many of GPT-4's vision-related demos.

Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, Mohamed Elhoseiny
Preprint
arXiv / code

Video ChatCaptioner creates comprehensive spatiotemporal video descriptions by letting ChatGPT select the video frames it wants to learn about and ask BLIP-2 questions about them. At the end, ChatGPT summarizes all the information from BLIP-2 into the final video description.

ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, Mohamed Elhoseiny
Preprint
arXiv / code

We discover the powerful questioning ability of modern LLMs and use it to enrich BLIP-2's image captions: ChatGPT is prompted to keep asking BLIP-2 informative questions and then summarize the conversation into the final caption.

Exploring Open-Vocabulary Semantic Segmentation without Human Labels
Jun Chen, Deyao Zhu, Guocheng Qian, Bernard Ghanem, Zhicheng Yan, Chenchen Zhu, Fanyi Xiao, Mohamed Elhoseiny, Sean Chang Culatana
Preprint
arXiv

ZeroSeg is a novel method that leverages a pretrained vision-language (VL) model to train open-vocabulary zero-shot semantic segmentation models without human labels.

Guiding Online Reinforcement Learning with Action-Free Offline Pretraining
Deyao Zhu, Yuhui Wang, Jürgen Schmidhuber, Mohamed Elhoseiny
Preprint
arXiv / code

We extract knowledge from datasets without action labels to aid online reinforcement learning by pretraining an Action-Free Decision Transformer that forms intrinsic rewards.

Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning
Deyao Zhu, Li Erran Li, Mohamed Elhoseiny
ICLR, 2023
openreview / arXiv / code

Applying RL methods to a graph-structured world model instead of the original complex environment simplifies policy learning.

Social-Implicit: Rethinking Trajectory Prediction Evaluation and The Effectiveness of Implicit Maximum Likelihood Estimation
Abduallah Mohamed, Deyao Zhu, Warren Vu, Mohamed Elhoseiny, Christian Claudel
ECCV, 2022
arXiv / code / demo

A better metric for trajectory prediction that considers the whole prediction distribution.

RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition
Jun Chen, Aniket Agarwal, Sherif Abdelkarim, Deyao Zhu, Mohamed Elhoseiny
CVPR, 2022
arXiv / code

Modeling an effective message-passing flow through an attention mechanism can be critical to tackling the compositionality and long-tail challenges in visual relationship recognition.

Motion Forecasting with Unlikelihood Training in Continuous Space
Deyao Zhu, Mohamed Zahran, Li Erran Li, Mohamed Elhoseiny
CoRL, 2021   (Oral Presentation)
openreview / code

Reducing the likelihood of context-violating predictions directly in the predicted distribution improves prediction quality.

HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents
Deyao Zhu, Mohamed Zahran, Li Erran Li, Mohamed Elhoseiny
ICLR, 2021
openreview / video

Hallucinating surrounding vehicles' driving intents helps the model predict better.

Learning to Disentangle Latent Physical Factors for Video Prediction
Deyao Zhu, Marco Munderloh, Bodo Rosenhahn, Jörg Stückler
GCPR, 2019
openreview / code / video

Reducing the total correlation among the latent feature's dimensions learns a physically disentangled representation of blocks.

Misc
Third Place in the Habitat Rearrangement Challenge 2022
Reviewer for TPAMI, CoRL 2022, ECCV 2022, AAAI 2023, CVPR 2023
Teaching Assistant for CS283 Deep Generative Models and CS326 Low Resource Deep Learning
