Zhu, Deyao 朱德尧

Welcome! I am a Research Scientist at ByteDance Seed Edge. I am a core contributor to BAGEL and the project lead of MiniGPT-4. I received my PhD from KAUST, where I was advised by Mohamed Elhoseiny. My doctoral research focused on model-based RL, with an emphasis on sample-efficient learning. My research interests are learning from experience, multimodal LLMs, and reinforcement learning.

Email / Google Scholar / GitHub / LinkedIn

Selected Publications
BAGEL
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan
Preprint, 2025
arXiv / code / model / website / demo

BAGEL is an open-source unified multimodal model for understanding and generation, pretrained on large-scale interleaved text, image, video, and web data.

Causal Diffusion
Causal Diffusion Transformers for Generative Modeling
Chaorui Deng, Deyao Zhu, Kunchang Li, Guang Shi, Haoqi Fan
Preprint, 2024
arXiv

Causal Diffusion introduces a next-token forecasting view of diffusion models and proposes CausalFusion, a decoder-only transformer that factorizes data across sequence positions and diffusion noise levels.

MiniGPT-4
MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
Deyao Zhu*, Jun Chen*, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
Preprint
arXiv / code / model / dataset / website / demo / video

MiniGPT-4 shows that the secret behind GPT-4's next-level vision-language ability can simply be a more powerful LLM. By aligning open-source vision models with advanced language models, MiniGPT-4 reproduces many of GPT-4's vision-related demos.

Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, Mohamed Elhoseiny
Preprint
arXiv / code

Video ChatCaptioner creates comprehensive spatiotemporal video descriptions by letting ChatGPT select the video frames it wants to know about and ask BLIP-2 questions about them. At the end, ChatGPT summarizes all the information from BLIP-2 into the final video description.

ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, Mohamed Elhoseiny
Preprint
arXiv / code

We find that modern LLMs have a powerful questioning ability and use it to enrich BLIP-2's image captions: ChatGPT is prompted to keep asking BLIP-2 informative questions and then summarizes the conversation as the final caption.

Exploring Open-Vocabulary Semantic Segmentation without Human Labels
Jun Chen, Deyao Zhu, Guochen Qian, Bernard Ghanem, Zhicheng Yan, Chenchen Zhu, Fanyi Xiao, Mohamed Elhoseiny, Sean Chang Culatana
Preprint
arXiv

ZeroSeg is a novel method that leverages an existing pretrained vision-language (VL) model to train open-vocabulary zero-shot semantic segmentation models.

Guiding Online Reinforcement Learning with Action-Free Offline Pretraining
Deyao Zhu, Yuhui Wang, Jürgen Schmidhuber, Mohamed Elhoseiny
Preprint
arXiv / code

We extract knowledge from datasets without action labels to help online reinforcement learning by pretraining an Action-Free Decision Transformer that is used to form intrinsic rewards.

Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning
Deyao Zhu, Li Erran Li, Mohamed Elhoseiny
ICLR, 2023
openreview / arXiv / code

Applying RL methods to a graph world model instead of the original complex environment simplifies policy learning.

Social-Implicit: Rethinking Trajectory Prediction Evaluation and The Effectiveness of Implicit Maximum Likelihood Estimation
Abduallah Mohamed, Deyao Zhu, Warren Vu, Mohamed Elhoseiny, Christian Claudel
ECCV, 2022
arXiv / code / demo

A better metric for trajectory prediction that considers the whole predicted distribution.

RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition
Jun Chen, Aniket Agarwal, Sherif Abdelkarim, Deyao Zhu, Mohamed Elhoseiny
CVPR, 2022
arXiv / code

Modeling an effective message-passing flow through an attention mechanism can be critical to tackling the compositionality and long-tail challenges in visual relationship recognition.

Motion Forecasting with Unlikelihood Training in Continuous Space
Deyao Zhu, Mohamed Zahran, Li Erran Li, Mohamed Elhoseiny
CoRL, 2021 (Oral Presentation)
openreview / code

Reducing the likelihood of context-violating predictions directly in the predicted distribution improves prediction quality.

HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents
Deyao Zhu, Mohamed Zahran, Li Erran Li, Mohamed Elhoseiny
ICLR, 2021
openreview / video

Hallucinating the driving intents of surrounding vehicles helps the model make better predictions.

Learning to Disentangle Latent Physical Factors for Video Prediction
Deyao Zhu, Marco Munderloh, Bodo Rosenhahn, Jörg Stückler
GCPR, 2019
openreview / code / video

We reduce the total correlation among the latent feature dimensions to learn a physically disentangled representation of blocks.

Misc
Third place in the Habitat Rearrangement Challenge 2022
Reviewer for TPAMI, CoRL 2022, ECCV 2022, AAAI 2023, CVPR 2023
Teaching Assistant for CS283 Deep Generative Model and CS326 Low Resource Deep Learning
