|
Zhu, Deyao 朱德尧
Welcome! I am a Research Scientist at ByteDance Seed Edge.
I am a core contributor to BAGEL
and the project lead of MiniGPT-4.
I received my PhD from KAUST,
where I was advised by Mohamed Elhoseiny.
My doctoral research focused on model-based RL, with an emphasis on sample-efficient learning.
My research interests are learning from experience,
multimodal LLMs, and reinforcement learning.
Email  / 
Google Scholar  / 
GitHub  / 
LinkedIn
|
|
|
|
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng,
Deyao Zhu,
Kunchang Li,
Chenhui Gou,
Feng Li,
Zeyu Wang,
Shu Zhong,
Weihao Yu,
Xiaonan Nie,
Ziang Song,
Guang Shi,
Haoqi Fan
Preprint, 2025
arXiv /
code /
model /
website /
demo
BAGEL is an open-source unified multimodal model for understanding and generation,
pretrained on large-scale interleaved text, image, video, and web data.
|
|
|
Causal Diffusion Transformers for Generative Modeling
Chaorui Deng,
Deyao Zhu,
Kunchang Li,
Guang Shi,
Haoqi Fan
Preprint, 2024
arXiv
Causal Diffusion introduces a next-token forecasting view of diffusion models and proposes
CausalFusion, a decoder-only transformer that factorizes data across sequence positions and
diffusion noise levels.
|
|
|
MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
Deyao Zhu*,
Jun Chen*,
Xiaoqian Shen,
Xiang Li,
Mohamed Elhoseiny
Preprint
arXiv /
code /
model /
dataset /
website /
demo /
video
MiniGPT-4 shows that the secret behind GPT-4's next-level vision-language ability may simply be
a more powerful LLM. By aligning an open-source vision model with an advanced language model,
MiniGPT-4 reproduces many of GPT-4's vision-related demos.
|
|
|
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen,
Deyao Zhu,
Kilichbek Haydarov,
Xiang Li,
Mohamed Elhoseiny
Preprint
arXiv /
code
Video ChatCaptioner creates comprehensive spatiotemporal video descriptions by letting ChatGPT
select the video frames it wants to know about and ask BLIP-2 questions about them.
At the end, ChatGPT summarizes all the information from BLIP-2 into the final video description.
|
|
|
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Deyao Zhu,
Jun Chen,
Kilichbek Haydarov,
Xiaoqian Shen,
Wenxuan Zhang,
Mohamed Elhoseiny
Preprint
arXiv /
code
We discover the powerful questioning ability of modern LLMs.
We use it to enrich BLIP-2's image captions by prompting ChatGPT to keep asking BLIP-2 informative
questions and then summarize the conversation as the final caption.
|
|
|
Exploring Open-Vocabulary Semantic Segmentation without Human Labels
Jun Chen,
Deyao Zhu,
Guochen Qian,
Bernard Ghanem,
Zhicheng Yan,
Chenchen Zhu,
Fanyi Xiao,
Mohamed Elhoseiny,
Sean Chang Culatana
Preprint
arXiv
ZeroSeg is a novel method that leverages an existing pretrained vision-language (VL) model to train
open-vocabulary zero-shot semantic segmentation models.
|
|
|
Guiding Online Reinforcement Learning with Action-Free Offline Pretraining
Deyao Zhu,
Yuhui Wang,
Jürgen Schmidhuber,
Mohamed Elhoseiny
Preprint
arXiv /
code
Extract knowledge from datasets without action labels to help online reinforcement learning by
pretraining an Action-Free Decision Transformer to form intrinsic rewards.
|
|
|
Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning
Deyao Zhu,
Li Erran Li,
Mohamed Elhoseiny
ICLR, 2023
openreview /
arXiv /
code
Applying RL methods to a graph world model instead of the original complex environment simplifies policy learning.
|
|
|
Social-Implicit: Rethinking Trajectory Prediction Evaluation and The Effectiveness of Implicit Maximum Likelihood Estimation
Abduallah Mohamed,
Deyao Zhu,
Warren Vu,
Mohamed Elhoseiny,
Christian Claudel
ECCV, 2022
arXiv /
code /
demo
A better metric for trajectory prediction that considers the whole prediction distribution.
|
|
|
RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition
Jun Chen,
Aniket Agarwal,
Sherif Abdelkarim,
Deyao Zhu,
Mohamed Elhoseiny
CVPR, 2022
arXiv /
code
Modeling an effective message-passing flow through an attention mechanism can be critical to tackling the compositionality and long-tail challenges in visual relationship recognition.
|
|
|
Motion Forecasting with Unlikelihood Training in Continuous Space
Deyao Zhu,
Mohamed Zahran,
Li Erran Li,
Mohamed Elhoseiny
CoRL, 2021 (Oral Presentation)
openreview /
code
Reducing the likelihood of the context-violating predictions directly in the predicted distribution improves the prediction quality.
|
|
|
HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents
Deyao Zhu,
Mohamed Zahran,
Li Erran Li,
Mohamed Elhoseiny
ICLR, 2021
openreview /
video
Hallucinating surrounding vehicles' driving intents helps the model predict better.
|
|
|
Learning to Disentangle Latent Physical Factors for Video Prediction
Deyao Zhu,
Marco Munderloh,
Bodo Rosenhahn,
Jörg Stückler
GCPR, 2019
openreview /
code /
video
Reduces the total correlation among the latent feature dimensions to learn a physically disentangled representation of blocks.
|
|
Third place in the Habitat Rearrangement Challenge 2022
|
|
Reviewer for TPAMI, CoRL 2022, ECCV 2022, AAAI 2023, CVPR 2023
|
|
Teaching Assistant for CS283 Deep Generative Model and CS326 Low Resource Deep Learning
|
|