[Remote] Student Researcher [Seed Multimodality & World Model – RL + Streaming Video Understanding] – 2026 Start (PhD)
Note: This is a remote position open to candidates in the USA. ByteDance is a leading technology company dedicated to pioneering advanced AI foundation models. They are looking for a PhD Intern to contribute to the development of real-time multimodal LLM-based agents for streaming video tasks, focusing on research in streaming video understanding and reinforcement learning.
Responsibilities
- Conduct research on streaming video understanding, especially for first-person or long-horizon applications, where the agent must continuously observe, interpret, and act
- Apply reinforcement learning to improve real-time perception and planning capabilities of streaming agents, including learning from human feedback, demonstrations, and/or verifiable rewards
- Build or enhance scalable data pipelines that convert offline video datasets into streaming-compatible formats, enabling the development of new agent capabilities
- Design and evaluate video agents that integrate LLMs/VLMs with decision-making components for downstream applications (e.g., tool use, retrieval, resolution switching)
Skills
- Currently pursuing a PhD in Computer Vision, Machine Learning, or a related field
- Research experience in video generation, world models, or dynamics modeling
- First-author publications in CVPR, ICCV, ECCV, NeurIPS, ICLR, or ICML
- Research experience in one or more of the following areas:
  - Streaming video understanding, online video processing, or sequential decision making from continuous visual inputs
  - Reinforcement learning (RL), especially when combined with LLMs or multimodal models (e.g., decision-making with VLMs, generative agents, action-planning)
  - Data engineering, such as synthetic data generation, prompt engineering, scalable data pipeline curation
- Strong software engineering skills and ability to work in existing infrastructure (e.g., PyTorch, distributed training frameworks)
- Familiarity with streaming video processing in multimodal LLMs
- Experience working with RL for LLMs or multimodal LLMs
- Experience working with large-scale data pipelines, including multimodal dataset processing and task-specific synthetic data generation
Benefits
- Interns have day-one access to health insurance
- Life insurance
- Wellbeing benefits and more
- Interns also receive 10 paid holidays per year
- Paid sick time (56 hours if hired in first half of year, 40 if hired in second half of year)
- Interns who are not working 100% remote may also be eligible for a housing allowance.