You can hack RoPE (Rotary Position Embedding) to create multi-shot videos without changing the model architecture: insert phase shifts at shot boundaries to signal transitions, and map reference images to specific spatiotemporal coordinates to control where subjects appear.
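To make the phase-shift half of this concrete, here is a minimal numpy sketch assuming a standard 1D RoPE over frame indices; the `shift` size, function names, and boundary positions are illustrative choices, not values from the paper.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE: angle[p, i] = p / base^(2i/dim) for each rotary pair i."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)             # (num_positions, dim/2)

def shifted_positions(num_frames, shot_boundaries, shift=16.0):
    """Add a positional offset at every shot boundary so the rotary phase
    jumps, signalling a transition without touching any weights.
    `shift` is a hypothetical hyperparameter."""
    pos = np.arange(num_frames, dtype=np.float64)
    for b in shot_boundaries:                         # e.g. cuts after frames 24 and 48
        pos[b:] += shift
    return pos

def apply_rope(x, angles):
    """Rotate feature pairs (x_even, x_odd) by the given angles."""
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# 72 frames, head_dim 64, shot cuts after frames 24 and 48
frames = np.random.randn(72, 64)
angles = rope_angles(shifted_positions(72, [24, 48]), dim=64)
q = apply_rope(frames, angles)
```

The reference-image half works the same way in spirit: assign the reference tokens the spatiotemporal position where the subject should appear, and the attention pattern follows.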
Deep Research is RAG's evolution into autonomous agency: the LLM doesn't just retrieve-then-answer—it plans query decomposition, decides WHEN to retrieve based on confidence, manages working memory across long horizons, and synthesizes verifiable reports with citations.
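A minimal sketch of that loop, where `llm` and `search` are hypothetical interfaces (not a specific API) and the confidence threshold is an illustrative parameter:

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    """Working memory carried across the agent's long horizon."""
    question: str
    open_queries: list = field(default_factory=list)   # sub-questions still to answer
    notes: list = field(default_factory=list)          # (claim, source_url) pairs

def deep_research(question, llm, search, confidence_threshold=0.8, max_steps=20):
    """Plan -> conditionally retrieve -> update memory -> synthesize with citations."""
    state = ResearchState(question=question)
    state.open_queries = llm.decompose(question)        # plan: break into sub-queries

    for _ in range(max_steps):
        if not state.open_queries:
            break
        query = state.open_queries.pop(0)
        draft, confidence = llm.answer_with_confidence(query, state.notes)
        if confidence < confidence_threshold:           # retrieve only when unsure
            for doc in search(query, top_k=5):
                state.notes.append((llm.extract_claim(query, doc.text), doc.url))
            state.open_queries.extend(llm.follow_ups(query, state.notes))

    # every claim in the report carries the source it was extracted from
    return llm.write_report(question, state.notes)
```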
Store sparse keyframe snapshots (not dense 3D reconstructions) as a memory graph, localize via image-to-instance hybrid retrieval, and run global planning at low frequency while local control operates at high frequency. This mimics how humans actually navigate: "I remember that corner" not "I have a point cloud of every surface."
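A minimal sketch of the memory graph, assuming unit-norm image embeddings; the class, the networkx backing, and the scoring weights are illustrative choices rather than the paper's implementation.

```python
import numpy as np
import networkx as nx

class KeyframeMemory:
    """Sparse topological map: nodes are keyframe snapshots (image embedding +
    detected instances), edges connect places reachable from one another."""
    def __init__(self):
        self.graph = nx.Graph()

    def add_keyframe(self, node_id, image_emb, instances, neighbor=None):
        self.graph.add_node(node_id, image_emb=image_emb, instances=set(instances))
        if neighbor is not None:
            self.graph.add_edge(neighbor, node_id)

    def localize(self, query_emb, query_instances):
        """Hybrid retrieval: visual similarity (dot product of unit-norm
        embeddings) plus overlap in detected instance labels."""
        def score(n):
            data = self.graph.nodes[n]
            visual = float(query_emb @ data["image_emb"])
            semantic = len(data["instances"] & set(query_instances))
            return visual + 0.5 * semantic            # weighting is illustrative
        return max(self.graph.nodes, key=score)

    def plan(self, start, goal):
        return nx.shortest_path(self.graph, start, goal)
```

In the two-rate loop, `plan()` would run at a low rate to produce the next waypoint, while a local controller tracks that waypoint at a much higher rate.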
A lightweight "lightning indexer" can learn which tokens matter for attention, reducing O(L²) to O(Lk) complexity while preserving quality. Combined with allocating >10% of pre-training compute to post-training RL, this unlocks frontier-level reasoning in open models.
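A toy PyTorch sketch of the selection idea: a small-dimension indexer scores every (query, key) pair cheaply, and exact attention then runs only over each query's top-k keys. The single-head layout, dense indexer, and dimensions are simplifications for illustration, not the paper's kernel.

```python
import torch
import torch.nn.functional as F

def sparse_attention_topk(q, k, v, indexer_q, indexer_k, top_k=64):
    """Score all pairs in a reduced dimension, keep top-k keys per query,
    then aggregate values over only those keys: O(L*k) instead of O(L^2)
    for the expensive full-dimension attention."""
    L, d = q.shape
    scores = indexer_q @ indexer_k.T                   # (L, L), cheap: d_idx << d
    idx = scores.topk(min(top_k, L), dim=-1).indices   # (L, top_k) selected keys

    k_sel, v_sel = k[idx], v[idx]                      # (L, top_k, d)
    attn = torch.einsum("ld,lkd->lk", q, k_sel) / d**0.5
    w = F.softmax(attn, dim=-1)
    return torch.einsum("lk,lkd->ld", w, v_sel)

L, d, d_idx = 512, 64, 16
q, k, v = (torch.randn(L, d) for _ in range(3))
iq, ik = torch.randn(L, d_idx), torch.randn(L, d_idx)  # stand-ins for learned indexer outputs
out = sparse_attention_topk(q, k, v, iq, ik)
```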
An 8B-parameter model trained with multi-objective reinforcement learning (correctness + efficiency + user preference) can orchestrate stronger models and tools to outperform GPT-5 at 30% of the cost. The key insight: the "brain" coordinating the system doesn't need to be the biggest component—it just needs to learn WHEN to call expensive resources.
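A toy sketch of the two levers, with made-up weights, budgets, and prices: a reward that trades correctness against spend and preference, and a routing rule that escalates only when the expected gain covers the cost.

```python
def combined_reward(correct, tokens_used, preference_score,
                    w_correct=1.0, w_cost=0.2, w_pref=0.3, token_budget=4000):
    """Multi-objective RL reward: reward correctness, penalize spend on
    expensive calls, reward user preference. All weights are illustrative."""
    efficiency = 1.0 - min(tokens_used / token_budget, 1.0)
    return w_correct * float(correct) + w_cost * efficiency + w_pref * preference_score

def route(expected_gain, cost_per_call, threshold=0.1):
    """Escalate to a stronger model only when the expected improvement
    justifies the price. In the trained system this trade-off is learned,
    not hand-coded; the numbers here are hypothetical."""
    return "strong_model" if expected_gain - cost_per_call > threshold else "local_8b"

print(combined_reward(correct=True, tokens_used=1200, preference_score=0.8))
print(route(expected_gain=0.4, cost_per_call=0.05))
```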
These summaries are for researchers who want to understand, not skim.
They're preparation for reading papers with the right mental model—not a replacement.