We introduced Long Story Short, a summarize-then-search method that captures both the global narrative and the relevant details needed for video narrative QA. Our approach is effective when the QA context is vast and answering requires high-level interaction with that context, as is the case for long video QA. We further propose to strengthen the visual grounding of model-generated answers by post-checking their visual alignment with CLIPCheck, as sketched below. Our zero-shot method outperforms supervised state-of-the-art approaches on the MovieQA and DramaQA benchmarks. We plan to release the code and the generated plot data to the public.
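To make the post-checking idea concrete, the following is a minimal sketch of CLIP-based answer verification in the spirit of CLIPCheck: candidate answers are scored against retrieved key frames with CLIP image-text similarity and blended with the language model's answer scores. The function name, the blending weight alpha, the softmax temperature, and the input format (PIL frames, GPT-3 log-probabilities) are illustrative assumptions, not the exact procedure or API from the paper.

```python
import torch
import clip  # OpenAI CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clipcheck_rerank(frames, candidates, lm_scores, alpha=0.5):
    """Blend language-model answer scores with CLIP image-text similarity.

    frames:     list of PIL.Image key frames retrieved for the question
    candidates: list of candidate answer strings
    lm_scores:  list of language-model scores (e.g. log-probabilities), one per candidate
    alpha:      weight on the language-model distribution (illustrative default)
    """
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    texts = clip.tokenize(candidates).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(images)
        txt_emb = model.encode_text(texts)
    # Normalize embeddings so the dot product is cosine similarity.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    # For each candidate answer, take its best similarity over the retrieved frames.
    sim = (txt_emb @ img_emb.T).max(dim=-1).values
    visual = torch.softmax(sim / 0.01, dim=0)  # sharpen into a distribution (assumed temperature)
    lm = torch.softmax(torch.tensor(lm_scores, device=device, dtype=visual.dtype), dim=0)
    blended = alpha * lm + (1 - alpha) * visual
    return int(blended.argmax())

# Usage (hypothetical): best_idx = clipcheck_rerank(key_frames, answer_options, gpt3_logprobs)
```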
We see two research directions beyond this work. First, providing visual descriptions better aligned with the story, via character re-identification and coreference resolution, would improve the quality of the input to GPT-3. Second, one could devise a more dynamic multi-hop search that combines global and local information in a hierarchical manner.