A Summarize-then-Search Method for Long Video Question Answering: Method
2024-5-26 21:0:29 Author: hackernoon.com

Too Long; Didn't Read

In this paper, researchers explore zero-shot video QA with GPT-3, using narrative summaries and visual matching to outperform supervised models.

Kinetograph: The Video Editing Technology Publication

2. Method

Figure 2: The qualitative result showing our proposed Long Story Short (LSS) model that generates and retrieves the index of raw video footage. When the model predicts the final answer from (i) the generated Summary and (ii) the retrieved text context, CLIPCheck validates each candidate’s answers to revise the final answer for the question.

2.1. Plot Generation

2.2. Narrative Search

Given the summarized narrative and the question, we wish to retrieve the relatively short clip relevant to the question from the long video. Language models generate open-ended text, which is irregular and often noisy. To pinpoint the exact part of the video, we prompt the model to output the indices of plot pieces rather than free-form text.
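As a rough illustration of the index-based output described above, the sketch below pulls plot indices out of a raw model response. The function name `parse_plot_indices`, the 1-based numbering, and the regex-based extraction are assumptions made for this example; the source does not specify how the indices are parsed.

```python
import re


def parse_plot_indices(model_output: str, num_pieces: int) -> list[int]:
    """Extract valid, de-duplicated plot indices from free-form model output.

    Assumption: plot pieces are numbered 1..num_pieces in the prompt.
    """
    seen = set()
    indices = []
    for token in re.findall(r"\d+", model_output):
        idx = int(token)
        # Keep only indices that point at an existing plot piece,
        # and drop repeats while preserving first-seen order.
        if 1 <= idx <= num_pieces and idx not in seen:
            seen.add(idx)
            indices.append(idx)
    return indices


if __name__ == "__main__":
    # Hypothetical model response; 42 is out of range and 3 is repeated.
    print(parse_plot_indices("Plot pieces 3 and 7 (also 3), maybe 42.", 10))
```

Filtering out-of-range and repeated numbers is one simple way to tolerate noisy output; a response with no usable index yields an empty list, which a caller could treat as a signal to fall back to text matching.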

The generated indices might still be noisy due to the open-ended nature of language models. When the model outputs an answer in text form, we use the ROUGE-L [19] score to find plot piece candidates whose similarity with the generated sentence is above the specified threshold α ≥ 0.5.
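The text-form fallback can be sketched as follows. This is a minimal illustration under several assumptions: ROUGE-L is computed as an F1 score over lowercased, whitespace-split tokens, the comparison is `>= alpha` with `alpha = 0.5`, and returned indices are 0-based list positions. The helper names `lcs_length`, `rouge_l_f1`, and `match_plot_pieces` are hypothetical, and the source does not state which ROUGE-L variant or tokenizer is used.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (an assumed variant)."""
    c = candidate.lower().split()
    r = reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def match_plot_pieces(generated: str, plot_pieces: list[str],
                      alpha: float = 0.5) -> list[int]:
    """Return 0-based positions of plot pieces scoring at least alpha."""
    return [i for i, piece in enumerate(plot_pieces)
            if rouge_l_f1(generated, piece) >= alpha]


if __name__ == "__main__":
    # Made-up plot pieces, for illustration only.
    plot = [
        "the detective finds a letter in the drawer",
        "two friends argue at the cafe",
        "a car chase ends at the harbor",
    ]
    print(match_plot_pieces("the detective finds a letter", plot))
```

With these made-up plot pieces, only the first one shares a long enough common subsequence with the generated sentence to pass the threshold. A recall-only score or a stricter `>` comparison would be equally plausible readings of the text.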

2.3. Visual Checking


Source: https://hackernoon.com/a-summarize-then-search-method-for-long-video-question-answering-method?source=rss