ShortScribe

Making short-form videos accessible with hierarchical video summaries

Challenge

Today, more than 1 billion users actively watch short-form videos across platforms such as TikTok, Instagram Reels, and YouTube Shorts [1]. Short-form videos usually range from 30 to 60 seconds in length and are presented in an algorithmic stream. While short-form videos are now a dominant source of information, entertainment, and new cultural references, they remain inaccessible to millions of blind and low vision (BLV) viewers due to their rapid visual changes, on-screen text, and music or meme-audio overlays.
Prior work has explored making traditional, longer-form videos accessible by narrating their visual content. However, a gap remains in how to make short-form videos accessible to BLV viewers.

Final Solution

What is the final product?

We present ShortScribe, which consists of a video pane, a description pane, and a supporting pipeline that provides BLV short-form video viewers with hierarchical video summaries: video descriptions at four levels of detail that viewers can access depending on their interest. Both panes, integrated with the supporting pipeline, are compatible with popular screen readers such as Apple VoiceOver and NVDA.
(Note: In this section, blue indicates redesigned elements, while orange represents our initial design).

How does it work?

To switch between videos, users navigate to the previous or next button with the screen reader and double-tap it. To play and pause a video, users double-tap the play/pause button. When using the screen reader, the default reading order of the sections on the video pane is A → B → C → D, and within each section content is read from top to bottom (refer to 2:21–3:15 in the video below).

Users can activate the description pane button (section C) to enter the description pane and use the screen reader to read the long description, on-screen text, and shot-by-shot description. To go back to the video pane, users double-tap the close button.

Discovery

Literature Review

Prior work → opportunity:
Lit. Review 1
Video Accessibility
• Prior work: created time-aligned descriptions for long-form videos and used alternative modalities such as haptics or sound for video accessibility; the practice of adding text descriptions to short-form videos was not widespread.
• Opportunity: we explored non-time-aligned descriptions adapted to short-form videos, using automatically generated text.
Lit. Review 2
Hierarchical Summaries and Descriptions
• Prior work: used hierarchical summaries to support searching, browsing, and skimming of text- and speech-based content (e.g., long documents and audio recordings).
• Opportunity: we considered providing hierarchical summaries of visual descriptions of videos so viewers can flexibly gain a high-level overview or low-level details about visual content.
Lit. Review 3
Social Media Accessibility
• Prior work: offered iterated guidance on effective descriptions for images and videos on social media.
• Opportunity: we built on this guidance by exploring the current practices of BLV short-form video audiences and text descriptions for making such videos accessible.

User Research

7 Participants
• 20–57 years old; blind (5 participants) or with some light perception.
• Use a screen reader on mobile devices and have experience watching short-form videos.
• Recruited via social media and mailing lists.
Background of Participants (L – left eye, R – right eye)
Research Method
We conducted a 1.5-hour-long study via 1:1 remote Zoom interviews following our protocol.
Section 1 Demographic and Background Questions (30 mins)
Section 2 Pre-Selected Video Watching Task (30 mins)
We then asked participants to watch 8 pre-selected short-form videos in a random order, provide accessibility ratings from 1 (completely inaccessible) to 7 (completely accessible), and explain what factors impacted their rating, similar to Liu et al. [17].
How did we select suitable videos?
• To ensure the 8 preselected videos covered a wide range of accessibility levels, we first created a 4-point scale to categorize the level of audio-visual match [16]: Unrelated Audio (e.g., a trending song), Somewhat Related Audio (e.g., a meme-style audio), Partially Informative Audio (e.g., recipe narration that covers some but not all details), and Mostly Informative Audio (e.g., a talking head).
• Videos were selected by scrolling through a newly created TikTok account, eliminating videos not in English or containing inappropriate content, until we found 4 videos per category.
• Finally, 2 videos were selected for each category such that the entire sample of videos covered a wide range of lengths and topics.
Section 3 Self-Selected Video Co-Watching Session (15 mins)
We invited participants to think aloud as they watched videos from their own short-form video feed on their preferred platform.
Section 4 Post-Task Interview (15 mins)
Data Analysis
To collect data, one of the authors took notes during the interviews. Another author re-watched the entire set of Zoom recordings, adding to the notes and constructing an affinity diagram [18]. The two authors later discussed the themes that emerged.

Key Findings

Finding 1
Current Practice
Watching Frequency and Purpose
Participants watched short-form videos daily or weekly mainly for entertainment, engaging with content shared by friends, and seeking information about specific interests like music, politics, etc.
“When I first started, I watched longer videos. But I find now I move to shorter videos, videos that get to the point quickly.” – P6
Accessibility Barriers and Methods
All participants faced accessibility barriers when accessing short-form videos. When encountering barriers, most participants skipped the video, while others sought help from close friends or social media groups or turned to the author-written video caption. Participants expressed frustration when they failed to find more information to better understand a video, feeling that they might be missing crucial content.
Finding 2
Short Form Video Accessibility
Overall, participants rated the accessibility of the pre-selected videos at 3.21 (σ = 1.94) on a scale of 1 (completely inaccessible) to 7 (completely accessible). Similar to Liu et al. [18], participants noted that videos with more speech, such as podcast excerpts, and singing or music-related videos were more accessible than those with less speech.
Short-form videos also presented unique accessibility challenges:
Repurposed Audio
Short-form videos that reused audio from other sources were challenging to understand. The audio might be somewhat related to the visual content, but it still did not contain enough information to understand the video.
Micro Videos
Extremely short videos (5 seconds or less) often had uninformative or ambiguous audio that interfered with screen reader audio, making these videos inaccessible.
Reaction Video
Reaction videos, in which one creator stitches another creator's video to react to it, were difficult to understand from audio alone.
Complex Actions
Short videos with excessive animations and movements were not accessible, as these videos are unlikely to be adequately described in a limited time.
Finding 3
Platform Accessibility
Participants rated platform accessibility at 3.36 (σ = 1.49) on a scale of 1 (completely inaccessible) to 7 (completely accessible) and highlighted several key accessibility barriers:
Video Controls
All participants reported that play/pause video controls were challenging to use on TikTok as they required tapping the screen, a gesture that was not supported when using VoiceOver. Difficulties with other playback controls like skipping, fast-forwarding, and rewinding were also mentioned.
Button Labels
All participants reported that clutter and a lack of clarity in button labels made the platform inaccessible.
“The comment buttons and the share buttons, I don’t know which video even they are connected to. I may find a play button, but it’s not necessarily the one for the video that I’m trying [to play].” – P6
Platform Updates
Five participants noted that updates in the platform layout, particularly changes in the button positions, incurred a steep learning curve and led to moments of inaccessibility.
Finding 4
Participant Suggested Accessibility Improvements
Add descriptions for short videos
(6 participants)
Provide access to on-screen text
(5 participants)
Easier access to the author-written video caption and user comments
(3 participants) These helped participants decide whether or not to watch a video.
Develop additional VoiceOver gestures
(2 participants) These would enable faster navigation.

Problem Statement

Based on the user research, the problem is to enhance the accessibility of short-form videos on TikTok and of the TikTok platform itself, so that BLV users can enjoy TikTok feeds without frustration or exclusion. This leads to three specific design objectives.
Design Objective 1
Video Selection
Users wanted to maintain a fast-paced video viewing and browsing experience so that they could decide which video to watch (i.e. video selection) as efficiently as sighted users. To achieve this, one key aspect was to allow BLV users to swiftly and accurately grasp a video’s content overview — a challenge we needed to solve.
Design Objective 2
Video Accessibility
The second user requirement is to easily obtain accurate, multi-level details about the selected video. The formats of selected videos varied, including visual videos with minimal dialogue, micro videos, videos with repurposed audio, and reaction clips.
Design Objective 3
Platform Accessibility
The third user requirement is to enhance the accessibility of the control panel, enabling easy navigation for BLV users via screen readers like VoiceOver.

Ideation

Brainstorm

Design Objective 1
Video Selection
Idea
Short Description
To grasp a video’s content overview quickly and decide whether to watch it, we provided a short description of video content so that BLV users can quickly access essential information and make decisions.
Design Objective 2
Video Accessibility
Idea
On-Screen Text, Long Description, and Shot-by-Shot Description
To allow BLV users to easily obtain multi-level details about their selected videos, we provided on-screen text, a long description, and a shot-by-shot breakdown, allowing users to choose based on their preferred level of detail and accuracy.
Design Objective 3
Platform Accessibility
Idea
Video Playback Control
To address inaccessible video controls for screen readers, we could redesign the interface with accessibility guidelines to achieve more efficient navigation for BLV users.

Merge & Selection

Final Idea Rationale

Implementation (Pipeline)

Evaluation

Pipeline Evaluation

We measured the coverage and accuracy of the short descriptions, long descriptions, shot-by-shot descriptions, BLIP-2-generated image descriptions, and on-screen text produced by ShortScribe.
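The implementation details are not covered in this section; purely as a rough illustration of how per-frame image descriptions like the BLIP-2 outputs evaluated here could be generated, below is a minimal Python sketch using the Hugging Face transformers BLIP-2 checkpoint. The uniform one-frame-per-second sampling, the checkpoint choice, and the describe_frames helper are illustrative assumptions, not ShortScribe's actual pipeline.

```python
# Minimal sketch (not ShortScribe's actual pipeline): caption uniformly sampled
# video frames with BLIP-2 via Hugging Face transformers. Checkpoint choice and
# one-frame-per-second sampling are illustrative assumptions.
import cv2  # pip install opencv-python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

def describe_frames(video_path: str, frames_per_second: float = 1.0) -> list[str]:
    """Return one BLIP-2 caption per uniformly sampled frame of the video."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps / frames_per_second), 1)
    captions, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            # Convert the BGR OpenCV frame to an RGB PIL image for the processor.
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = processor(images=image, return_tensors="pt").to(device)
            output_ids = model.generate(**inputs, max_new_tokens=40)
            captions.append(processor.decode(output_ids[0], skip_special_tokens=True).strip())
        index += 1
    capture.release()
    return captions

# Example usage (hypothetical file path):
# print(describe_frames("example_short_form_video.mp4"))
```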
Dataset
Dataset 2 (8 short-form videos)
Data Analysis
An example of accuracy analysis of a video
Here the hallucinations are highlighted in red in each description type. The video depicts a lighthearted singalong. BLIP-2 mistakenly describes a toddler concentrating on singing as angry, and the on-screen text shows a quiz with the lyrics of a sad song (All Too Well by Taylor Swift). The long description, and then the short description, incorrectly infer that the video is sad.
We calculated the weighted Cohen’s kappa for each type of description to demonstrate the inter-rater reliability of our accuracy and coverage codes as evaluated by two researchers.
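As a minimal illustration of this reliability calculation (not our actual analysis script), the sketch below computes a weighted Cohen's kappa between two researchers' per-video codes with scikit-learn; the example codes and the quadratic weighting are assumptions, since the coding scale and weighting scheme are not specified in this section.

```python
# Minimal sketch: weighted Cohen's kappa between two researchers' codes.
# The example codes and the "quadratic" weighting are illustrative assumptions.
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-video coverage codes (e.g., number of rubric items covered).
researcher_a = [3, 2, 4, 4, 1, 3, 2, 4]
researcher_b = [3, 3, 4, 4, 1, 2, 2, 4]

kappa = cohen_kappa_score(researcher_a, researcher_b, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```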
Result
Inter-rater reliability
The “moderate” to “almost perfect” agreement between the two researchers indicates that our evaluation results here are valid and reliable.
Why?
Agreement was “moderate” to “almost perfect” (κ = 0.50–1.0) for accuracy and coverage across all description types, except shot-by-shot description accuracy, which had “fair” agreement (κ = 0.24) [19].

In examining the disagreements for shot-by-shot descriptions, the researchers' counts of inaccurate statements differed by 1.16 on average (σ = 1.38). As shot-by-shot descriptions were much longer than the other descriptions (averaging 190 words), disagreements primarily occurred because one researcher noticed an error the other missed, and vice versa, rather than because of disagreement about the established codes.
Accuracy
Overall, a majority of short, 50-word and long descriptions did not contain incorrect statements. Short descriptions had the lowest percentage of videos with hallucinations while shot-by-shot descriptions had the highest.
Coverage
Overall, long descriptions and shot-by-shot descriptions captured all of the important details that we identified for the eight videos. The short and 50-word descriptions captured 75% and 90% of the important details, respectively.
ShortScribe descriptions are comparable to human video descriptions in terms of coverage but are generally more verbose.

User Study

10 Participants
• 6 were new and 4 had participated in the formative study; recruitment criteria were the same as in the formative research.
Research Method
We conducted a within-subjects study in which participants used both ShortScribe and a baseline interface to watch short-form videos. The study was 1.5 hours long and conducted in a 1:1 session via Zoom.
The study consisted of two tasks: a video comprehension task (Session 2) and a video selection task (Session 3). For each task, participants viewed one video group with the ShortScribe interface and one video group with the baseline interface. We randomized and counterbalanced the videos and interface pairs within each task. Here is the protocol procedure:
Section 1 Background Questions and Tutorial (20 mins)
Section 2 Video Comprehension Task (30 mins)
An example of experiment design
• Dataset: We split the 8 selected videos (i.e. Dataset 2) into two groups (VG1 and VG2). Each group had a video with reused meme audio, a video with audio original to the creator, a recipe video with limited auditory description, and a talking head of a person listing items within a theme. This ensured that both groups had similar levels of accessibility.
• Experiment design (2 participants per order condition; see the counterbalancing sketch after this list):
2 participants → VG 1 (baseline interface) and VG 2 (ShortScribe interface)
2 participants → VG 2 (ShortScribe interface) and VG 1 (baseline interface)
2 participants → VG 1 (ShortScribe interface) and VG 2 (baseline interface)
2 participants → VG 2 (baseline interface) and VG 1 (ShortScribe interface)
• Instruction: For each video, we allowed participants to explore as much information as they wanted and to watch the video as many times as they wanted. We then asked participants to provide a short summary of the video, share any questions they had about it, and rate their understanding of the video on a scale from 1 (did not understand) to 7 (completely understood).
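For reference, here is a small sketch of the counterbalanced assignment described above: it enumerates the four order conditions (which video group is seen first × which interface is used first) and assigns two participants to each. The participant IDs and the round-robin assignment are hypothetical illustrations, not our actual assignment procedure.

```python
# Minimal sketch of the counterbalanced assignment described above:
# 4 order conditions (video-group order x interface order), 2 participants each.
# Participant IDs and the round-robin assignment are hypothetical.
from itertools import product

video_group_orders = [("VG1", "VG2"), ("VG2", "VG1")]
interface_orders = [("baseline", "ShortScribe"), ("ShortScribe", "baseline")]

conditions = list(product(video_group_orders, interface_orders))  # 4 conditions
participants = [f"P{i}" for i in range(1, 9)]                     # 8 participants

for i, participant in enumerate(participants):
    (first_group, second_group), (first_ui, second_ui) = conditions[i % len(conditions)]
    print(f"{participant}: {first_group} with {first_ui}, then {second_group} with {second_ui}")
```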
Section 3 Video Selection Task (20 mins)
Section 4 Final Interview (10 mins)
Data Analysis
We recorded all interactions with both interfaces and transcribed the interviews. The researchers graded the participant summaries using the same human-generated rubrics used to analyze coverage in the pipeline evaluation, and derived themes.

Key Findings

At A High Level
Participants rated how useful they found each interface for understanding and for selecting videos on a scale from 1 (not useful) to 7 (very useful). For both questions, participants rated ShortScribe as significantly more useful than the baseline interface: video comprehension (Z = 2.78, p < 0.01) and video selection (Z = 2.63, p < 0.01); see the sketch below this overview for how such a paired comparison can be computed.
All of the description features provided by ShortScribe (long, short, shot-by-shot, and on-screen text) were ranked higher in terms of usefulness when compared to the baseline information (original caption, engagement numbers, audio source, username).
Overall, all participants preferred using ShortScribe over the baseline interface to watch short-form videos. Their average willingness to use ShortScribe in the future was 6.7 (σ = 0.67) on a scale from 1 (not likely to use it in the future) to 7 (very likely to use it in the future).
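The significance statistics above come from paired comparisons of participants' ratings under the two interfaces. The test is not named in this section; assuming a Wilcoxon signed-rank test on paired 7-point ratings, such a comparison could be computed as in the sketch below (the ratings shown are made up for illustration).

```python
# Minimal sketch: paired comparison of per-participant usefulness ratings for
# ShortScribe vs. the baseline interface. The ratings are hypothetical, and the
# Wilcoxon signed-rank test is an assumption (the test is not named in the text).
from scipy.stats import wilcoxon

shortscribe_ratings = [7, 6, 7, 5, 6, 7, 6, 7, 5, 6]  # hypothetical, 1-7 scale
baseline_ratings = [4, 3, 5, 4, 3, 5, 4, 4, 3, 4]     # hypothetical, 1-7 scale

# Within-subjects (paired) comparison; scipy reports the W statistic and p-value,
# whereas the Z values in the text would come from the test's normal approximation.
result = wilcoxon(shortscribe_ratings, baseline_ratings)
print(f"W = {result.statistic}, p = {result.pvalue:.4f}")
```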
Specifically
Video Comprehension ← Design Objective 2
Interpretation of the two figures:
Video comprehension for videos V1–V8 using our ShortScribe system (left bars, blue) and the baseline interface (right bars, orange), measured by scoring participant-written video summaries (left figure, Video Summary Scores) and by participants' ratings of their video understanding (right figure, Video Understanding Ratings). Understanding ratings ranged from 1 (did not understand) to 7 (completely understood). Error bars depict the 95% confidence interval.
Both the accuracy of participant-written summaries (objective; Z = 4.61, p < 0.01) and participants' self-reported video understanding (subjective; Z = 4.99, p < 0.01) significantly improved when using ShortScribe compared to the baseline interface.
The main reason was that participants spent more time reading the descriptions, which were more accurate and improved comprehension, rather than relying on author-written captions, which varied in quality and could be misleading.
“Usually captions are not as helpful because people are trying to draw attention, and put in hashtags.” – P3
How do long descriptions, shot-by-shot descriptions, and on-screen text help video comprehension specifically?
Long Description: Participants rated it 6.2 (σ = 0.89) on the 7-point scale for usefulness as it provided a good balance between detail and conciseness.
Shot-by-Shot Description: Participants rated it 6.0 (σ = 1.25), as these descriptions offered the most detailed and sequential narrative of the video.
“It get[s] the sequence and the whole message of the video.” – P6
On-screen text: Participants reported it was particularly useful when videos contained important text information, such as ingredients in a recipe or a joke.
“It is a personal touch from the creator of the video.” – P10
Participants appreciated the flexibility of choosing which descriptions to read, allowing them to strike a balance between the effort they were willing to invest and their desire for more information.
“If there’s too much information, it’s overwhelming.” – P9
Video Selection ← Design Objective 1
Participants rated the short description an average of 5.7 (σ = 1.06) for usefulness and reported that it helped them gain a concise overview of the video to assess whether they were interested in watching it and/or exploring additional descriptions.
“It is a brief glance about what it’s gonna be about” – P9
Interaction Improvements ← Design Objective 3
Participants appreciated the explicit video controls in our interface, which were a significant improvement over the original TikTok video interface. They also suggested providing access to the length of the video and enabling auto-pause instead of auto-play while scrolling.
Other Benefits
Even for videos considered accessible, users could use the descriptions to quickly confirm that they were not missing any information.
Unnoticed Errors
Although the descriptions were shown to be largely accurate in the pipeline evaluation, participants still encountered some errors in the ShortScribe descriptions, often without being aware of them.
Redundancy
There was too much repetition of the same content across the four descriptions.

Discussion

Potential Impact

Significant support for BLV users in consuming short-form videos.
ShortScribe significantly enhanced BLV users' experience of consuming short-form videos. The high coverage and low number of errors make the descriptions immediately useful, such that all users reported wanting to use them in the future.
“this makes me feel like I could view more videos” (P3)
“it would give me a whole new avenue […] I would even pay for it” (P6)
Potential for Further Application
Our work demonstrates the potential to create useful visual descriptions at multiple levels of detail for temporal media. ShortScribe could be extended to long-form videos, live-stream recordings, or 360-degree videos, and applied to diverse user contexts such as digital comics and shopping images.

Limitation

Accuracy of the descriptions
Our summarization approach generally corrects earlier errors in the pipeline, but it can sometimes exacerbate them. Although users gave no direct feedback about these errors, it is important to address them to avoid bias and the exclusion of BLV audiences. Future efforts will focus on updating the pipeline's vision-to-language model to reduce errors.
Redundancy of the descriptions
Our pipeline produces description redundancies, both with the original audio of a video and between different descriptions. Future systems should support users in deciding whether to allow repetition or not.
Complex-Action and Reaction Video Challenges
ShortScribe still lacked support for specific types of videos, such as videos with complex actions or reaction videos. Our pipeline struggled to produce good descriptions of intricate actions within a tight timeframe, a recognized challenge; dance videos, characterized by quick, complex movements and expressive facial cues that convey emotion, are a notable example.
Lack of Customization
Our one-size-fits-all pipeline does not allow customization of the description content. Future work may explore letting users directly customize summarization prompts (e.g., “leave out any color information”).

Key Takeaways

1. Quantitative and Qualitative Research Methods: I employed both quantitative and qualitative research methods, including a formative study with task-based interviews and a within-subjects user study. These approaches provided valuable experience in designing contextually appropriate studies and executing in-depth data analysis, yielding meaningful insights.
2. Working with the Development Team: As the sole user researcher among developers, I learned how to effectively integrate into the team's workflow and provide support. For instance, we co-designed a system prototype to balance design considerations with development feasibility. I also jointly evaluated the pipeline's capabilities when I realized doing so would provide valuable insights for user study design.
3. Working with the Accessibility Group: Working with the accessibility group, I deepened my empathy for and understanding of blind and low vision (BLV) users, which broadened my experience with inclusive and accessible design.
