new terse format #1

Laurian · 2021-02-27T15:36:52Z

There is a need for a smaller JSON format that is easy to transfer to and from a server (think of multiple saves from an editor) and is still readable, at least for debugging.

This format should also cover the usual STT output without data loss, and allow for free text (corrections made in an editor without alignment) and remixes.

For STT data where the full text can be reconstructed by joining the words (items) with spaces, it could look like this:
(note: several fields are optional, depending on what data you actually need for a specific use case)

{
  id: '', // optional
  type: 'transcript', // optional, might help when you mix storage with other types (remix, etc.)
  media: 'url or id', // optional
  duration: 3600, // optional
  metadata: {}, // optional
  segments: [
    {
      id: '', // optional
      start: 0, // optional, when present the start times in items[] are relative to this value!
      duration: 5, // optional
      speaker: 'name or id', // optional, should this be part of metadata field?
      metadata: {}, // optional
      items: [
        // ['text', start, duration, { /* optional metadata */ }]
        // duration can be omitted if there is no gap to the next item
        ["Ladies", 85.940, 0.370],
        ["and", 86.310, 0.120],
        ["gentlemen,", 86.430, 0.560],
      ]
    },
    // other segments
  ]
}

With this format, the Amazon Transcribe output for a 3 h transcript went from 32 MB of JSON down to 600 kB.
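A consumer will usually want to expand the terse items back into verbose word objects. The following is a minimal sketch, not a reference implementation: `expandSegment` is a hypothetical helper name, and it assumes that item start times are relative to `segment.start` when present and that an omitted duration means the item runs until the next item starts.

```javascript
// Hypothetical helper: expand one terse segment into verbose word objects.
// Assumptions: item starts are relative to segment.start (default 0), and an
// omitted duration means "no gap", i.e. the item lasts until the next one.
function expandSegment(segment) {
  const base = segment.start ?? 0;
  const items = segment.items ?? [];
  const words = items.map(([text, start, duration, metadata], i) => {
    const next = items[i + 1];
    // The last item has no successor, so its duration stays undefined.
    const resolved = duration ?? (next ? next[1] - start : undefined);
    return { text, start: base + start, duration: resolved, metadata: metadata ?? {} };
  });
  // The segment text is synthesised by joining the items with spaces.
  return { speaker: segment.speaker, text: words.map((w) => w.text).join(' '), words };
}

const terseSegment = {
  start: 0,
  items: [
    ['Ladies', 85.94, 0.37],
    ['and', 86.31], // duration omitted: no gap before the next item
    ['gentlemen,', 86.43, 0.56],
  ],
};

const expanded = expandSegment(terseSegment);
console.log(expanded.text); // "Ladies and gentlemen,"
```

Whether the final item of a segment should fall back to the segment end when its duration is omitted is left open here.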

Bringing this into an editor requires support for unaligned text content: each segment keeps its raw text, and each item carries an offset into that text:

// a segment, showing only changes from above
{
  start: 0, // required if you need to align free text within the segment time range; start times in items[] are relative to this value!
  duration: 5, // required, for the same reason as start
  text: 'Ladies and gentlemen, today is', // required
  items: [
    // [offset, 'text', start, duration, { /* optional metadata */ }]
    // duration can be omitted if there is no gap to the next item
    [0, "Ladies", 85.940, 0.370],
    [7, "and", 86.310, 0.120],
    [11, "gentlemen,", 86.430, 0.560],
  ]
}
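The offsets can be derived from the raw text and checked again after an edit. This is only a sketch under stated assumptions: `addOffsets` and `findMisalignedItems` are hypothetical names, offsets are counted in JavaScript string indices (UTF-16 code units), and words are matched in order by a plain substring search.

```javascript
// Hypothetical helper: derive offsets for STT items ['text', start, duration]
// by searching for each word in order, starting after the previous match.
function addOffsets(text, items) {
  let cursor = 0;
  return items.map(([word, ...rest]) => {
    const offset = text.indexOf(word, cursor);
    if (offset === -1) throw new Error(`"${word}" not found after offset ${cursor}`);
    cursor = offset + word.length;
    return [offset, word, ...rest];
  });
}

// Hypothetical helper: return the items whose offset no longer points at
// their own text inside segment.text (for example after a free-text edit).
function findMisalignedItems(segment) {
  return segment.items.filter(
    ([offset, word]) => segment.text.slice(offset, offset + word.length) !== word
  );
}

const rawText = 'Ladies and gentlemen, today is';
const alignedItems = addOffsets(rawText, [
  ['Ladies', 85.94, 0.37],
  ['and', 86.31, 0.12],
  ['gentlemen,', 86.43, 0.56],
]);
console.log(alignedItems.map(([offset]) => offset)); // [ 0, 7, 11 ]

// After a correction in the editor, some items stop matching the text.
const editedSegment = { text: 'Ladies & gentlemen, today is', items: alignedItems };
console.log(findMisalignedItems(editedSegment).length); // 2
```

How misaligned items are then handled (dropped, re-aligned, or kept as timing hints) is a separate design decision.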

In special cases you could make the items even more terse, [offset, length, start, duration], but that would be harder to read and would impair debugging.
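Converting between the two item shapes is mechanical, since the word can always be recovered from the segment text. A small sketch, with hypothetical function names and lengths measured in JavaScript string indices:

```javascript
// Hypothetical helper: [offset, 'text', ...timing] -> [offset, length, ...timing]
function toOffsetLength([offset, word, ...rest]) {
  return [offset, word.length, ...rest];
}

// Hypothetical helper: recover the readable form from the segment text.
function fromOffsetLength(segmentText, [offset, length, ...rest]) {
  return [offset, segmentText.slice(offset, offset + length), ...rest];
}

const segmentText = 'Ladies and gentlemen, today is';
const readable = [11, 'gentlemen,', 86.43, 0.56];

const compact = toOffsetLength(readable);
console.log(compact); // [ 11, 10, 86.43, 0.56 ]
console.log(fromOffsetLength(segmentText, compact)); // [ 11, 'gentlemen,', 86.43, 0.56 ]
```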

As for a remix, all that is needed is the media (id or url) per segment, with start/duration describing the subclip in that media. In other words, the transcript format is by default a remix of a single media.
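To illustrate the "remix of a single media" idea, a transcript can be rewritten so that every segment carries its own media reference. This is a sketch only: `toRemix` is a hypothetical helper, and the per-segment `media` field and the `'remix'` type value are assumptions based on the description above, not a settled schema.

```javascript
// Hypothetical helper: copy the top-level media reference onto each segment.
// A segment that already has its own media keeps it.
function toRemix(transcript) {
  const { media, segments = [], ...rest } = transcript;
  return {
    ...rest,
    type: 'remix', // assumed type value
    segments: segments.map((segment) => ({ media, ...segment })),
  };
}

const transcript = {
  type: 'transcript',
  media: 'interview.mp4',
  segments: [
    { start: 0, duration: 5, items: [['Ladies', 85.94, 0.37]] },
    { media: 'b-roll.mp4', start: 12, duration: 3, items: [] },
  ],
};

const remix = toRemix(transcript);
console.log(remix.segments.map((s) => s.media)); // [ 'interview.mp4', 'b-roll.mp4' ]
```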


Laurian added the question label Feb 27, 2021

Laurian self-assigned this Feb 27, 2021
