
new terse format #1

Open
Laurian opened this issue Feb 27, 2021 · 0 comments

Laurian commented Feb 27, 2021

There is a need for a smaller JSON format that can be easily transferred to and from a server (think of multiple saves from an editor) and still be readable, at least for debugging purposes.

This format should also cover the usual STT output without data loss, and allow for free text (corrections made in an editor, without alignment) and remixes.

Basically, for STT data where all the text can be synthesised by joining the words (items) with spaces, it could look like this:
(note: several fields are optional, depending on what data you actually need for a specific use case)

{
  id: '', // optional
  type: 'transcript', // optional, might help when you mix storage with other types (remix, etc.)
  media: 'url or id', // optional
  duration: 3600, // optional
  metadata: {}, // optional
  segments: [
    {
      id: '', // optional
      start: 0, // optional, when present the start times in items[] are relative to this value!
      duration: 5, // optional
      speaker: 'name or id', // optional, should this be part of metadata field?
      metadata: {}, // optional
      items: [
        // ['text', start, duration, { /* optional metadata */ }]
        // duration can be omitted if there is no gap to the next item
        ["Ladies", 85.940, 0.370],
        ["and", 86.310, 0.120],
        ["gentlemen,", 86.430, 0.560],
      ]
    },
    // other segments
  ]
}
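
To illustrate how a consumer might read this shape, here is a minimal sketch that rebuilds the text by joining items with spaces and resolves item times against the segment start. `expandSegment` is a hypothetical helper, not part of any existing library, and it assumes that an omitted duration means the item runs until the next item starts.

```javascript
// Sketch: expand a terse segment into plain text and absolute word timings.
// Assumptions: item start times are relative to segment.start when present,
// and a missing duration means "runs until the next item starts".
function expandSegment(segment) {
  const base = segment.start ?? 0;
  const items = segment.items ?? [];
  const words = items.map(([text, start, duration, metadata], i) => {
    const next = items[i + 1];
    // No duration: fall back to the distance to the next item, if there is one.
    const resolved = duration ?? (next ? next[1] - start : undefined);
    return { text, start: base + start, duration: resolved, metadata };
  });
  return { text: words.map((w) => w.text).join(' '), words };
}

const segment = {
  start: 0,
  items: [
    ['Ladies', 85.94, 0.37],
    ['and', 86.31], // duration omitted: no gap before "gentlemen,"
    ['gentlemen,', 86.43, 0.56],
  ],
};

const expanded = expandSegment(segment);
console.log(expanded.text); // "Ladies and gentlemen,"
```

The last item of a segment has no successor, so in this sketch its duration stays undefined when omitted; a real reader might fall back to the segment's own end time instead.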

With this format, I reduced the Amazon Transcribe output of a 3h transcript from 32 MB of JSON to 600 kB.
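
A conversion along these lines could be sketched as below. It assumes the usual Amazon Transcribe result shape (`results.items[]` with string `start_time`/`end_time`, a `type` of `pronunciation` or `punctuation`, and `alternatives[0].content`); `transcribeToTerse` and `round3` are hypothetical names, the sample input is made up to match the example above, and everything is placed in a single segment for brevity.

```javascript
// Round to milliseconds so float subtraction noise does not bloat the JSON.
const round3 = (n) => Math.round(n * 1000) / 1000;

// Sketch: convert Transcribe-style items into terse [text, start, duration] items.
function transcribeToTerse(results) {
  const items = [];
  for (const item of results.items) {
    const content = item.alternatives[0].content;
    if (item.type === 'punctuation') {
      // Punctuation carries no timing: glue it onto the previous word.
      if (items.length > 0) items[items.length - 1][0] += content;
      continue;
    }
    const start = Number(item.start_time);
    const end = Number(item.end_time);
    items.push([content, start, round3(end - start)]);
  }
  return { type: 'transcript', segments: [{ items }] };
}

// Made-up sample input in the assumed Transcribe item shape.
const sample = {
  items: [
    { type: 'pronunciation', start_time: '85.94', end_time: '86.31', alternatives: [{ content: 'Ladies' }] },
    { type: 'pronunciation', start_time: '86.31', end_time: '86.43', alternatives: [{ content: 'and' }] },
    { type: 'pronunciation', start_time: '86.43', end_time: '86.99', alternatives: [{ content: 'gentlemen' }] },
    { type: 'punctuation', alternatives: [{ content: ',' }] },
  ],
};

console.log(JSON.stringify(transcribeToTerse(sample).segments[0].items));
```

This sketch drops the per-word confidence; to stay lossless, that value could go into the optional metadata slot of each item.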

Bringing this into an editor requires allowing for unaligned text content: basically, each segment holds its raw text and each item carries an offset into that text:

// a segment, showing only changes from above
{
  start: 0, // required if you need to align free text within the segment time range; start times in items[] are relative to this value!
  duration: 5, // required ^^^
  text: 'Ladies and gentlemen, today is', // required
  items: [
    // [offset, 'text', start, duration, { /* optional metadata */ }]
    // duration can be omitted if there is no gap to the next item
    [0, "Ladies", 85.940, 0.370],
    [7, "and", 86.310, 0.120],
    [11, "gentlemen,", 86.430, 0.560],
  ]
}
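
As a rough sketch of how an editor could work with these offsets: one helper derives offsets for fully aligned items, and another reports items whose text no longer matches the segment text after a free-text correction. `withOffsets` and `staleItems` are hypothetical names, and offsets are assumed to be JavaScript string indices (UTF-16 code units).

```javascript
// Derive offsets for aligned items, assuming text is the items joined by single spaces.
function withOffsets(items) {
  let offset = 0;
  return items.map(([text, ...rest]) => {
    const entry = [offset, text, ...rest];
    offset += text.length + 1; // +1 for the joining space
    return entry;
  });
}

// Return the items whose text is no longer found at their recorded offset.
function staleItems(segment) {
  return segment.items.filter(
    ([offset, text]) => segment.text.slice(offset, offset + text.length) !== text
  );
}

const items = withOffsets([
  ['Ladies', 85.94, 0.37],
  ['and', 86.31, 0.12],
  ['gentlemen,', 86.43, 0.56],
]);
console.log(items.map((item) => item[0])); // [ 0, 7, 11 ]

// After a manual correction, the items that moved or changed show up as stale.
const edited = { text: 'Ladies & gentlemen, today is', items };
console.log(staleItems(edited).map((item) => item[1])); // [ 'and', 'gentlemen,' ]
```

Stale items keep their timing, so an editor could re-align them later instead of discarding them.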

In special cases you could make the items even more terse: [offset, length, start, duration], but that would be a bit unreadable and impair debugging.
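
For illustration, converting between the readable and the more terse item shapes might look like the sketch below. `toCompact` and `fromCompact` are hypothetical helpers, and the round trip is only faithful while each item's text still matches the segment text at its offset.

```javascript
// Readable -> compact: replace each item's text with its length.
function toCompact(segment) {
  return {
    ...segment,
    items: segment.items.map(([offset, text, ...rest]) => [offset, text.length, ...rest]),
  };
}

// Compact -> readable: recover each item's text by slicing the segment text.
function fromCompact(segment) {
  return {
    ...segment,
    items: segment.items.map(([offset, length, ...rest]) => [
      offset,
      segment.text.slice(offset, offset + length),
      ...rest,
    ]),
  };
}

const readable = {
  start: 0,
  duration: 5,
  text: 'Ladies and gentlemen, today is',
  items: [
    [0, 'Ladies', 85.94, 0.37],
    [7, 'and', 86.31, 0.12],
    [11, 'gentlemen,', 86.43, 0.56],
  ],
};

const compact = toCompact(readable);
console.log(JSON.stringify(compact.items)); // [[0,6,85.94,0.37],[7,3,86.31,0.12],[11,10,86.43,0.56]]
```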

As for a remix, all that is needed is the media (id or url) per segment, with start/duration describing a subclip in that media. Basically, the transcript format is by default a remix of a single media.
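
A minimal sketch of that idea, assuming a per-segment `media` field that falls back to the top-level `media` when absent. The `toClips` helper and the sample ids are made up for illustration.

```javascript
// Sketch: list the sub-clips a player would need, one per segment.
// A plain transcript has no per-segment media, so every clip uses doc.media.
function toClips(doc) {
  return doc.segments.map((segment) => ({
    media: segment.media ?? doc.media,
    start: segment.start ?? 0,
    duration: segment.duration,
  }));
}

// Hypothetical remix: two segments cut from two different media.
const remix = {
  type: 'remix',
  segments: [
    { media: 'media-a', start: 85, duration: 5, items: [['Ladies', 0.94, 0.37]] },
    { media: 'media-b', start: 12, duration: 3, items: [['Hello', 0.2, 0.4]] },
  ],
};

console.log(toClips(remix));
```

With this reading, a transcript and a remix share one code path, and only the media lookup differs.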

@Laurian Laurian self-assigned this Feb 27, 2021