There is a need for a smaller JSON format that can be easily transferred in and out of a server (think of multiple saves from an editor) while still being readable, at least for debugging purposes.
This format should also cover the usual STT output without data loss, allow for free text (corrections in an editor without alignment), and support remixes.
Basically, for STT data where all the text can be synthesised by joining the words (items) with spaces, it could look like this:
(note: several fields are optional, depending on what data you actually need for a specific use case)
```js
{
  id: '',                   // optional
  type: 'transcript',       // optional, might help when you mix storage with other types (remix, etc.)
  media: 'url or id',       // optional
  duration: 3600,           // optional
  metadata: {},             // optional
  segments: [
    {
      id: '',               // optional
      start: 0,             // optional, when present the start times in items[] are relative to this value!
      duration: 5,          // optional
      speaker: 'name or id', // optional, should this be part of the metadata field?
      metadata: {},         // optional
      items: [
        // ['text', start, duration, { /* optional metadata */ }]
        // duration can be omitted if there is no gap to the next item
        ["Ladies", 85.940, 0.370],
        ["and", 86.310, 0.120],
        ["gentlemen,", 86.430, 0.560],
      ]
    },
    // other segments
  ]
}
```
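To illustrate the "text is synthesised from the items" property, here is a minimal sketch (the helper names are mine, not part of the proposal):

```js
// Reconstruct the plain text by joining item texts with spaces.
function segmentText(segment) {
  return segment.items.map(([text]) => text).join(' ');
}

// The whole transcript joins its segment texts, e.g. one segment per line.
function transcriptText(transcript) {
  return transcript.segments.map(segmentText).join('\n');
}
```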
With this format, a 3-hour transcript whose Amazon Transcribe output was 32 MB of JSON came down to 600 kB of JSON.
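A rough converter sketch could look like the following (assuming the standard Transcribe output shape with `results.items[]`, where pronunciation items carry string `start_time`/`end_time` and punctuation items carry neither; function and variable names are mine). It produces a single segment with absolute item times:

```js
function fromTranscribe(transcribeJson) {
  const items = [];
  for (const item of transcribeJson.results.items) {
    const content = item.alternatives[0].content;
    if (item.type === 'punctuation') {
      // Attach punctuation to the previous word ("gentlemen" + "," -> "gentlemen,")
      if (items.length) items[items.length - 1][0] += content;
      continue;
    }
    const start = parseFloat(item.start_time);
    const duration = parseFloat(item.end_time) - start;
    items.push([content, +start.toFixed(3), +duration.toFixed(3)]);
  }
  return { type: 'transcript', segments: [{ items }] };
}
```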
Bringing this into an editor will have to allow for unaligned text content: basically, each segment keeps its raw text, and each item has an offset into that text:
```js
// a segment, showing only changes from above
{
  start: 0,     // required if you need to align free text within the segment time range; start times in items[] are relative to this value!
  duration: 5,  // required ^^^
  text: 'Ladies and gentlemen, today is',  // required
  items: [
    // [offset, 'text', start, duration, { /* optional metadata */ }]
    // duration can be omitted if there is no gap to the next item
    [0, "Ladies", 85.940, 0.370],
    [7, "and", 86.310, 0.120],
    [11, "gentlemen,", 86.430, 0.560],
  ]
}
```
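A small sketch of how an editor might use the offsets (helper names are mine): map a caret position in `segment.text` back to a timed item, falling back to the previous aligned word for text inserted in between.

```js
// Find the last aligned item whose offset is at or before the caret.
function itemAtCaret(segment, caret) {
  let found;
  for (const item of segment.items) {
    if (item[0] <= caret) found = item;  // item[0] is the offset into segment.text
    else break;
  }
  return found; // undefined if the caret sits before the first aligned item
}

// Absolute media time for an item: item start times are relative to segment.start.
function absoluteStart(segment, item) {
  return segment.start + item[2];
}
```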
In special cases you could make the items even more terse: [offset, length, start, duration], but that would be a bit unreadable and would impair debugging.
As for a remix, all that is needed is a media (id or url) per segment (plus start/duration acting as a subclip within that media). Basically, the transcript format is by default a remix of a single media source.
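For illustration, a remix could then look like this (a sketch following the same conventions; the values are placeholders):

```js
// a remix: each segment points at its own media and subclips it via start/duration
{
  type: 'remix',          // distinguishes it from a plain transcript
  segments: [
    {
      media: 'url or id', // per-segment media reference
      start: 85.940,      // subclip start within that media
      duration: 5.2,      // subclip duration
      items: [ /* same item format as above */ ]
    },
    // segments from other media follow
  ]
}
```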