@GreatHalf thanks for flagging this and apologies for the delayed response.

The behavior comes from transcribe_speech_parallel.py. That script uses PyTorch Lightning’s predict() for multi-GPU / multi-node inference, and the timestamp utilities are not wired into that path. As a result, word-level start/end times are not returned there.

Timestamp outputs in seconds are available when using model.transcribe() or the transcribe_speech.py script.

If you continue with the parallel script, timestamps can be reconstructed from the offsets:

start_sec = start_offset * window_stride * model_subsampling_factor

For v3:

window_stride = 0.01
model_subsampling_factor = 8

Example:

start_offset = 13
start_…
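The conversion above can be sketched as a small helper. The constants are the v3 values quoted in this thread; for other models they should come from the model's config (the function and constant names here are just illustrative, not NeMo API):

```python
# Convert a decoder frame offset to seconds, per the formula above:
#   start_sec = start_offset * window_stride * model_subsampling_factor
# Values below are the v3 numbers from this thread.
WINDOW_STRIDE = 0.01          # feature hop in seconds (10 ms)
SUBSAMPLING_FACTOR = 8        # encoder subsampling factor for v3

def offset_to_sec(offset: int,
                  window_stride: float = WINDOW_STRIDE,
                  subsampling: int = SUBSAMPLING_FACTOR) -> float:
    """Map a token's frame offset to a time in seconds."""
    return offset * window_stride * subsampling

# Worked example from above: 13 * 0.01 * 8 = 1.04 seconds
print(offset_to_sec(13))
```

The same formula applies to end offsets, giving word-level end times as well.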

Answer selected by GreatHalf