(Parakeet V3) I can't convert the RNNTDecodingConfig timestamps into actual seconds #15195
Hey, I feel like I'm missing something. How do I get from the "13" in Parakeet to the 1.12 seconds, and then, with the same formula, from the "961" in Parakeet to the 76.88 seconds? Below is the code I'm using to transcribe and get the timestamps.
Replies: 2 comments 5 replies
If there are any pointers, I would gladly continue my research.
RNNT timestamp conversion can be tricky! At RevolutionAI (https://revolutionai.io) we use NeMo for ASR.

Conversion formula:

    def frames_to_seconds(frame_idx, hop_length=160, sample_rate=16000):
        return frame_idx * hop_length / sample_rate

    # Example: frame 100 with default params
    seconds = 100 * 160 / 16000  # = 1.0 second

Key params:

In config:

    preprocessor:
      sample_rate: 16000
      window_size: 0.025
      window_stride: 0.01  # hop = 0.01 * 16000 = 160

What values are in your model config?
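
One thing the formula above leaves out: the offsets reported by the decoder are usually counted in encoder output frames, not in preprocessor (mel) frames, so the encoder's subsampling factor has to be multiplied in as well. Here is a minimal sketch of that. The function name `offset_to_seconds` and the default `subsampling_factor=8` are assumptions for illustration, not values read from the model config; 8 is what makes the "961" from the question come out at 76.88 s, so please check the actual factor in your own config.

```python
def offset_to_seconds(offset, hop_length=160, sample_rate=16000, subsampling_factor=8):
    """Convert an encoder-frame offset to seconds.

    hop_length / sample_rate is the preprocessor window_stride (0.01 s here).
    subsampling_factor is an ASSUMED encoder downsampling factor; verify it
    against your own model config before relying on the result.
    """
    # Multiply the integers first, then divide once, to limit float rounding.
    return offset * subsampling_factor * hop_length / sample_rate


if __name__ == "__main__":
    # Offsets taken from the question at the top of the thread.
    for offset in (13, 961):
        print(offset, "->", round(offset_to_seconds(offset), 2), "s")
```

Under these assumptions each offset step is 0.08 s. That reproduces 961 -> 76.88 s exactly, but 13 comes out at 1.04 s rather than 1.12 s, so the first pair in the question may not be the same kind of value (for example a start offset paired with an end time).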
@GreatHalf thanks for flagging this and apologies for the delayed response.
The behavior comes from transcribe_speech_parallel.py. That script uses PyTorch Lightning’s predict() for multi-GPU / multi-node inference, and the timestamp utilities are not wired into that path. As a result, word-level start/end times are not returned there.

Timestamp outputs in seconds are available when using model.transcribe() or the transcribe_speech.py script.

If continuing with the parallel script, timestamps can be reconstructed from the offsets:
for v3:
example: