@GreatHalf thanks for flagging this and apologies for the delayed response.

The behavior comes from transcribe_speech_parallel.py. That script uses PyTorch Lightning’s predict() for multi-GPU / multi-node inference, and the timestamp utilities are not wired into that path. As a result, word-level start/end times are not returned there.

Timestamp outputs in seconds are available when using model.transcribe() or the transcribe_speech.py script.

If you continue with the parallel script, timestamps can be reconstructed from the offsets:

start_sec = start_offset * window_stride * model_subsampling_factor

For v3:

window_stride = 0.01
model_subsampling_factor = 8

Example:

start_offset = 13
start_…
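The conversion above can be sketched as a small helper. The constants are the v3 values quoted in this thread; for other models they should come from the model's config (the function and constant names here are just illustrative, not NeMo API):

```python
# Convert a decoder frame offset to seconds, per the formula above:
#   start_sec = start_offset * window_stride * model_subsampling_factor
# Values below are the v3 numbers from this thread.
WINDOW_STRIDE = 0.01          # feature hop in seconds (10 ms)
SUBSAMPLING_FACTOR = 8        # encoder subsampling factor for v3

def offset_to_sec(offset: int,
                  window_stride: float = WINDOW_STRIDE,
                  subsampling: int = SUBSAMPLING_FACTOR) -> float:
    """Map a token's frame offset to a time in seconds."""
    return offset * window_stride * subsampling

# Worked example from above: 13 * 0.01 * 8 = 1.04 seconds
print(offset_to_sec(13))
```

The same formula applies to end offsets, giving word-level end times as well.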

Answer selected by GreatHalf