Extract features from different layers of a SSL model #15408

@gabitza-tech

Description

Hello!

I would like to use an SSL model from this repo: https://github.com/Open-Speech-EkStep/vakyansh-models, trained with NeMo, to align Hindi wavs, i.e. to obtain alignments between spoken words using continuous features extracted with the SSL model (similar to using a HuBERT model).

I would like to extract features from different layers, not just the last one as I do currently:

# Forward pass through the SSL model
with torch.no_grad():
    _, _, feat, feat_len = ssl_model.forward(
        input_signal=wav,
        input_signal_length=torch.tensor([wav.shape[-1]]).to(device),
    )
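One pattern I have been considering for getting features from the intermediate layers (a sketch, not a confirmed NeMo API: it assumes the model's encoder keeps its blocks in an `encoder.layers` `nn.ModuleList`, as NeMo's ConformerEncoder does) is registering PyTorch forward hooks on each block. The toy encoder below stands in for the real `ssl_model.encoder`; the hook pattern itself is unchanged for the real model:

```python
import torch
import torch.nn as nn

# Toy stand-in for ssl_model.encoder: a stack of blocks in an nn.ModuleList,
# mirroring how NeMo's ConformerEncoder exposes its layers.
class ToyEncoder(nn.Module):
    def __init__(self, dim=16, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

encoder = ToyEncoder()

captured = {}  # layer index -> hidden states seen during the forward pass

def make_hook(idx):
    def hook(module, inputs, output):
        # Real Conformer blocks may return a tuple; keep only the tensor part.
        captured[idx] = output[0] if isinstance(output, tuple) else output
    return hook

# Attach one forward hook per encoder block.
handles = [layer.register_forward_hook(make_hook(i))
           for i, layer in enumerate(encoder.layers)]

with torch.no_grad():
    x = torch.randn(1, 100, 16)  # (batch, time, feature)
    _ = encoder(x)

# Detach the hooks once the features have been collected.
for h in handles:
    h.remove()

print(sorted(captured))      # indices of the captured layers
print(captured[2].shape)     # hidden states from the third block
```

For the real model one would run the usual `ssl_model.forward(...)` call between registering and removing the hooks, then read per-layer features out of `captured`.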

Extracting them like this yields very strange results when I compute a similarity matrix for alignment:

[Image: similarity matrix]
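For reference, this is roughly how I build the similarity matrix (a minimal sketch with random tensors standing in for the extracted features; the real inputs would be two `(T, D)` feature sequences from the SSL model):

```python
import torch
import torch.nn.functional as F

# Placeholder feature sequences: (frames, feature_dim).
a = torch.randn(50, 256)
b = torch.randn(60, 256)

# L2-normalize each frame so the dot product is cosine similarity.
a_n = F.normalize(a, dim=-1)
b_n = F.normalize(b, dim=-1)

# (50, 60) frame-by-frame cosine similarity matrix.
sim = a_n @ b_n.T
```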

It clearly doesn't capture good phonetic information, but I am also worried that I am not extracting the features correctly from the encoder. (P.S. I also apply VAD to the audios.)

Does anyone have experience with this? Any suggestions would be greatly appreciated!!!
