In machine.py, we do something like this, so that only scores for non-special tokens are bubbled up:
```python
for item in zipped:
    output_ids, scores, sequence_score, attentions = cast(
        Tuple[torch.Tensor, torch.Tensor, Optional[float], Optional[torch.Tensor]], item
    )
    output_tokens: List[str] = []
    output_indices: List[int] = []
    for i, output_id in enumerate(output_ids):
        id = cast(int, output_id.item())
        if id not in all_special_ids:
            output_tokens.append(self.tokenizer.convert_ids_to_tokens(id))
            output_indices.append(i)
    scores = scores[output_indices]
```
In silnlp, we do something similar downstream in hugging_face_config.py:translate().
However, there we take the sequence_scores directly from the model outputs, and those scores appear to include the score of the BOS token, which is close to 0. This presumably biases the sequence score slightly: the shorter the output sequence, the more weight the near-zero BOS score carries, pulling the overall score closer to zero.
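To see the bias concretely, here is a toy calculation (the per-token log-probabilities below are made up for illustration; they are not taken from machine.py or silnlp). A forced BOS token has log-probability near 0, so including it in a length-normalized mean pulls the score toward zero, and more so for shorter sequences:

```python
import torch

# Illustrative per-token log-probabilities for one generated sequence.
# Position 0 is the BOS token, whose log-prob is ~0 because the model
# is forced to emit it.
token_logprobs = torch.tensor([-0.0001, -1.2, -0.8, -1.5])
special_mask = torch.tensor([True, False, False, False])  # BOS is special

with_bos = token_logprobs.mean().item()              # biased toward zero
without_bos = token_logprobs[~special_mask].mean().item()

# with_bos is closer to zero than without_bos, i.e. the sequence looks
# more probable than it should when the BOS score is included.
```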
We should confirm that these special token scores are being included in the sequence score and then update silnlp accordingly if they are (or maybe even consider submitting an issue in transformers).
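If the special-token scores do turn out to be included, one possible fix on the silnlp side is to recompute the sequence score from the per-token scores while masking out special-token positions, rather than trusting sequence_scores. A minimal sketch of that helper (the function name and toy values are hypothetical; in transformers the per-position log-probs can be obtained via GenerationMixin.compute_transition_scores):

```python
import torch
from typing import Sequence, Set

def masked_sequence_score(
    token_ids: Sequence[int],
    token_scores: torch.Tensor,
    special_ids: Set[int],
) -> float:
    """Mean log-prob over non-special tokens only.

    token_scores[i] is assumed to be the log-prob of token_ids[i], as
    returned per position by compute_transition_scores.
    """
    keep = torch.tensor([t not in special_ids for t in token_ids])
    kept = token_scores[keep]
    # Fall back to -inf if every position was a special token.
    return kept.mean().item() if kept.numel() else float("-inf")
```

With token ids [0, 5, 7, 2], scores [-0.0001, -1.2, -0.8, -0.9], and special ids {0, 2} (BOS/EOS), this averages only the two real tokens, giving -1.0 instead of the BOS-inflated mean.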