Lazy Tokenization of Meta Type Features Causes Inconsistent Encoding in Block Dataset Building

**Problem Description**:
When a dataset is built in blocks, "meta" type features are tokenized lazily, block by block. As a result there is no single global token-to-ID mapping: the same token can receive different IDs in different blocks, which causes data mismatches in downstream tasks.

**Context**:
In `fuxictr/preprocess/build_dataset.py`, the `build_dataset` function calls `transform`, which processes data in blocks when `block_size > 0`. Each block is processed in parallel using `transform_block`.

**Issue Details**:
The problem occurs in `fuxictr/preprocess/feature_processor.py` during the transformation of "meta" type features. For each block:
1. The tokenizer's `encode_meta` method is called
2. The vocabulary is built (or updated) from that block's values alone, i.e. **per block**
3. The block data is encoded using this block-specific vocabulary
4. The encoded block is saved to disk

This results in:
- Different blocks having different token-to-ID mappings for the same meta feature
- No consistent global vocabulary across the entire dataset
- Potential mismatches when aggregating or analyzing data across blocks
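
To make the effect concrete, here is a minimal sketch in plain Python. `ToyTokenizer` is a hypothetical stand-in written for this illustration, not FuxiCTR's tokenizer, and `copy.deepcopy` is used on the assumption that each worker ends up with its own copy of the encoder (arguments passed to `pool.apply_async` are pickled and sent to the worker process).

```python
import copy


class ToyTokenizer:
    """Hypothetical stand-in for the real tokenizer (illustration only)."""

    def __init__(self):
        self.vocab = {}

    def encode_meta(self, values):
        # Lazily create the vocab on first use, then extend it with
        # whatever tokens this particular block happens to contain.
        if not self.vocab:
            self.vocab = {"__OOV__": 0}
        for token in values:
            if token not in self.vocab:
                self.vocab[token] = len(self.vocab)
        return [self.vocab[v] for v in values]


template = ToyTokenizer()

# Assumption: each worker operates on an independent copy of the tokenizer.
tok_a = copy.deepcopy(template)
tok_b = copy.deepcopy(template)

ids_a = tok_a.encode_meta(["u1", "u2", "u1"])  # block A
ids_b = tok_b.encode_meta(["u3", "u1"])        # block B

print(tok_a.vocab["u1"], tok_b.vocab["u1"])  # "u1" gets a different ID per block
```

In this toy run, `u1` is encoded as 1 in block A and as 2 in block B, while ID 1 in block B refers to `u3`. Once the encoded blocks are on disk, the IDs can no longer be compared across blocks.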

**Code References**:
1. **Block Processing** (`build_dataset.py`):
```python
def transform_block(feature_encoder, df_block, filename):
    df_block = feature_encoder.transform(df_block)  # Problem occurs here
    # ... save to parquet

def transform(feature_encoder, ddf, filename, block_size=0):
    if block_size > 0:
        # Process each block independently
        pool.apply_async(transform_block, ...)  # Each block gets its own vocab
```

2. **Meta Feature Encoding** (`feature_processor.py`):
```python
# feature encoder's transform, here the ddf is block ddf
def transform(self, ddf):
    for feature, feature_spec in self.feature_map.features.items():
        if feature_type == "meta":
            tokenizer = self.processor_dict[feature + "::tokenizer"]
            ddf[feature] = tokenizer.encode_meta(col_series)  # Updates vocab per block
```

3. **Tokenizer Implementation** (`tokenizer.py`):
```python
def encode_meta(self, series):
    word_counts = dict(series.value_counts())
    if len(self.vocab) == 0:
        self.build_vocab(word_counts)
    else:  # update vocab with block data
        self.update_vocab(word_counts.keys())
    series = series.map(lambda x: self.vocab.get(x, self.vocab["__OOV__"]))
    return series.values
```


**Suggested Solutions**:
Pre-build a global vocabulary before block processing, so that every block is encoded against the same fixed token-to-ID mapping.
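
One possible shape for this, as a rough sketch rather than a patch against FuxiCTR: make one pass over all blocks to collect token counts, build the vocabulary once, then encode each block read-only. The names `build_global_vocab` and `encode_block`, the `min_count` parameter, and the choice of ID 0 for `__OOV__` are all hypothetical and chosen only for this example.

```python
from collections import Counter

OOV = "__OOV__"


def build_global_vocab(blocks, min_count=1):
    """Count tokens across ALL blocks, then assign IDs exactly once."""
    counts = Counter()
    for block in blocks:
        counts.update(block)
    vocab = {OOV: 0}
    # Sort by descending count, then by token, so the mapping is
    # deterministic regardless of the order in which blocks are read.
    for token, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode_block(block, vocab):
    """Encode a block without mutating the vocab (safe to run in parallel)."""
    return [vocab.get(v, vocab[OOV]) for v in block]


blocks = [["u1", "u2", "u1"], ["u3", "u1"]]
vocab = build_global_vocab(blocks)
encoded = [encode_block(b, vocab) for b in blocks]
print(vocab)
print(encoded)
```

Because `encode_block` only reads from `vocab`, copies of it handed to worker processes stay identical, and the same token maps to the same ID in every block. The cost is an extra counting pass over the data before encoding starts.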

Lazy Tokenization of Meta Type Features Causes Inconsistent Encoding in Block Dataset Building #164
