Merged
4 changes: 2 additions & 2 deletions .github/workflows/bnlp-publish-auto.yml
@@ -21,9 +21,9 @@ jobs:
     runs-on: ubuntu-latest

     steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@v4
     - name: Set up Python
-      uses: actions/setup-python@v3
+      uses: actions/setup-python@v5
       with:
         python-version: '3.x'
     - name: Install dependencies
11 changes: 4 additions & 7 deletions .github/workflows/build_and_test.yml
@@ -5,7 +5,7 @@ name: Building and Testing

 on:
   push:
-    branches: [ "main" ]
+    branches: [ "main", "upgrade_for_python12" ]
   pull_request:
     branches: [ "main" ]

@@ -17,15 +17,12 @@ jobs:
       fail-fast: false
       matrix:
         os: [ubuntu-latest, macos-latest, windows-latest]
-        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
-        exclude:
-          - os: ubuntu-latest
-            python-version: "3.6"
+        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]

     steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@v4
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v3
+      uses: actions/setup-python@v5
       with:
         python-version: ${{ matrix.python-version }}
     - name: Install dependencies
2 changes: 1 addition & 1 deletion README.md
@@ -42,7 +42,7 @@ BNLP is a natural language processing toolkit for Bengali Language. This tool wi
 ```
 pip install -U bnlp_toolkit
 ```
-- Python: 3.8, 3.9, 3.10, 3.11
+- Python: 3.8, 3.9, 3.10, 3.11, 3.12, 3.13
 - OS: Linux, Windows, Mac

 ### Build from source
4 changes: 2 additions & 2 deletions bnlp/cleantext/clean.py
@@ -8,7 +8,7 @@

 from ftfy import fix_text
 from unicodedata import category, normalize
-from emoji import UNICODE_EMOJI, demojize, emojize
+import emoji

 def fix_bad_unicode(text, normalization="NFC"):
     return fix_text(text, normalization=normalization)
@@ -51,7 +51,7 @@ def remove_substrings(text, to_replace, replace_with=""):
     return result

 def remove_emoji(text):
-    return remove_substrings(text, UNICODE_EMOJI["en"])
+    return emoji.replace_emoji(text, replace="")

 def remove_number_or_digit(text, replace_with=""):
     return re.sub(constants.BANGLA_DIGIT_REGEX, replace_with, text)
1 change: 0 additions & 1 deletion bnlp/embedding/glove.py
@@ -1,4 +1,3 @@
-import scipy
 import numpy as np
 from typing import List
 from scipy import spatial
2 changes: 0 additions & 2 deletions bnlp/embedding/word2vec.py
@@ -1,8 +1,6 @@
 #!/usr/bin/env python
 # -*- coding: utf-8 -*-

-from __future__ import print_function
-
 import warnings
 warnings.filterwarnings("ignore")

21 changes: 5 additions & 16 deletions bnlp/tokenizer/basic.py
@@ -4,28 +4,17 @@
 Code shamelessly copied from BERT tokenization
 To check Original Code: https://github.com/google-research/bert/blob/master/tokenization.py
 """
-import six
 import unicodedata
 from typing import List

 def convert_to_unicode(text):
     """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
-    if six.PY3:
-        if isinstance(text, str):
-            return text
-        elif isinstance(text, bytes):
-            return text.decode("utf-8", "ignore")
-        else:
-            raise ValueError("Unsupported string type: %s" % (type(text)))
-    elif six.PY2:
-        if isinstance(text, str):
-            return text.decode("utf-8", "ignore")
-        elif isinstance(text, unicode):
-            return text
-        else:
-            raise ValueError("Unsupported string type: %s" % (type(text)))
+    if isinstance(text, str):
+        return text
+    elif isinstance(text, bytes):
+        return text.decode("utf-8", "ignore")
     else:
-        raise ValueError("Not running on Python2 or Python 3?")
+        raise ValueError("Unsupported string type: %s" % (type(text)))


 def whitespace_tokenize(text: str) -> List[str]:
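Review note on `bnlp/tokenizer/basic.py`: with the `six` branches removed, `convert_to_unicode` handles only `str` and `bytes` and raises for anything else. A minimal standalone sketch of that behavior, assuming Python 3 only; the sample inputs are illustrative and not taken from the PR:

```python
def convert_to_unicode(text):
    """Return `text` as str; UTF-8 bytes are decoded, invalid sequences dropped."""
    if isinstance(text, str):
        return text
    elif isinstance(text, bytes):
        return text.decode("utf-8", "ignore")
    else:
        raise ValueError("Unsupported string type: %s" % (type(text)))


# Illustrative inputs (assumptions, not part of the PR).
sample = "বাংলা"
assert convert_to_unicode(sample) == sample                   # str passes through
assert convert_to_unicode(sample.encode("utf-8")) == sample   # bytes are decoded
# The "ignore" error handler silently drops bytes that are not valid UTF-8.
assert convert_to_unicode(b"abc\xff") == "abc"
```

Because invalid bytes are dropped rather than reported, callers that need to detect corrupted input would have to validate it before calling this helper.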
2 changes: 1 addition & 1 deletion docs/README.md
@@ -59,7 +59,7 @@ Table of contents
 ```
 pip install -U bnlp_toolkit
 ```
-- Python: 3.6, 3.7, 3.8, 3.9, 3.10
+- Python: 3.8, 3.9, 3.10, 3.11, 3.12, 3.13
 - OS: Linux, Windows, Mac


29 changes: 15 additions & 14 deletions docs/REFACTORING_ANALYSIS.md
@@ -443,25 +443,26 @@ $ bnlp download all

 | Dependency | Pinned | Latest | Risk |
 |------------|--------|--------|------|
-| `scipy==1.10.1` | Yes | 1.11+ | Security patches |
-| `gensim==4.3.2` | Yes | 4.3.3+ | Bug fixes |
-| `emoji==1.7.0` | Yes | 2.x | Breaking changes (API changed) |
-| `sklearn-crfsuite==0.3.6` | Yes | 0.5+ | Compatibility |
+| `scipy>=1.11.0` | No | 1.13+ | Updated for Python 3.12+ |
+| `gensim>=4.3.3` | No | 4.4+ | Updated for scipy compatibility |
+| `emoji>=2.0.0` | No | 2.15+ | Updated, code migrated to new API |
+| `sklearn-crfsuite>=0.5.0` | No | 0.5+ | Updated for Python 3.12+ |

 ### 6.2 Recommended Approach

 ```python
 install_requires=[
-    "sentencepiece>=0.2.0,<0.3.0",
-    "gensim>=4.3.0,<5.0.0",
-    "nltk>=3.8",
-    "numpy>=1.21",
-    "scipy>=1.10.0,<2.0.0",
-    "sklearn-crfsuite>=0.3.6,<1.0.0",
-    "tqdm>=4.60.0",
-    "ftfy>=6.0.0",
-    "emoji>=1.7.0,<2.0.0", # Note: emoji 2.x has breaking changes
-    "requests>=2.25.0",
+    "sentencepiece>=0.2.0",
+    "gensim>=4.3.3",
+    "nltk",
+    "numpy",
+    "scipy>=1.11.0",
+    "sklearn-crfsuite>=0.5.0",
+    "tqdm>=4.66.3",
+    "ftfy>=6.2.0",
+    "emoji>=2.0.0",
+    "requests",
+    "symspellpy>=6.7.0",
 ],
 ```

2 changes: 1 addition & 1 deletion docs/index.rst
@@ -112,7 +112,7 @@ PIP installer

   pip install -U bnlp_toolkit

-- Python: 3.6, 3.7, 3.8, 3.9, 3.10
+- Python: 3.8, 3.9, 3.10, 3.11, 3.12, 3.13
 - OS: Linux, Windows, Mac

 Pretrained Model
14 changes: 7 additions & 7 deletions requirements.txt
@@ -1,11 +1,11 @@
-sentencepiece==0.2.0
-gensim==4.3.2
+sentencepiece>=0.2.0
+gensim>=4.3.3
 numpy
-scipy==1.10.1
-sklearn-crfsuite==0.3.6
-tqdm==4.66.3
-ftfy==6.2.0
-emoji==1.7.0
+scipy>=1.11.0
+sklearn-crfsuite>=0.5.0
+tqdm>=4.66.3
+ftfy>=6.2.0
+emoji>=2.0.0
 requests
 nltk
 symspellpy>=6.7.0
18 changes: 9 additions & 9 deletions setup.py
@@ -4,7 +4,7 @@

 setuptools.setup(
     name="bnlp_toolkit",
-    version="4.4.0",
+    version="4.4.1",
     author="Sagor Sarker",
     author_email="sagorhem3532@gmail.com",
     description="BNLP is a natural language processing toolkit for Bengali Language",
@@ -18,17 +18,17 @@
         "License :: OSI Approved :: MIT License",
         "Operating System :: OS Independent",
     ],
-    python_requires=">=3.6",
+    python_requires=">=3.8",
     install_requires=[
-        "sentencepiece==0.2.0",
-        "gensim==4.3.2",
+        "sentencepiece>=0.2.0",
+        "gensim>=4.3.3",
         "nltk",
         "numpy",
-        "scipy==1.10.1",
-        "sklearn-crfsuite==0.3.6",
-        "tqdm==4.66.3",
-        "ftfy==6.2.0",
-        "emoji==1.7.0",
+        "scipy>=1.11.0",
+        "sklearn-crfsuite>=0.5.0",
+        "tqdm>=4.66.3",
+        "ftfy>=6.2.0",
+        "emoji>=2.0.0",
         "requests",
         "symspellpy>=6.7.0",
     ],
Empty file added tests/__init__.py
Empty file.
Empty file added tests/embedding/__init__.py
Empty file.
Empty file.
Empty file added tests/tokenizer/__init__.py
Empty file.