Research
To extract keywords from user query input, we will use pre-trained Large Language Models (LLMs) that are specifically designed for natural language understanding (NLU) tasks like keyword extraction, intent detection, and entity recognition.
Below are some specific LLMs and tools I considered using, along with their tradeoffs.
I decided to fine-tune a pre-trained model (e.g., BERT, GPT) on a dataset of book-related queries.
Example Dataset:
[
{
"query": "Find me a sci-fi book about space exploration.",
"keywords": ["sci-fi", "space exploration"]
},
{
"query": "I want a romance novel set in Italy.",
"keywords": ["romance", "Italy"]
}
]
- When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively arxiv
- Large Language Models for Information Retrieval arxiv
- Query Reformulation for Dynamic Information Integration sci-hub
- Analyzing and evaluating query reformulation strategies in web search logs (https://dl.acm.org/doi/abs/10.1145/1645953.1645966?casa_token=HARxSaPwK6QAAAAA:TBTQ4LyQO_34D_OikO6qyQx2ZrKZCotNyFApundsVYMDH3UrT6B7cFRVJAVNR08sBp7iSetubBy8)
- BM25 algorithm
- Webgpt: Browser-assisted question-answering with human feedback arxiv
- Improving language models by retrieving from trillions of tokens International Conference on Machine Learning
- MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition arxiv
- SemEval-2022 Task 11 on Multilingual Complex Named Entity Recognition (MultiCoNER) acl
- Generalisation in named entity recognition: A quantitative analysis
- COGCOMPNLP: Your Swiss Army Knife for NLP (https://aclanthology.org/L18-1086.pdf): a library that simplifies the design and development of NLP applications.
- http://github.com/CogComp/cogcomp-nlp
- search for text modules
- Query analysis with structural templates Apple
- Comparative Analysis of Neural QA models on SQuAD
- CoNLL03 (Sang and De Meulder, 2003) - Language-Independent Named Entity Recognition PDF
Search Queries Dataset: ORCAS Dataset (Craswell et al., 2020)
- Human generated machine reading comprehension dataset
- Entity types (e.g., creative works) can be linguistically complex. They can be complex noun phrases (Eternal Sunshine of the Spotless Mind), gerunds (Saving Private Ryan), infinitives (To Kill a Mockingbird), or full clauses (Mr. Smith Goes to Washington). Syntactic parsing of such nouns is hard, and most current parsers and NER systems fail to recognize them.
- MultiCoNER (WNUT Taxonomy Entity Types)
- Creative work entities: CREATIVE-WORK (CW; movie, song, and book titles)
The objective is to fine-tune a BERT model for Named Entity Recognition (NER) for book search. This will require datasets containing labeled entities (e.g., TITLE, AUTHOR, GENRE).
Below is the approach I used to find, curate, and prepare the best dataset for our use case.
I searched for existing NER datasets that contain book-related entities, prioritizing datasets labeled with titles, authors, genres, places, and publishers.
https://huggingface.co/datasets
- bookcorpus - https://huggingface.co/datasets/bookcorpus/bookcorpus
- Web Scraping: https://github.com/BIGBALLON/cifar-10-cnn
- Kaggle: Search for NER datasets for books, literature, or authors
- Google Dataset Search
| Entity Label | Description | OpenLibraryBook Field |
|---|---|---|
| TITLE | Book title | title |
| AUTHOR | Author name | author_name, author_key |
| GENRE | Book subject/category | subject |
| PUBLISHER | Publisher | publisher |
| FORMAT | Book format (e.g., Paperback) | format |
| LANGUAGE | Language of the book | language |
| PLACE | Book setting or publishing location | place, publish_place |
| CHARACTER | Important book characters | person |
| YEAR | First publication year | first_publish_year, publish_year |
| ISBN | ISBN identifier | isbn |
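The mapping in the table above can be turned into a small lookup that builds a search request from extracted entities. The following is a minimal sketch, not the project's actual code: the names `ENTITY_TO_FIELD` and `build_search_url` are hypothetical, it assumes the OpenLibrary search endpoint accepts Solr-style `field:"value"` clauses in the `q` parameter, and where the table lists two fields it arbitrarily picks the first.

```python
from urllib.parse import urlencode

# Entity label -> OpenLibrary field, copied from the table above.
# Assumption: when the table lists two fields, only the first is used.
ENTITY_TO_FIELD = {
    "TITLE": "title",
    "AUTHOR": "author_name",
    "GENRE": "subject",
    "PUBLISHER": "publisher",
    "FORMAT": "format",
    "LANGUAGE": "language",
    "PLACE": "place",
    "CHARACTER": "person",
    "YEAR": "first_publish_year",
    "ISBN": "isbn",
}

SEARCH_ENDPOINT = "https://openlibrary.org/search.json"


def build_search_url(entities: dict[str, str], page: int = 1) -> str:
    """Turn extracted entities into a fielded search URL (no request is sent)."""
    clauses = []
    for label, value in entities.items():
        field = ENTITY_TO_FIELD.get(label)
        if field is None:
            continue  # ignore labels the table does not cover
        clauses.append(f'{field}:"{value}"')
    query = " AND ".join(clauses)
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query, 'page': page})}"


url = build_search_url({"GENRE": "mystery", "AUTHOR": "Agatha Christie"})
print(url)
```

Keeping the mapping in one dictionary means a new entity label only needs one new entry, and the URL builder can be unit-tested without network access.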
| User Query | Labeled Entities |
|---|---|
| "Find me a **mystery novel** by **Agatha Christie**" | GENRE: "mystery novel", AUTHOR: "Agatha Christie" |
| "Books published in **France** in **1990**" | PLACE: "France", YEAR: "1990" |
| "Show books written in **Spanish**" | LANGUAGE: "Spanish" |
| "Find **science fiction** books set in **Mars**" | GENRE: "science fiction", PLACE: "Mars" |
| "Give me books by **J.K. Rowling** published by **Scholastic**" | AUTHOR: "J.K. Rowling", PUBLISHER: "Scholastic" |
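NER fine-tuning usually expects one tag per token rather than a list of labeled phrases, so the labeled queries above need converting. The sketch below shows one possible conversion to BIO tags (B- for the first token of an entity, I- for the rest, O for everything else). The function name `bio_tags` is hypothetical, tokenization is plain whitespace splitting, and aligning these tags with BERT's subword tokens would be a separate step not shown here.

```python
PUNCT = '.,!?"'


def bio_tags(query: str, entities: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Label each whitespace token of `query` with a BIO tag."""
    tokens = query.split()
    tags = ["O"] * len(tokens)
    # Match on lowercase tokens with surrounding punctuation stripped.
    norm = [t.strip(PUNCT).lower() for t in tokens]
    for label, text in entities:
        span = [w.strip(PUNCT).lower() for w in text.split()]
        if not span:
            continue  # nothing to match for an empty entity
        for start in range(len(tokens) - len(span) + 1):
            if norm[start:start + len(span)] == span:
                tags[start] = f"B-{label}"
                for i in range(start + 1, start + len(span)):
                    tags[i] = f"I-{label}"
                break  # tag only the first occurrence
    return list(zip(tokens, tags))


example = bio_tags(
    "Find me a mystery novel by Agatha Christie",
    [("GENRE", "mystery novel"), ("AUTHOR", "Agatha Christie")],
)
for token, tag in example:
    print(f"{token}\t{tag}")
```

For the first row of the table this tags "mystery novel" as B-GENRE, I-GENRE and "Agatha Christie" as B-AUTHOR, I-AUTHOR, with every other token tagged O.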
npx create-react-app frontend --use-yarn --template cra-template --skip-install
yarn add swagger-cli --dev
Revalidate the openapi.yaml file whenever the backend changes:
swagger-cli validate openapi.yaml
fastapi dev app/main.py
http://127.0.0.1:8000/docs
http://127.0.0.1:8000/redoc
yarn add typescript @types/react @types/react-dom
ngs&page=2
https://openlibrary.org/search/authors.json?q=twain
+----------------------------------------------------+
| [Book Cover] "The Lord of the Rings"               |
|              by J.R.R. Tolkien                     |
|----------------------------------------------------|
| 250 Editions   1193 Pages   Audio Available        |
| Available in: EN, FR, SP, IT, DE...                |
| First Published: 1954                              |
| Subjects: Fantasy, Middle-Earth, Magic             |
|----------------------------------------------------|
| Read Online   Borrow   Add to Favorites            |
+----------------------------------------------------+
- Use auto-generated tags from subjects & characters.
- Clicking a tag refines search results dynamically.
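One way the tag behavior described above could work on the backend is sketched here. It assumes each search result is a dictionary with optional `subject` and `person` lists, matching the field names in the entity table. The function names and the sample records are made up for illustration.

```python
from collections import Counter


def generate_tags(docs: list[dict], limit: int = 10) -> list[str]:
    """Collect the most frequent subject and character values as tags."""
    counts = Counter()
    for doc in docs:
        for field in ("subject", "person"):
            counts.update(doc.get(field, []))
    return [tag for tag, _ in counts.most_common(limit)]


def refine_by_tag(docs: list[dict], tag: str) -> list[dict]:
    """Keep only the results that carry the clicked tag."""
    return [
        doc for doc in docs
        if tag in doc.get("subject", []) or tag in doc.get("person", [])
    ]


# Illustrative records only, not real API output.
sample_docs = [
    {"title": "The Lord of the Rings", "subject": ["Fantasy", "Magic"],
     "person": ["Frodo Baggins"]},
    {"title": "The Hobbit", "subject": ["Fantasy"], "person": ["Bilbo Baggins"]},
]

print(generate_tags(sample_docs))
print([doc["title"] for doc in refine_by_tag(sample_docs, "Magic")])
```

Filtering the already-fetched results keeps tag clicks responsive. Issuing a new search with the tag added to the query would be the alternative when results span several pages.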
- User Query: "Find me a mystery novel set in Paris with a unique protagonist."
- Backend Processing
  - LLM extracts keywords: genre=mystery, location=Paris, character=unique protagonist.
  - OpenLibrary API fetches books matching these criteria.
  - LLM summarizes the book descriptions and generates a natural language response.
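The three backend steps above can be wired together as a small pipeline. This is a sketch of the control flow only: `answer_query` and the three `fake_*` functions are hypothetical stand-ins, and in a real backend the extractor and summarizer would call the LLM and the search step would call the OpenLibrary API.

```python
from typing import Callable

Keywords = dict[str, str]


def answer_query(
    query: str,
    extract: Callable[[str], Keywords],
    search: Callable[[Keywords], list[dict]],
    summarize: Callable[[str, list[dict]], str],
) -> str:
    """Run the three backend steps in order."""
    keywords = extract(query)        # step 1: LLM keyword extraction
    books = search(keywords)         # step 2: OpenLibrary lookup
    return summarize(query, books)   # step 3: natural language response


# Stand-ins so the flow can be exercised without an LLM or network access.
def fake_extract(query: str) -> Keywords:
    return {"genre": "mystery", "location": "Paris",
            "character": "unique protagonist"}


def fake_search(keywords: Keywords) -> list[dict]:
    return [{"title": "Example Mystery", "subject": [keywords["genre"]]}]


def fake_summarize(query: str, books: list[dict]) -> str:
    titles = ", ".join(book["title"] for book in books)
    return f"Found {len(books)} book(s): {titles}"


reply = answer_query(
    "Find me a mystery novel set in Paris with a unique protagonist.",
    fake_extract, fake_search, fake_summarize,
)
print(reply)
```

Passing the three steps in as callables keeps each one replaceable, so the keyword extractor could be swapped from a prompt-based LLM to the fine-tuned NER model without touching the rest of the flow.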