Research

Pre-Trained Large Language Models (LLMs)

To extract keywords from user query input, we will use pre-trained Large Language Models (LLMs) that are specifically designed for natural language understanding (NLU) tasks like keyword extraction, intent detection, and entity recognition.

Below are some specific LLMs and tools I considered using, along with their tradeoffs.

Open-Source Models

I decided to fine-tune a pre-trained model (e.g., BERT, GPT) on a dataset of book-related queries.

Example Dataset:

[
  {
    "query": "Find me a sci-fi book about space exploration.",
    "keywords": ["sci-fi", "space exploration"]
  },
  {
    "query": "I want a romance novel set in Italy.",
    "keywords": ["romance", "Italy"]
  }
]
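Before fine-tuning on records like these, it can help to confirm that every keyword occurs verbatim in its query, since span-based labels are usually derived from character offsets. The following is a minimal sketch of such a check; the `keyword_spans` helper is hypothetical and not part of any library, and the two records are copied from the example above.

```python
import json

# Two records copied from the example dataset above.
RAW = """
[
  {"query": "Find me a sci-fi book about space exploration.",
   "keywords": ["sci-fi", "space exploration"]},
  {"query": "I want a romance novel set in Italy.",
   "keywords": ["romance", "Italy"]}
]
"""


def keyword_spans(record):
    """Return (keyword, start, end) character offsets for one record.

    Hypothetical helper: raises ValueError when a keyword does not occur
    verbatim in the query, which usually signals a labeling mistake.
    """
    query = record["query"]
    spans = []
    for kw in record["keywords"]:
        # Case-insensitive search for the first occurrence only.
        start = query.lower().find(kw.lower())
        if start == -1:
            raise ValueError(f"keyword {kw!r} not found in {query!r}")
        spans.append((kw, start, start + len(kw)))
    return spans


records = json.loads(RAW)
all_spans = [keyword_spans(r) for r in records]
```

A record that fails this check is probably better fixed by hand than silently dropped, because small datasets lose a lot from each discarded example.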

Scientific Articles

When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively arxiv
Large Language Models for Information Retrieval arxiv
Query Reformulation for Dynamic Information Integration sci-hub
Analyzing and evaluating query reformulation strategies in web search logs (https://dl.acm.org/doi/abs/10.1145/1645953.1645966)
BM25 algorithm
Webgpt: Browser-assisted question-answering with human feedback arxiv
Improving language models by retrieving from trillions of tokens International Conference on Machine Learning
MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition arxiv
SemEval-2022 Task 11 on Multilingual Complex Named Entity Recognition (MultiCoNER) acl
Generalisation in named entity recognition: A quantitative analysis
COGCOMPNLP: Your Swiss Army Knife for NLP (https://aclanthology.org/L18-1086.pdf): library used to simplify design and development process of NLP applications.
- http://github.com/CogComp/cogcomp-nlp
- search for text modules
Query analysis with structural templates Apple
Comparative Analysis of Neural QA models on SQuAD

NER Datasets

CoNLL03 (Sang and De Meulder, 2003) - Language-Independent Named Entity Recognition PDF

Findings

Search Queries Dataset: ORCAS Dataset (Craswell et al., 2020)

Human generated machine reading comprehension dataset
Entity types (e.g., creative works) can be linguistically complex. They can be complex noun phrases (Eternal Sunshine of the Spotless Mind), gerunds (Saving Private Ryan), infinitives (To Kill a Mockingbird), or full clauses (Mr. Smith Goes to Washington). Syntactic parsing of such names is hard, and most current parsers and NER systems fail to recognize them.
MULTICONER (WNUT Taxonomy Entity Types)
creative work entities (CREATIVE-WORK (CW, movie/song/book titles))

Fine-Tuning

The objective is to fine-tune a BERT model for Named Entity Recognition (NER) for book search. This will require datasets containing labeled entities (e.g., TITLE, AUTHOR, GENRE).

Below is the approach I used for finding, curating, and preparing the best dataset for our use case.

Search for Pre-Labeled NER Datasets

I searched for existing NER datasets that contain book-related entities, prioritizing datasets labeled with titles, authors, genres, places, and publishers.

Hugging Face Datasets

https://huggingface.co/datasets

bookcorpus - https://huggingface.co/datasets/bookcorpus/bookcorpus
https://www.smashwords.com/
Web Scraping: https://github.com/BIGBALLON/cifar-10-cnn
Kaggle: Search for NER datasets for books, literature, or authors
Google Dataset Search

Named Entity Recognition for Book Search

Primary Entities to Extract

Entity Label	Description	OpenLibraryBook Field
`TITLE`	Book title	`title`
`AUTHOR`	Author name	`author_name`, `author_key`
`GENRE`	Book subject/category	`subject`
`PUBLISHER`	Publisher	`publisher`
`FORMAT`	Book format (e.g., Paperback)	`format`
`LANGUAGE`	Language of the book	`language`
`PLACE`	Book setting or publishing location	`place`, `publish_place`
`CHARACTER`	Important book characters	`person`
`YEAR`	First publication year	`first_publish_year`, `publish_year`
`ISBN`	ISBN identifier	`isbn`
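One way to use the table above is as a lookup from extracted entity labels to search fields. The sketch below builds a search URL from a `{label: value}` dictionary. It rests on two assumptions that are not confirmed by these notes: that the `https://openlibrary.org/search.json` endpoint is the one queried, and that it accepts `field:"value"` clauses in its `q` parameter. Where the table lists two fields, only the first is used, and `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Entity label -> first OpenLibrary field listed in the table above.
ENTITY_TO_FIELD = {
    "TITLE": "title",
    "AUTHOR": "author_name",
    "GENRE": "subject",
    "PUBLISHER": "publisher",
    "FORMAT": "format",
    "LANGUAGE": "language",
    "PLACE": "place",
    "CHARACTER": "person",
    "YEAR": "first_publish_year",
    "ISBN": "isbn",
}

# Assumption: this is the search endpoint the backend talks to.
SEARCH_URL = "https://openlibrary.org/search.json"


def build_search_url(entities):
    """Turn {label: value} into a search URL with a fielded `q` string.

    Assumption: the endpoint understands `field:"value"` clauses joined
    with AND. Unknown labels raise KeyError instead of being ignored.
    """
    clauses = []
    for label, value in entities.items():
        field = ENTITY_TO_FIELD[label]
        clauses.append(f'{field}:"{value}"')
    return f"{SEARCH_URL}?{urlencode({'q': ' AND '.join(clauses)})}"


url = build_search_url({"GENRE": "mystery", "AUTHOR": "Agatha Christie"})
```

Keeping the mapping in one dictionary means a change to the entity set only has to be made in one place.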

Example Training Data for Fine-Tuning

User Query	Labeled Entities
`"Find me a mystery novel by Agatha Christie"`	`GENRE: "mystery novel"`, `AUTHOR: "Agatha Christie"`
`"Books published in France in 1990"`	`PLACE: "France"`, `YEAR: "1990"`
`"Show books written in Spanish"`	`LANGUAGE: "Spanish"`
`"Find science fiction books set in Mars"`	`GENRE: "science fiction"`, `PLACE: "Mars"`
`"Give me books by J.K. Rowling published by Scholastic"`	`AUTHOR: "J.K. Rowling"`, `PUBLISHER: "Scholastic"`
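Token classification models such as BERT are typically trained on per-token tags rather than on entity strings, so rows like the ones above need converting first. Below is a minimal sketch that assigns BIO tags (B- for the first token of an entity, I- for the rest, O for everything else) using plain whitespace tokenization. The `bio_tags` function is hypothetical, and aligning these word-level tags to a model's subword tokens is a separate step not shown here.

```python
def bio_tags(query, entities):
    """Whitespace-tokenize `query` and assign BIO tags.

    `entities` is a list of (label, text) pairs where `text` occurs in the
    query as a run of whole tokens. Only the first untagged match of each
    entity is labeled.
    """
    tokens = query.split()
    tags = ["O"] * len(tokens)
    for label, text in entities:
        span = text.split()
        n = len(span)
        if n == 0:
            raise ValueError("entity text must not be empty")
        for i in range(len(tokens) - n + 1):
            free = all(t == "O" for t in tags[i:i + n])
            if tokens[i:i + n] == span and free:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                break
        else:
            # No matching run of tokens was found for this entity.
            raise ValueError(f"{text!r} not found in {query!r}")
    return list(zip(tokens, tags))


example = bio_tags(
    "Find me a mystery novel by Agatha Christie",
    [("GENRE", "mystery novel"), ("AUTHOR", "Agatha Christie")],
)
```

Whitespace tokenization keeps punctuation attached to words, so an entity followed directly by a comma or period would not match; real data would likely need a proper tokenizer.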

Initialize React App

npx create-react-app frontend --use-yarn --template cra-template --skip-install

Install Swagger Editor CLI

yarn add swagger-cli --dev

Validate Changes Consistently

Whenever the backend changes, revalidate the openapi.yaml file:

swagger-cli validate openapi.yaml

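Start the FastAPI development server:

fastapi dev app/main.py

The interactive API docs are then available at http://127.0.0.1:8000/docs (Swagger UI) and http://127.0.0.1:8000/redoc (ReDoc).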

Add TypeScript Support

yarn add typescript @types/react @types/react-dom

ngs&page=2 https://openlibrary.org/search/authors.json?q=twain

+----------------------------------------------------+
|  [Book Cover]    "The Lord of the Rings"          |
|                 by J.R.R. Tolkien                 |
|----------------------------------------------------|
| 🏆 250 Editions  📖 1193 Pages  🎙️ Audio Available |
| 🌎 Available in: EN, FR, SP, IT, DE...            |
| 📅 First Published: 1954                          |
| 🔖 Subjects: Fantasy, Middle-Earth, Magic         |
|----------------------------------------------------|
| 📥 Read Online  🔗 Borrow  ❤️ Add to Favorites    |
+----------------------------------------------------+

Clickable Tags for Filtering

Use auto-generated tags from subjects & characters.
Clicking a tag refines search results dynamically.
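The tag logic itself is independent of the UI framework, so it is sketched here in Python rather than in the React frontend. The sketch assumes each result carries `subject` and `person` lists, following the field names in the entity table. The two placeholder books and the helper names `generate_tags` and `refine` are hypothetical; only the first entry reuses the subjects shown in the card mockup.

```python
from collections import Counter

# Illustrative results only; the placeholder entries are not real books.
BOOKS = [
    {"title": "The Lord of the Rings",
     "subject": ["Fantasy", "Middle-Earth", "Magic"], "person": []},
    {"title": "Placeholder Book A",
     "subject": ["Fantasy"], "person": ["Placeholder Hero"]},
    {"title": "Placeholder Book B",
     "subject": ["Mystery"], "person": []},
]


def book_tags(book):
    """All tags for one book: its subjects plus its characters."""
    return book.get("subject", []) + book.get("person", [])


def generate_tags(books, limit=10):
    """Most frequent tags across the current results, most common first."""
    counts = Counter(tag for b in books for tag in book_tags(b))
    return [tag for tag, _ in counts.most_common(limit)]


def refine(books, selected):
    """Keep only books that carry every selected tag."""
    wanted = set(selected)
    return [b for b in books if wanted <= set(book_tags(b))]


tags = generate_tags(BOOKS)
fantasy_only = refine(BOOKS, ["Fantasy"])
```

Requiring every selected tag (a logical AND) makes each click narrow the results; matching any tag instead would be a one-line change if broader results are preferred.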

Workflow

User Query: "Find me a mystery novel set in Paris with a unique protagonist."
Backend Processing

LLM extracts keywords: genre=mystery, location=Paris, character=unique protagonist.
OpenLibrary API fetches books matching these criteria.
LLM summarizes the book descriptions and generates a natural language response.
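The three backend steps above can be sketched as a small pipeline in which each step is passed in as a function. This is an illustration under stated assumptions, not the project's actual implementation: `answer_query` and the three `fake_*` stand-ins are hypothetical, and the stand-ins return canned values so the example runs without an LLM or network access.

```python
from typing import Callable

Keywords = dict[str, str]


def answer_query(
    query: str,
    extract: Callable[[str], Keywords],
    search: Callable[[Keywords], list[dict]],
    summarize: Callable[[str, list[dict]], str],
) -> str:
    """Run the three backend steps in order and return the reply text."""
    keywords = extract(query)        # step 1: LLM keyword extraction
    books = search(keywords)         # step 2: OpenLibrary lookup
    return summarize(query, books)   # step 3: natural language response


# Stand-ins so the sketch runs offline. Real versions would call the
# LLM and the OpenLibrary API instead of returning canned values.
def fake_extract(query: str) -> Keywords:
    return {"genre": "mystery", "location": "Paris",
            "character": "unique protagonist"}


def fake_search(keywords: Keywords) -> list[dict]:
    return [{"title": "Placeholder Mystery",
             "description": "A placeholder description."}]


def fake_summarize(query: str, books: list[dict]) -> str:
    titles = ", ".join(b["title"] for b in books)
    return f"Found {len(books)} book(s): {titles}"


reply = answer_query(
    "Find me a mystery novel set in Paris with a unique protagonist.",
    fake_extract, fake_search, fake_summarize,
)
```

Passing the steps in as arguments keeps each one testable on its own and lets the extractor be swapped (for example, prompt-based LLM versus fine-tuned NER model) without touching the rest.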
