
Research

ddayto edited this page Feb 5, 2025 · 2 revisions

Pre-Trained Large Language Models (LLMs)

To extract keywords from user query input, we will use pre-trained Large Language Models (LLMs) that are specifically designed for natural language understanding (NLU) tasks like keyword extraction, intent detection, and entity recognition.

Below are some specific LLMs and tools I considered using, along with their tradeoffs.

Open-Source Models

I decided to fine-tune a pre-trained model (e.g., BERT, GPT) on a dataset of book-related queries.

Example Dataset:

[
  {
    "query": "Find me a sci-fi book about space exploration.",
    "keywords": ["sci-fi", "space exploration"]
  },
  {
    "query": "I want a romance novel set in Italy.",
    "keywords": ["romance", "Italy"]
  }
]

Scientific Articles

NER Datasets

  • CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) - Language-Independent Named Entity Recognition

Findings

Search Queries Dataset: ORCAS Dataset (Craswell et al., 2020)

  • Human-generated machine reading comprehension dataset.

  • Entity types (e.g., creative works) can be linguistically complex. They can be complex noun phrases (Eternal Sunshine of the Spotless Mind), gerunds (Saving Private Ryan), infinitives (To Kill a Mockingbird), or full clauses (Mr. Smith Goes to Washington). Syntactic parsing of such nouns is hard, and most current parsers and NER systems fail to recognize them.

  • MultiCoNER (WNUT taxonomy entity types)

  • Creative work entities: CREATIVE-WORK (CW), covering movie, song, and book titles

Fine-Tuning

The objective is to fine-tune a BERT model for Named Entity Recognition (NER) for book search. This will require datasets containing labeled entities (e.g., TITLE, AUTHOR, GENRE).

Below is the approach I used to find, curate, and prepare the best dataset for our use case.

Search for Pre-Labeled NER Datasets

I searched for existing NER datasets that contain book-related entities, prioritizing datasets labeled with titles, authors, genres, places, and publishers.

Hugging Face Datasets

https://huggingface.co/datasets

Named Entity Recognition for Book Search

Primary Entities to Extract

| Entity Label | Description | OpenLibraryBook Field |
|---|---|---|
| TITLE | Book title | title |
| AUTHOR | Author name | author_name, author_key |
| GENRE | Book subject/category | subject |
| PUBLISHER | Publisher | publisher |
| FORMAT | Book format (e.g., Paperback) | format |
| LANGUAGE | Language of the book | language |
| PLACE | Book setting or publishing location | place, publish_place |
| CHARACTER | Important book characters | person |
| YEAR | First publication year | first_publish_year, publish_year |
| ISBN | ISBN identifier | isbn |
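As a rough illustration of how the mapping above might be used, the sketch below turns a set of extracted entities into an Open Library search URL. Several things here are assumptions rather than part of the original notes: the `https://openlibrary.org/search.json` endpoint, the use of `field:"value"` clauses inside the `q` parameter, the choice of the first field wherever the table lists two, and the helper name `build_search_url`.

```python
from urllib.parse import urlencode

# Entity label -> Open Library field, copied from the table above.
# Where the table lists two fields, the first is used (an assumption).
ENTITY_TO_FIELD = {
    "TITLE": "title",
    "AUTHOR": "author_name",
    "GENRE": "subject",
    "PUBLISHER": "publisher",
    "FORMAT": "format",
    "LANGUAGE": "language",
    "PLACE": "place",
    "CHARACTER": "person",
    "YEAR": "first_publish_year",
    "ISBN": "isbn",
}

# Assumed search endpoint; not named in the notes above.
SEARCH_URL = "https://openlibrary.org/search.json"


def build_search_url(entities: dict[str, str], page: int = 1) -> str:
    """Build a search URL from {entity label: extracted text} (hypothetical helper)."""
    clauses = []
    for label, value in entities.items():
        field = ENTITY_TO_FIELD.get(label)
        if field is None:
            continue  # ignore labels that have no mapped field
        # Assumption: the endpoint accepts field:"value" clauses in q.
        clauses.append(f'{field}:"{value}"')
    params = {"q": " ".join(clauses), "page": page}
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url({"GENRE": "mystery", "AUTHOR": "Agatha Christie"})
print(url)
```

Keeping the mapping in a single dictionary means a new entity label only needs one extra entry, and unknown labels from the NER model are skipped rather than breaking the request.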

Example Training Data for Fine-Tuning

| User Query | Labeled Entities |
|---|---|
| "Find me a **mystery novel** by **Agatha Christie**" | GENRE: "mystery novel", AUTHOR: "Agatha Christie" |
| "Books published in **France** in **1990**" | PLACE: "France", YEAR: "1990" |
| "Show books written in **Spanish**" | LANGUAGE: "Spanish" |
| "Find **science fiction** books set in **Mars**" | GENRE: "science fiction", PLACE: "Mars" |
| "Give me books by **J.K. Rowling** published by **Scholastic**" | AUTHOR: "J.K. Rowling", PUBLISHER: "Scholastic" |

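Token-classification models such as BERT are usually trained on per-token tags rather than on phrase-level annotations, so labeled queries like the ones above would typically be converted first. The sketch below shows one possible conversion to BIO tags. It assumes plain whitespace tokenization and marks only the first occurrence of each phrase; the function name `bio_tags` is hypothetical, and a real pipeline would also have to align these tags with the model's subword tokenizer.

```python
def bio_tags(query: str, entities: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Tag each whitespace token of `query` with B-/I-/O labels.

    `entities` is a list of (label, phrase) pairs, e.g. ("AUTHOR", "Agatha Christie").
    """
    tokens = query.split()
    tags = ["O"] * len(tokens)
    for label, phrase in entities:
        words = phrase.split()
        if not words:
            continue  # skip empty phrases
        n = len(words)
        # Slide a window over the tokens and tag the first exact match.
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == words:
                tags[start] = f"B-{label}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{label}"
                break
    return list(zip(tokens, tags))


example = bio_tags(
    "Find me a mystery novel by Agatha Christie",
    [("GENRE", "mystery novel"), ("AUTHOR", "Agatha Christie")],
)
for token, tag in example:
    print(f"{token}\t{tag}")
```

Whitespace splitting keeps the example short, but it would miss entities followed directly by punctuation (for instance "Christie."), so a proper tokenizer is likely needed for real queries.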
Initialize React App

npx create-react-app frontend --use-yarn --template cra-template --skip-install

Install Swagger CLI

yarn add swagger-cli --dev

Validate Changes Consistently

As changes are made to the backend, revalidate the openapi.yaml file after any updates:

swagger-cli validate openapi.yaml
fastapi dev app/main.py

http://127.0.0.1:8000/docs
http://127.0.0.1:8000/redoc

Add TypeScript Support

yarn add typescript @types/react @types/react-dom

ngs&page=2 https://openlibrary.org/search/authors.json?q=twain

+----------------------------------------------------+
|  [Book Cover]    "The Lord of the Rings"          |
|                 by J.R.R. Tolkien                 |
|----------------------------------------------------|
| 🏆 250 Editions  📖 1193 Pages  🎙️ Audio Available |
| 🌎 Available in: EN, FR, SP, IT, DE...            |
| 📅 First Published: 1954                          |
| 🔖 Subjects: Fantasy, Middle-Earth, Magic         |
|----------------------------------------------------|
| 📥 Read Online  🔗 Borrow  ❤️ Add to Favorites    |
+----------------------------------------------------+

Clickable Tags for Filtering

  • Use auto-generated tags from subjects & characters.
  • Clicking a tag refines search results dynamically.
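The tag behavior described above could be prototyped on the backend roughly as follows. This is only a sketch: it assumes each search result is a dict with optional `subject` and `person` lists (the field names from the entity table), and the helper names `generate_tags` and `filter_by_tags` are hypothetical. The sample records are illustrative, not real API output.

```python
from collections import Counter


def generate_tags(books: list[dict], limit: int = 10) -> list[str]:
    """Return the most frequent subjects/characters across the result set."""
    counts = Counter()
    for book in books:
        # Use a set so a tag is counted at most once per book.
        for tag in set(book.get("subject", [])) | set(book.get("person", [])):
            counts[tag] += 1
    # Sort by frequency, then alphabetically, so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:limit]]


def filter_by_tags(books: list[dict], selected: list[str]) -> list[dict]:
    """Keep only books that carry every selected tag (case-insensitive)."""
    wanted = {tag.lower() for tag in selected}
    kept = []
    for book in books:
        book_tags = {t.lower() for t in book.get("subject", []) + book.get("person", [])}
        if wanted <= book_tags:
            kept.append(book)
    return kept


# Illustrative sample records (made up for this example).
sample = [
    {"title": "The Lord of the Rings", "subject": ["Fantasy", "Magic"], "person": ["Frodo"]},
    {"title": "The Hobbit", "subject": ["Fantasy"], "person": ["Bilbo"]},
]
print(generate_tags(sample, limit=3))
print([b["title"] for b in filter_by_tags(sample, ["Fantasy", "Magic"])])
```

Requiring every selected tag to match means each additional click narrows the result set, which fits the "refines search results" behavior; an any-tag match would widen it instead.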

Workflow

  1. User Query: "Find me a mystery novel set in Paris with a unique protagonist."

  2. Backend Processing

  • LLM extracts keywords: genre=mystery, location=Paris, character=unique protagonist.

  • OpenLibrary API fetches books matching these criteria.

  • LLM summarizes the book descriptions and generates a natural language response.
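The workflow above can be sketched as a small pipeline with three pluggable steps. Everything named here (`SearchPipeline`, the `fake_*` functions, the fallback message) is a hypothetical stand-in: the real extraction and summarization steps would call an LLM, and the search step would call the OpenLibrary API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SearchPipeline:
    """Wires together the three backend steps; each step is injected as a callable."""
    extract: Callable[[str], dict[str, str]]        # LLM keyword extraction
    search: Callable[[dict[str, str]], list[dict]]  # OpenLibrary lookup
    summarize: Callable[[str, list[dict]], str]     # LLM summary of the results

    def run(self, query: str) -> str:
        keywords = self.extract(query)
        # Skip the search entirely when nothing was extracted.
        books = self.search(keywords) if keywords else []
        if not books:
            return "No matching books found."
        return self.summarize(query, books)


# Stand-in implementations so the sketch runs without an LLM or network access.
def fake_extract(query: str) -> dict[str, str]:
    return {"genre": "mystery", "location": "Paris", "character": "unique protagonist"}


def fake_search(keywords: dict[str, str]) -> list[dict]:
    return [{"title": "Example Book", "description": "A placeholder description."}]


def fake_summarize(query: str, books: list[dict]) -> str:
    titles = ", ".join(book["title"] for book in books)
    return f"Found {len(books)} book(s): {titles}"


pipeline = SearchPipeline(fake_extract, fake_search, fake_summarize)
print(pipeline.run("Find me a mystery novel set in Paris with a unique protagonist."))
```

Injecting the steps as callables keeps each one testable in isolation and lets the LLM or search backend be swapped without touching the orchestration.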
