Open
138 commits
eff65ef
add script for downloading data set
OtterShadows Mar 12, 2026
f7125d2
Merge branch 'main' of https://github.com/OtterShadows/CS4300_The_Str…
OtterShadows Mar 12, 2026
1bbf089
Added language processing files
Jooosh01 Mar 12, 2026
dffcdf4
add json_to_csv function
OtterShadows Mar 12, 2026
29d0bdc
implemented charCounts, started sent-anal
Jooosh01 Mar 12, 2026
5a2babd
Merge branch 'main' of https://github.com/OtterShadows/CS4300_The_Str…
Jooosh01 Mar 12, 2026
c07ef3a
moved data set to data
Jooosh01 Mar 12, 2026
3d15bd0
char conuts data
Jooosh01 Mar 12, 2026
eb76210
filter out redundant json files
OtterShadows Mar 12, 2026
b0e5916
Merge branch 'main' of https://github.com/OtterShadows/CS4300_The_Str…
OtterShadows Mar 12, 2026
4ff8286
add score field to dataset
OtterShadows Mar 12, 2026
aa476d3
add controversiality field to csv
OtterShadows Mar 12, 2026
72c5fa0
minor bug fixes in character-counts.py
OtterShadows Mar 12, 2026
6b38979
minor bug fix in character-counts
OtterShadows Mar 12, 2026
bf509ad
Initial Front End Mockup
Gabby23x Mar 19, 2026
ec8c8ec
implement match name function
OtterShadows Mar 20, 2026
850c0fc
small fix
OtterShadows Mar 20, 2026
5b29959
implement functions 1.5 and 2 in similarity_calc.py
OtterShadows Mar 20, 2026
9e8f50a
key
Jooosh01 Mar 20, 2026
05c43d1
implement get_character_rating
OtterShadows Mar 20, 2026
9490730
lang processing csvs
Jooosh01 Mar 20, 2026
f681003
Merge remote-tracking branch 'origin/gabbys-frontendinit' into derek-…
OtterShadows Mar 20, 2026
7040b4d
Merge pull request #2 from OtterShadows/josh-langproc
Jooosh01 Mar 20, 2026
70a2974
Merge pull request #1 from OtterShadows/gabbys-frontendinit
Jooosh01 Mar 20, 2026
902053d
Skeleton of Character_Class
Jooosh01 Mar 20, 2026
f28e288
Merge branch 'main' into derek-backend
OtterShadows Mar 20, 2026
b96d118
Merge pull request #3 from OtterShadows/derek-backend
OtterShadows Mar 20, 2026
0336184
Provided function stubs and comments
Jooosh01 Mar 20, 2026
81adac7
class and sent anal done
OtterShadows Mar 21, 2026
39d67a3
attempts at showing rankings
OtterShadows Mar 21, 2026
74b5214
replace mock data with info from backend
OtterShadows Mar 23, 2026
bf46533
minor bug fix
OtterShadows Mar 23, 2026
8f29498
fix the character rating function
OtterShadows Mar 23, 2026
31d3cc2
add num_mentions function
OtterShadows Mar 23, 2026
490184f
attach dates to popularity plot
OtterShadows Mar 23, 2026
6ef3bd0
requirements update
Gabby23x Mar 23, 2026
6ec8be0
fix star rating distribution to be wider
OtterShadows Mar 23, 2026
179402c
search documents using corrected query
OtterShadows Mar 23, 2026
e8e224d
searching for name now requires "name:" in query
OtterShadows Mar 23, 2026
efcd641
Merge pull request #4 from OtterShadows/derek-backend
OtterShadows Mar 23, 2026
3d02c78
git nightmare incoming
Jooosh01 Mar 23, 2026
0eb368e
Underscores
Jooosh01 Mar 23, 2026
e7fca6d
baseline sim calc is done
Jooosh01 Mar 23, 2026
07d62ad
it is connected to the frontend
Jooosh01 Mar 24, 2026
287a8bf
Merge branch 'front-end-anal' into search_hotfix_PR3
Jooosh01 Mar 24, 2026
67b4c7d
Merge pull request #5 from OtterShadows/search_hotfix_PR3
Jooosh01 Mar 24, 2026
1bbfc20
Merge branch 'main' into front-end-anal
Jooosh01 Mar 24, 2026
1d3b0c4
Revert "Merge branch 'main' into front-end-anal"
Jooosh01 Mar 24, 2026
87af0da
Revert "searching for name now requires "name:" in query"
Gabby23x Mar 24, 2026
e04d281
Update character-search.html
Jooosh01 Mar 24, 2026
f172b99
Merge pull request #6 from OtterShadows/front-end-anal
Jooosh01 Mar 24, 2026
05bdbb4
:(
Jooosh01 Mar 24, 2026
7e3404c
Merge branch 'main' into prototype
Gabby23x Mar 24, 2026
e14421f
Merge pull request #7 from OtterShadows/prototype
Gabby23x Mar 24, 2026
4e57258
bugs in sijm calc
Jooosh01 Mar 24, 2026
f1e58ae
requirements
Jooosh01 Mar 24, 2026
fe058d4
Revert "searching for name now requires "name:" in query"
Gabby23x Mar 24, 2026
2e9be00
Merge pull request #8 from OtterShadows/derek-backend
Gabby23x Mar 24, 2026
52a449c
num mentions
Jooosh01 Mar 24, 2026
da12557
Finished Character Class and Pickled
Jooosh01 Mar 24, 2026
9a0ce00
Front end connection
Jooosh01 Mar 24, 2026
6155135
Do not strip or lowercase names
Jooosh01 Mar 26, 2026
837c461
debugging and slight syntax change
Jooosh01 Mar 26, 2026
6caeb68
fix "fail to fetch character"
OtterShadows Mar 27, 2026
aeb6e1a
get graph to display data from backend
OtterShadows Mar 27, 2026
1402233
names and mentions working
Jooosh01 Mar 27, 2026
cc7937e
Merge branch 'jw-derek-josh-meth' into derek-backend
OtterShadows Mar 27, 2026
06adece
get rid of character-counts.py
OtterShadows Mar 27, 2026
106b5b6
build reverse postings handling aliases
OtterShadows Mar 28, 2026
168d4a4
(buggy) connect alias search to frontend
OtterShadows Mar 28, 2026
18632a3
Merge branch 'main' into jw-derek-josh-meth
Jooosh01 Mar 28, 2026
365bc19
Merge pull request #9 from OtterShadows/jw-derek-josh-meth
Jooosh01 Mar 28, 2026
c0b685d
change sklearn to scikit-learn in requirements.txt
OtterShadows Mar 30, 2026
df39a76
Merge pull request #10 from OtterShadows/deploy-error-hotfix
OtterShadows Mar 30, 2026
77fe0dc
add specs to similarity_calc.py functions
OtterShadows Apr 1, 2026
6a7e75b
reorganize file structure with csv files to make less confusing
OtterShadows Apr 1, 2026
fee5cff
rebuild model with similarity_calc.py
OtterShadows Apr 1, 2026
60bf229
remove helper functions and add specs
OtterShadows Apr 1, 2026
cedbf6d
add code to run character_counts.py functions
OtterShadows Apr 1, 2026
8d963eb
fix dependency issues
OtterShadows Apr 4, 2026
319530a
comment in routes.py for fixing search function
OtterShadows Apr 4, 2026
24eedc4
retrieve most similar character based on cossim with query
OtterShadows Apr 5, 2026
ac8506e
add functions in similarity_calc to calculate most relevant comments …
OtterShadows Apr 5, 2026
975857e
incorporate similarity score of comment into comment class
OtterShadows Apr 6, 2026
ce4da40
frontend now shows similarity ranked comments
OtterShadows Apr 6, 2026
481eb1d
minor format changes to retrieved comment section
OtterShadows Apr 6, 2026
b84b7d5
truly nothing important
OtterShadows Apr 6, 2026
e4ea4cc
Merge branch 'main' into derek-backend
Gabby23x Apr 8, 2026
176e547
Merge pull request #11 from OtterShadows/derek-backend
Gabby23x Apr 8, 2026
0a37b09
added requirement for deployment
Gabby23x Apr 8, 2026
e8fa038
Merge pull request #12 from OtterShadows/main-requirement
Gabby23x Apr 8, 2026
2c086ad
docker changes
Gabby23x Apr 8, 2026
5aa03ec
Merge pull request #13 from OtterShadows/main-requirement
Gabby23x Apr 8, 2026
11d2268
Revert "docker changes"
Gabby23x Apr 8, 2026
a0e4823
Merge pull request #14 from OtterShadows/revert-13-main-requirement
Gabby23x Apr 8, 2026
2f37c7d
One Piece Photo Grab Implementation
Gabby23x Apr 9, 2026
f888ceb
Merge branch 'main' into main-requirement
Gabby23x Apr 9, 2026
4e26bbc
Merge branch 'derek-backend' of https://github.com/OtterShadows/CS430…
OtterShadows Apr 10, 2026
c0fdb82
comment out make_pickle() for the model
OtterShadows Apr 10, 2026
116f6ff
replace reference to rp with absolute path
OtterShadows Apr 10, 2026
728795f
Merge pull request #16 from OtterShadows/deploy-error-hotfix
OtterShadows Apr 10, 2026
912d72a
resolve rest of relative references to absolute (i think)
OtterShadows Apr 10, 2026
2e93c6f
Merge pull request #17 from OtterShadows/deploy-error-hotfix
OtterShadows Apr 10, 2026
38306d2
Merge branch 'main' into derek-backend
OtterShadows Apr 10, 2026
9edfb4e
fix reference to absolute
OtterShadows Apr 10, 2026
1ea75f1
add helper dict and list for alias accounting
OtterShadows Apr 10, 2026
500264b
check against aliases in query_character function
OtterShadows Apr 10, 2026
66d7eb1
hopefully fix nltk vader lexicon not found error
OtterShadows Apr 10, 2026
ac3803b
Merge branch 'main' into main-requirement
OtterShadows Apr 10, 2026
e0d4fb7
Merge pull request #15 from OtterShadows/main-requirement
OtterShadows Apr 10, 2026
cc67597
merge changes
OtterShadows Apr 11, 2026
b6b2062
Merge pull request #20 from OtterShadows/main-merge-fix
OtterShadows Apr 11, 2026
21901cf
Merge branch 'derek-backend' into merge-backend-and-main
OtterShadows Apr 11, 2026
34f1262
rest of merge
OtterShadows Apr 11, 2026
ef483a3
fix backend error that i caused in derek-backend
OtterShadows Apr 11, 2026
ab442ba
Merge pull request #21 from OtterShadows/merge-backend-and-main
OtterShadows Apr 11, 2026
b3cb159
changed summary message
MaureenKaminja Apr 11, 2026
929dd32
Merge branch 'main' of https://github.com/OtterShadows/CS4300_The_Str…
MaureenKaminja Apr 11, 2026
1141add
fsummary
MaureenKaminja Apr 11, 2026
2e772c1
cleaner version
MaureenKaminja Apr 11, 2026
b297bbb
filtered comments with one word
MaureenKaminja Apr 11, 2026
fda49f9
making a module
MaureenKaminja Apr 11, 2026
b3de007
change to joblib
MaureenKaminja Apr 11, 2026
ce18d14
add line to remove one word comment
MaureenKaminja Apr 11, 2026
16c679e
cleaner filter
MaureenKaminja Apr 11, 2026
a10a778
Auto regenerate pickle file
Gabby23x Apr 11, 2026
38023b2
Merge branch 'feature/my-change' of https://github.com/OtterShadows/C…
Gabby23x Apr 11, 2026
01ff89f
moved filter to character class
MaureenKaminja Apr 11, 2026
2db49e3
Merge branch 'feature/my-change' of https://github.com/OtterShadows/C…
MaureenKaminja Apr 11, 2026
660ba3f
small edit
MaureenKaminja Apr 11, 2026
6b2b8fb
more edits to character class
MaureenKaminja Apr 11, 2026
c7f6319
character class edit
MaureenKaminja Apr 12, 2026
89903e6
fixing nonetype errors
MaureenKaminja Apr 12, 2026
7b9c6f0
get ratings graph to display again
OtterShadows Apr 12, 2026
2b81b7f
implemented a rag file
MaureenKaminja Apr 22, 2026
9f2c375
Merge branch 'feature/my-change' of https://github.com/OtterShadows/C…
MaureenKaminja Apr 22, 2026
2550242
minor change
MaureenKaminja Apr 22, 2026
04f9a18
final
MaureenKaminja Apr 22, 2026
24 changes: 24 additions & 0 deletions .vscode/launch.json
@@ -0,0 +1,24 @@
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python Debugger: Flask",
            "type": "debugpy",
            "request": "launch",
            "module": "flask",
            "env": {
                "FLASK_APP": "src\\app.py",
                "FLASK_DEBUG": "1"
            },
            "args": [
                "run",
                "--no-debugger",
                "--no-reload"
            ],
            "jinja": true
        }
    ]
}
38 changes: 38 additions & 0 deletions regenerate_character_data.py
@@ -0,0 +1,38 @@
#!/usr/bin/env python
"""Script to regenerate character_data.pkl from the latest character_class code."""

import os
import sys
import joblib

# Add src to path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'src'))

from language_processing import character_class

# Load data
print("Loading comment and posting data...")
comments_df, postings_df = character_class.load_data()

# Create all characters
print("Creating character objects...")
characters = character_class.create_all_characters(postings_df, comments_df)

# Convert to dict
print("Converting characters to dictionary format...")
char_dict = character_class.characters_to_dict(characters)

# Save to pickle file
output_path = os.path.join(
    os.path.dirname(__file__),
    'src',
    'language_processing',
    'data',
    'character_data.pkl'
)

print(f"Saving character data to {output_path}...")
joblib.dump(char_dict, output_path)

print("✅ Successfully regenerated character_data.pkl!")
print(f"Total characters: {len(char_dict)}")
10 changes: 10 additions & 0 deletions requirements.txt
@@ -6,3 +6,13 @@ Werkzeug==2.2.2
gunicorn==25.0.3
python-dotenv
git+https://github.com/MrPeterss/infosci-spark-client.git
nltk
pandas
scikit-learn
spacy
numpy
joblib
datetime
rapidfuzz
requests

1,279 changes: 1,279 additions & 0 deletions reverse_postings.csv

Large diffs are not rendered by default.

Empty file.
227 changes: 227 additions & 0 deletions src/language_processing/character_class.py
@@ -0,0 +1,227 @@
import os
import pandas as pd
from language_processing import sent_anal
from datetime import datetime
import re
import joblib
from language_processing.rag import generate_character_summary
# from src.language_processing import similarity_calc

#To speed up multiple calls of the functions.
comment_cache = {}
# comments is a csv with columns id, timestamp, score, controversiality, text
# comments_df = pd.read_csv("data/piratefolk_comments.csv")
# postings is a csv with columns character, comment_ids (comma separated)
# postings_df_path = pd.read_csv("src/language_processing/csv/reverse_postings_alias_exact.csv")
def load_data():
    current_dir = os.path.dirname(os.path.abspath(__file__))

    comments_df = pd.read_csv(
        os.path.join(current_dir, "data", "piratefolk_comments.csv")
    ).set_index("id")

    def is_valid(text):
        words = re.findall(r'\b\w+\b', str(text))
        return len(words) > 1

    comments_df = comments_df[comments_df["text"].apply(is_valid)]

    postings_df = pd.read_csv(
        os.path.join(current_dir, "csv", "reverse_postings_alias_exact.csv")
    ).drop_duplicates(subset="character").set_index("character")

    valid_ids = set(comments_df.index)

    postings_df["comment_ids"] = postings_df["comment_ids"].apply(
        lambda x: ",".join(cid for cid in x.split(",") if cid in valid_ids)
    )

    return comments_df, postings_df
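
Review note: a minimal sketch of the filtering step in load_data, written against plain dicts instead of pandas so it runs on its own. The sample comments and the helper name prune_postings are hypothetical; only the regex and the "more than one word" rule come from the diff above.

```python
import re


def is_valid(text):
    """Keep only comments with more than one word (same rule as load_data)."""
    words = re.findall(r'\b\w+\b', str(text))
    return len(words) > 1


def prune_postings(postings, valid_ids):
    """Drop comment ids that did not survive the filter.

    `postings` maps a character name to a comma-separated id string,
    mirroring the comment_ids column. Hypothetical helper for illustration.
    """
    return {
        name: ",".join(cid for cid in ids.split(",") if cid in valid_ids)
        for name, ids in postings.items()
    }


# Hypothetical sample data, not taken from the real dataset.
comments = {"c1": "Zoro carried this arc", "c2": "lol", "c3": "peak fiction"}
postings = {"Zoro": "c1,c2", "Luffy": "c2,c3"}

valid_ids = {cid for cid, text in comments.items() if is_valid(text)}
pruned = prune_postings(postings, valid_ids)
print(pruned)
```

One thing worth knowing about this rule: the apostrophe is not a word character, so a lone contraction such as "don't" is split into two tokens and passes the filter.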

class Rating:
    def __init__(self, date, rating, sentiment):
        self.date = date
        self.rating = rating
        #sent anal positive, negative, neutral
        assert sentiment in ["positive", "negative", "neutral"]
        self.sentiment = sentiment


class Comment:
    def __init__(self, user, text, sentiment, rating=None, score=None, timestamp=None, controversiality=None, sim_score=None):
        self.user = user
        self.text = text
        self.sentiment = sentiment
        #do we have this elsewhere when you set get comment
        self.rating = rating
        self.score = score
        self.timestamp = timestamp
        self.controversiality = controversiality
        self.sim_score = sim_score


# new version of get_comment to account for similarity score
def create_comment(id, sim_score, comments_df):
    if id in comment_cache:
        return comment_cache[id]
    if id not in comments_df.index:
        return None
    row = comments_df.loc[id]
    if isinstance(row, pd.DataFrame):
        row = row.iloc[0]
    text = str(row["text"])
    sentiment = sent_anal.get_sentiment(text)
    comment = Comment(
        user=None,
        text=text,
        sentiment=sentiment,
        rating=0,
        score=float(row["score"]),
        sim_score=sim_score,
        timestamp=float(row["timestamp"]),
        controversiality=int(row["controversiality"])
    )

    comment_cache[id] = comment
    return comment


#uses character name to create the rating over time using character name
def get_rating_over_time(charName, postings_df, comments_df):
    #get all comments for the character, then create a list of ratings over time
    ids = postings_df.loc[charName, "comment_ids"]

    if not isinstance(ids, str) or not ids.strip():
        return []
    comment_ids = [cid for cid in ids.split(",") if cid.strip()]

    #make list of comment objects using get_comment function then sort by timestamp
    comments = sorted(
        [
            c for c in (create_comment(cid, 0, comments_df) for cid in comment_ids)
            if c is not None
        ],
        key=lambda x: x.timestamp)
    ratings_over_time = []
    init_score = 100
    for comment in comments:
        date = datetime.fromtimestamp(comment.timestamp)
        if comment.sentiment == "positive":
            init_score += 20
        elif comment.sentiment == "negative":
            init_score -= 20
        #neutral does not change the score
        #print(f"Date: {date}, Sentiment: {sentiment}, Rating: {rating}")
        ratings_over_time.append(Rating(datetime.fromtimestamp(comment.timestamp), init_score, comment.sentiment))
    return sorted(ratings_over_time, key=lambda r: r.date)
#get_rating_over_time("Jika")

class Character:
    def __init__(self, name, rank, total_comments, sentiment, sentiment_score, summary,
                 ratings_over_time=None, comments=None, retrieved=None):
        self.name = name
        #some trivial pattern matching function
        self.rank = rank
        #the sentiment that has the most in ratings over time
        self.sentiment = sentiment
        #make some metric using the enum
        self.sentiment_score = sentiment_score
        #ong put dummy data here for now
        self.summary = summary
        #complete ratings_over_time function
        self.ratings_over_time = ratings_over_time if ratings_over_time is not None else []
        #should be rating of final rating in ratings over time
        self.current_rating = self.ratings_over_time[len(self.ratings_over_time)-1].rating if self.ratings_over_time else 0
        self.comments = comments if comments is not None else []
        self.total_comments = total_comments
        #ask gabby what the difference was supposed to be... might be the ranked most relevant comments?
        self.retrieved = retrieved if retrieved is not None else []


def create_character(name, postings_df, comments_df, use_llm_summary=False):
    ids = postings_df.loc[name, "comment_ids"]

    if isinstance(ids, pd.Series):
        ids = ids.iloc[0]

    if not isinstance(ids, str) or not ids.strip():
        comment_ids = []
    else:
        comment_ids = [cid for cid in ids.split(",") if cid.strip()]

    comment_list = [
        c for c in (create_comment(cid, 0, comments_df) for cid in comment_ids)
        if c is not None
    ]

    ratings_over_time = get_rating_over_time(name, postings_df, comments_df)

    pos = sum(1 for c in comment_list if c.sentiment == "positive")
    neg = sum(1 for c in comment_list if c.sentiment == "negative")

    if pos > neg:
        summary = f"There is an overall positive sentiment with {pos} positive vs {neg} negative comments."
    elif neg > pos:
        summary = f"There is an overall negative sentiment with {neg} negative vs {pos} positive comments."
    else:
        summary = "We found that there is mixed sentiment from the community."

    # compute score + sentiment
    sentiment_score = ratings_over_time[-1].rating if ratings_over_time else 100
    sentiment = comment_list[-1].sentiment if comment_list else "neutral"

    # ranking
    rank = (
        "A" if sentiment_score > 100
        else "C" if sentiment_score < 80
        else "B"
    )

    character = Character(
        name,
        rank,
        len(comment_list),
        sentiment,
        sentiment_score,
        summary,
        ratings_over_time,
        comment_list,
        comment_list[:5]
    )

    if use_llm_summary:
        try:
            character.summary = generate_character_summary(character)
        except Exception:
            pass  # keep fallback summary if LLM fails

    return character
#create_character("Jika")


def create_all_characters(postings_df, comments_df):
    return [create_character(name, postings_df, comments_df) for name in postings_df.index]

def characters_to_dict(characters):
    char_dict = {}
    for character in characters:
        char_dict[character.name] = {
            "rank": character.rank,
            "total_comments": character.total_comments,
            "sentiment": character.sentiment,
            "currentRating": character.sentiment_score,
            "summary": character.summary,
            #ratings as a list of dicts
            # "ratings_over_time": [(r.date.timestamp(), r.rating) for r in character.ratings_over_time],
            "ratings_over_time": [{"date": r.date.timestamp(), "rating": r.rating, "sentiment": r.sentiment} for r in character.ratings_over_time],
            #comments as a list of dicts
            "comments": [{"user": c.user, "text": c.text, "sentiment": c.sentiment, "rating": c.rating, "score": c.score, "timestamp": c.timestamp, "controversiality": c.controversiality} for c in character.comments],
            #retrieved as a list of dicts
            "retrieved": [{"user": c.user, "text": c.text, "sentiment": c.sentiment, "rating": c.rating, "score": c.score, "timestamp": c.timestamp, "controversiality": c.controversiality} for c in character.retrieved]
        }
    return char_dict
#print(create_all_characters())
# comments_df, postings_df = load_data()
# joblib.dump(characters_to_dict(create_all_characters(postings_df, comments_df)), "src/language_processing/data/character_data.pkl")
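
Review note: a minimal sketch of how the running rating and the letter rank interact, assuming the rules shown in get_rating_over_time and create_character (start at 100, plus or minus 20 per positive or negative comment, A above 100, C below 80, B otherwise). The names running_ratings and rank_for, the constants, and the sample timestamps are hypothetical, and the sketch uses UTC dates where the diff uses local time.

```python
from datetime import datetime, timezone

START_SCORE = 100  # initial score, as in get_rating_over_time
STEP = 20          # points added or removed per positive/negative comment


def running_ratings(comments):
    """Return (date, score, sentiment) tuples in timestamp order.

    `comments` is a list of (timestamp, sentiment) pairs. Hypothetical
    stand-in for the Comment objects used in the diff.
    """
    score = START_SCORE
    out = []
    for ts, sentiment in sorted(comments, key=lambda c: c[0]):
        if sentiment == "positive":
            score += STEP
        elif sentiment == "negative":
            score -= STEP
        # neutral comments leave the score unchanged
        out.append((datetime.fromtimestamp(ts, tz=timezone.utc), score, sentiment))
    return out


def rank_for(score):
    """Same thresholds as the rank expression in create_character."""
    return "A" if score > 100 else "C" if score < 80 else "B"


# Hypothetical sample: (timestamp, sentiment), deliberately out of order.
sample = [(300, "negative"), (100, "positive"), (200, "neutral"), (400, "negative")]
ratings = running_ratings(sample)
final_score = ratings[-1][1] if ratings else START_SCORE
print([r[1] for r in ratings], rank_for(final_score))  # [120, 120, 100, 80] B
```

Because the score moves in steps of 20 and both comparisons are strict, a final score of exactly 80 or 100 ranks B. A character needs at least two more negative than positive comments to reach C, while a single net positive comment is enough for A.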

