Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions collection.go
Original file line number Diff line number Diff line change
Expand Up @@ -317,6 +317,20 @@ func (c *Collection) GetByID(ctx context.Context, id string) (Document, error) {
return Document{}, fmt.Errorf("document with ID '%v' not found", id)
}

// ListDocuments returns all documents in the collection.
// The returned documents are a copy of the original documents, so they can be safely
// modified without affecting the collection.
Comment on lines +321 to +322

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The slice is new and can be modified without affecting the internal c.documents, but the documents themselves are not full copies. When you do x := y, then y's simple fields (int, string) are entirely new, but maps and slices are only_shallow_ copies.

Demonstration: https://go.dev/play/p/0OccI4ibtS2

See the above GetByID where the Metadata and Embedding fields are cloned separately to create an entirely new document.

So here we have two options:

  • Change the Godoc to clarify that the documents are shallow copies, and only the slice is new. This still allows the receiver to work with the slice, like iterating over it and reading the documents, without concurrency issues during regular operations. For example chromem-go can still add new documents to its c.Documents map, or delete them, and it doesn't affect the returned slice. Here's an example in chromem-go where something similar is done:

    chromem-go/db.go

    Lines 517 to 522 in 8311eb0

    // The returned map is a copy of the internal map, so it's safe to directly modify
    // the map itself. Direct modifications of the map won't reflect on the DB's map.
    // To do that use the DB's methods like [DB.CreateCollection] and [DB.DeleteCollection].
    // The map is not an entirely deep clone, so the collections themselves are still
    // the original ones. Any methods on the collections like Add() for adding documents
    // will be reflected on the DB's collections and are concurrency-safe.
  • Or create a deep copy of documents. This can either be done by calling the GetByID for each document, or by copying the code from that method. The former leads to less code, but one extra operation per document (the c.Documents lookup).

func (c *Collection) ListDocuments() ([]Document, error) {
c.documentsLock.RLock()
defer c.documentsLock.RUnlock()

docs := make([]Document, 0, len(c.documents))
for _, doc := range c.documents {
docs = append(docs, *doc)
}
return docs, nil
}

// Delete removes document(s) from the collection.
//
// - where: Conditional filtering on metadata. Optional.
Expand Down
34 changes: 34 additions & 0 deletions collection_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -477,6 +477,40 @@
}
}

func TestCollection_ListDocuments(t *testing.T) {
// Create collection
db := NewDB()
name := "test"
metadata := map[string]string{"foo": "bar"}
vectors := []float32{-0.40824828, 0.40824828, 0.81649655} // normalized version of `{-0.1, 0.1, 0.2}`
embeddingFunc := func(_ context.Context, _ string) ([]float32, error) {
return vectors, nil
}
c, _ := db.CreateCollection(name, metadata, embeddingFunc)

// Add documents
ids := []string{"1", "2", "3", "4"}
metadatas := []map[string]string{{"foo": "bar"}, {"a": "b"}, {"foo": "bar"}, {"e": "f"}}
contents := []string{"hello world", "hallo welt", "bonjour le monde", "hola mundo"}
c.Add(context.Background(), ids, nil, metadatas, contents)

Check failure on line 495 in collection_test.go

View workflow job for this annotation

GitHub Actions / lint (1.23)

Error return value of `c.Add` is not checked (errcheck)

Check failure on line 495 in collection_test.go

View workflow job for this annotation

GitHub Actions / lint (1.22)

Error return value of `c.Add` is not checked (errcheck)

Check failure on line 495 in collection_test.go

View workflow job for this annotation

GitHub Actions / lint (1.21)

Error return value of `c.Add` is not checked (errcheck)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the returned error should be checked


// Get all documents
docs, err := c.ListDocuments()
if err != nil {
t.Fatal("expected nil, got", err)
}
if len(docs) != 4 {
t.Fatal("expected 4 documents, got", len(docs))
}

// since the documents are returned in a random order, we just check if there is a document with the content "hello world"
for _, doc := range docs {
if doc.Content == "hello world" {
break
}
}
Comment on lines +507 to +511

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the test doesn't assert whether the content was found or not. You can introduce a new variable found := false before the loop, set it found = true just before the break, and after the loop assert that its value is true.

}

func TestCollection_Delete(t *testing.T) {
// Create persistent collection
tmpdir, err := os.MkdirTemp(os.TempDir(), "chromem-test-*")
Expand Down
Loading