POC of Version Store with pluggable backend#586
Conversation
First implementation in S3
|
@pablojim awesome - I'll take a look this week! |
|
General implementation notes:
Random thoughts/possibilities for improvements:
|
| segment_keys = version['segment_keys'] | ||
| assert len(segment_keys) == 1, "should only be one segment for parquet" | ||
| # TODO this is S3 functionality bleeding out of the backing store. | ||
| # Currently reading a Pandas dataframe from a parquet bytes array fails as it only takes a file path. |
There was a problem hiding this comment.
| return sorted(dirs, key=mtime) | ||
|
|
||
| def delete_symbol(self, library_name, symbol): | ||
| """Soft deletes a symbol - no data is removed, snapshots still work. |
There was a problem hiding this comment.
Snapshots can be done with hard-links which would make deletions safe
|
Quite a nice bit of work - any comments on performance of the implementations? |
|
@jamesblackburn There would also be large improvements due to being able to load partial frames e.g. only loading selected columns and row groups. This may help cases such as #609. Still some work and implementation decisions to do though. |
| " Does it exist and is versioning enabled?".format(bucket_name)) | ||
|
|
||
|
|
||
| class S3KeyValueStore(object): |
There was a problem hiding this comment.
Would it make sense to have an abstract class KeyValueStore which has the methods for the KV api
and then have e.g. S3KeyValueStore as one concrete implementation maybe in a separate file ?
There was a problem hiding this comment.
So if e.g. someone wants to implement a KV store on different backing storage, could simply extend the KeyValueStore class and not the S3-specific one ?
| version['segment_count'] = len(segment_keys) # on appends this value is incorrect but is updated later on | ||
| version['append_size'] = 0 | ||
| version['append_count'] = 0 | ||
| version['segment_keys'] = segment_keys |
There was a problem hiding this comment.
aha, this is the forward-pointer implementation, where the version keeps all the segment keys, cool
| previous_version = self._backing_store.read_version(self.library_name, symbol) | ||
|
|
||
| handler = self._write_handler(version, symbol, data, **kwargs) | ||
| handler.write(self._backing_store, self.library_name, version, symbol, data, previous_version, **kwargs) |
There was a problem hiding this comment.
Would be nice to think about decoupling further and have:
- the VersionStore top level api (looks great, as shown here)
- a read/write handler being a composite of:
- version metadata handling. Metadata to be attached on the version are returned in the form of a dict byt he handler. The handler is not aware of version/version documents, but only handler-specific metadata. The write handler write() expects metadata to be passed, and returns to the to top-level version new metadata, to be attached in the version document.
- serialization handler. Can be numpy recarry serializer, Arrow serializer, anything.
- segmentation policy. How to segment data, size of segments (upper size bound probably dictated by the backing store)
- compression handler
- As you already have in the code, a backing_store handler, which is responsible for writing individual segments + associated per-segment metadata at the underlying storage.
Having such a well-separated model, one may create custom implementation of individual handler for serialization/segmentation/compression/backing_store.
The logic of e.g. converting a numpy array to a rec array, segmenting, producing byte arrays and finally write chunks, is now very integrated and hard to add different/new implementations.
There was a problem hiding this comment.
I thought about this approach. One worry is that it becomes very complicated with an explosion of possible interactions between the different parts which may or may not make sense e.g. mongodb metadata with parquet serialisation with 2Mb chunking on a S3 backend etc. etc.
Each library would then have to be configured with a particular config to work correctly. I think it might be an abstraction too far.
|
@pablojim Shame to let this bitrot, can we discuss later this week with @willdealtry, @shashank88 |
Yeah, this seems pretty good, will go through it tonight. Have fixed the merge conflict. Will see if the tests are fine |
|
If we don’t think this is prod ready or a feature we want to support long term. Maybe we can segregate it is an example and make sure our API allows this sort of flexibility. Sent with GitHawk |
Apart from some dependencies it is completely isolated from the rest of Arctic. It's all in the "pluggable" package and duplicates some code from the main APIs. One option would be to merge it but mark it in the code and documentation as Beta until it is deemed ready for wider use. |
|
Will move to a contrib directory to unblock this PR, without committing this to be used as a production store. |
First implementation of Version Store with pluggable backends. POC in S3 with read, write, soft delete & snapshots using the existing VersionStore chunking and serialisation mechanisms. Append and hard deletes are not implemented.
This implementation stands alone and has no effect on existing functionality. Contains lots of code duplication form the existing functionality. Has limited error checking and cleanup functionality. This PR is mostly for discussion purposes at this point.