feature: bulk transfer adapters #6119
Conversation
Implement the custom adapters feature to allow grouping of artifact transfers via external processes.
Hey, thank you very much for this PR and the proposed implementation of an improved transfer protocol! I think a PR of this size and import deserves a really careful review, so I apologize in advance if it takes me a bit of extra time to read through and try to think through some of the implications.

One initial thought I had, after a first reading of the proposal, is that if I understand correctly it's essentially an expansion of the existing custom transfer protocol in order to permit parallel object transfers. As such, my first inclination would be to try to find a way to expand that protocol directly so that the client and server could negotiate whether they both supported parallel (i.e., "bulk") transfers, and if so use that technique, with the client otherwise falling back to performing sequential transfers. For instance, the initiation event from the client might start with something like:

```json
{ "event": "init", "version": 2.0, "operation": "download", "remote": "origin", "concurrent": true, "concurrenttransfers": 3, "paralleltransfers": true }
```

If the server just responds according to the original custom transfer protocol with an empty JSON structure, then the Git LFS client should assume parallel transfers are not accepted and just use sequential transfers per the original protocol specification. And if the server responds with an error message, then the client should try again with a "basic" initiation message without the extra fields. But if the server responds with, say, an acknowledgement that it supports parallel transfers, then the client could proceed to use them.

That's my first thought, anyway. One thing which I think would be beneficial about this type of scheme is that it's in line with how other client/server communication protocols often work, with clients trying more advanced versions of the protocol first and then falling back to earlier versions (so long as they aren't insecure).
The other benefit I see offhand is that it would mean we probably could avoid introducing a full set of new Git LFS configuration options, since the existing ones should suffice.

As a potential next step, how would you feel about starting with just the proposal document (either in this PR or a new one) and only working on that at first? For instance, in PRs like #3290 and #6021, we've worked with the interested parties to develop an "RFC" (Request for Comments) before starting on any code changes. Of course, once there's general agreement on the design and the proposal PR is merged, it usually changes a little as the initial implementation progresses, because inevitably there are details we can't foresee in advance, but those are typically small corrections to the accepted proposal rather than wholesale revisions.

Thank you again so much for taking the time to put together this PR, and I very much look forward to collaborating on the design and development of this new feature!
Hey Chris! Thank you very much for your detailed feedback! This pull request is indeed to be considered primarily as a proposal, and this functionality should indeed extend custom transfers in the final implementation. Some observations I made during this implementation that I want to highlight:
Regarding the implementation - two possible ways come to mind:
I will write a full proposal for extending the custom transfer protocol in a separate document. Best regards, Alexander Bogomolets.
Excellent, I'm glad you agree! I think it's a good overall design principle for systems to try to use the most efficient/secure/fastest options available, and then step down as necessary if those aren't supported. We can also provide a configuration option to force the use of the original sequential-transfer design, as some users may want to do that for testing or other purposes. Regarding some of your useful notes:
I'd agree that seems less than ideal. While we certainly need (configurable) idle timeouts for any kind of network connection, we don't want to use them to measure whether objects remain to be transferred.
I think we'd be interested to see a proposal for this, certainly, and if it can be separated out from the protocol extensions to support parallel transfers, so much the better!
I confess that when I first replied to your PR, I only had something like your option #1 in mind, with most of the existing sequential transfer process being effectively implemented just as parallel transfers with a maximum parallelism of one. But if it's easier to work incrementally in a series of refactoring steps to get there, using two internal implementations along the way, that's totally understandable.
Fantastic, thank you so much!
This commit submits a proposal document for bulk transfers and modifies the custom-transfers.md documentation to illustrate the protocol and design changes
Thank you so much for writing up the proposal and technical specification for this new "bulk" custom transfer adapter protocol! I'm sorry it's taken me so long to do a thorough review.
I have some specific recommendations, mostly about the names of fields and making sure we aren't repurposing terms like oid and size for different purposes if we can avoid doing so, and indicating which fields are optional.
My main high-level recommendation is that we clarify that parallelism of bulk transfers is possible, but will be controlled by the concurrenttransfers settings just as it is for the basic mode. I'd keep the notion of bulk transfers separate from the concept of multiple, parallel transfer queues.
Finally, I don't mention this in any specific comment, but how do you feel about including a version field? We'd have to think through how the client should behave when dealing with a legacy service which didn't expect a version field, of course.
docs/custom-transfers.md
Outdated
```json
{ "event": "init", "operation": "download", "remote": "origin", "concurrent": true, "concurrenttransfers": 3 }
{ "event": "init", "operation": "download", "remote": "origin", "concurrent": true, "concurrenttransfers": 3, "mode": "basic" }
```
I think we have to allow for the mode field to be optional, with the default being the basic mode, since existing custom API services may not expect that field and return an error response.
And in fact our actual Git LFS client implementation should probably not ever include the "mode": "basic" key/value pair in its initialization messages, since it's going to try with "mode": "bulk" first and then fall back to basic mode, in which case it should resend an initialization message without "mode": "basic" on the assumption that the API service is an older one.
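The fallback behavior described here can be sketched in Go; the `initMessage` struct and `buildInit` helper below are illustrative names (not actual git-lfs types) and only demonstrate that the `mode` field is omitted entirely on the fallback attempt:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// initMessage mirrors the proposed initialization event. Because of
// "omitempty", the mode key disappears entirely when Mode is unset, so a
// legacy API service never sees an unexpected field.
type initMessage struct {
	Event               string `json:"event"`
	Operation           string `json:"operation"`
	Remote              string `json:"remote"`
	Concurrent          bool   `json:"concurrent"`
	ConcurrentTransfers int    `json:"concurrenttransfers"`
	Mode                string `json:"mode,omitempty"`
}

// buildInit returns the serialized init message: "mode": "bulk" on the first
// attempt, and no mode field at all on the fallback attempt.
func buildInit(bulk bool) (string, error) {
	msg := initMessage{
		Event:               "init",
		Operation:           "download",
		Remote:              "origin",
		Concurrent:          true,
		ConcurrentTransfers: 3,
	}
	if bulk {
		msg.Mode = "bulk"
	}
	b, err := json.Marshal(msg)
	return string(b), err
}

func main() {
	first, _ := buildInit(true) // first attempt advertises bulk mode
	fmt.Println(first)
	fallback, _ := buildInit(false) // on error, resend without any mode field
	fmt.Println(fallback)
}
```

Note the sketch never emits `"mode": "basic"` at all, matching the suggestion above.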
docs/custom-transfers.md
Outdated
```json
{ "event": "complete", "oid": "bf3e3e2af9366a3b704ae0c31de5afa64193ebabffde2091936ad2e7510bc03a", "error": { "code": 2, "message": "Explain what happened to this transfer" } }
{ "event": "complete", "oid": "bf3e3e2af9366a3b704ae0c31de5afa64193ebabffde2091936ad2e7510bc03a", "error": { "code": 2, "message": "Explain what happened to this transfer", "retry": false } }
```
Here too we should clarify that the retry field is optional, and that the default is false.
Fortunately, the existing Git LFS client uses the encoding/json package's standard Unmarshal() function to parse the response JSON, and that will just ignore any unknown fields, so older Git LFS clients will be able to interoperate with newer API services that include a retry field in their responses. Likewise, if newer clients include a bool Retry element in the ObjectError type, then it will just default to false if there's no retry field in a JSON error response because the client is communicating with an older API service.
As a note to myself, we apparently have no tests in our t/t-custom-transfers.sh test script which actually exercise the client's handling of JSON error responses from the server, so we should definitely add those as part of any implementation effort of the "bulk" version of the protocol.
docs/custom-transfers.md
Outdated
```json
{ "error": { "code": 32, "message": "Some init failure message" } }
```

Responding this way allows to
Minor typo: this sentence appears to be incomplete.
docs/custom-transfers.md
Outdated
* `concurrenttransfers`: reflects the value of `lfs.concurrenttransfers`, for if
  the transfer process wants to implement its own concurrency and wants to
  respect this setting.
We should add a list entry here for mode as well:
* `mode`: reflects the current effective transfer mode. For the bulk protocol this
  must be set to `bulk`.
docs/custom-transfers.md
Outdated
**Important**: If the number of remaining files is less than the configured
`bulkSize`, git-lfs will automatically flush the pending transfers after a
100ms timeout to ensure timely processing of incomplete bulks.
Let's say a configurable timeout rather than specify 100ms exactly.
This part should be removed. The new proposal text from commit 80499f0 states an issue we should resolve during the proposal implementation: support explicit transfer queue flushes so adapters don't have to assume any transfer queue states.
docs/custom-transfers.md
Outdated
On receiving this message the transfer process should clean up and terminate.
No response is expected.

## Protocol Flow Example
This section is really valuable, thank you! We just need to remember to update it with all relevant changes as the proposal is finalized.
docs/custom-transfers.md
Outdated
For successful completion, it looks like this:

```json
{ "event": "bulk-complete", "oid": "bulk-12345-uuid" }
```
Again, let's use transferID in place of oid here.
This is addressed in commit 80499f0: I opted for a `bid` (batch identifier) field, which is persistent throughout all the messages starting from the header.
docs/custom-transfers.md
Outdated
If there was an error during the bulk transfer, it can be sent as follows:

```json
{ "event": "bulk-complete", "oid": "bulk-12345-uuid", "error": { "code": 500, "message": "Bulk transfer failed due to network error", "retry": false } }
```
Here again, let's use transferID in place of oid.
This is addressed in commit 80499f0: I opted for a `bid` (batch identifier) field, which is persistent throughout all the messages starting from the header.
docs/custom-transfers.md
Outdated
process should send a completion message for that item.

```json
{ "event": "complete", "oid": "22ab5f63670800cc7be06dbed816012b0dc411e774754c7579467d2536a9cf3e", "path": "/tmp/downloaded/file.bin" }
```
I suggest that we include transferID as an additional field (along with oid) in these messages, and also in the error messages below.
This is addressed in commit 80499f0: I opted for a `bid` (batch identifier) field, which is persistent throughout all the messages starting from the header.
### Process to Client (Parallel Processing):

```json
{ }
{"event": "progress", "oid": "bulk-001", "bytesSoFar": 0, "bytesSinceLast": 0}
{"event": "progress", "oid": "bulk-001", "bytesSoFar": 2048, "bytesSinceLast": 2048}
{"event": "complete", "oid": "sha256:file2", "path": "/tmp/file2" }
{"event": "progress", "oid": "bulk-001", "bytesSoFar": 3072, "bytesSinceLast": 1024}
{"event": "complete", "oid": "sha256:file1", "path": "/tmp/file1" }
{"event": "bulk-complete", "oid": "bulk-001" }
```

Note: In the parallel example, file2 completes before file1, demonstrating that
completion order is independent of the order in which items were sent.
Rather than including a "parallel processing" example here, I think it would be useful to show or explain that additional bulk transfers can occur before the final terminate message.
The client could just start with another bulk-header/bulk-footer message pair with a series of individual object upload or download messages between them, and then process the responses, repeating as necessary before finally closing the connection with a terminate message.
As for the concept of parallel processing, I think we should simply explain that any parallelism is controlled by the concurrenttransfers settings exactly as for the basic mode.
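That repeated bulk-header/bulk-footer framing could be sketched as follows; the event names, message shapes, and the `bid` field here follow the draft proposal and are assumptions that may change as it is finalized:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// message is a loose representation of one protocol event; the event names
// and "bid" field are taken from the draft proposal, not a final spec.
type message map[string]any

// sequence frames each group of object IDs in a bulk-header/bulk-footer pair,
// repeating for as many bulks as needed, and closes with a terminate event.
func sequence(bulks [][]string) []message {
	var msgs []message
	for i, oids := range bulks {
		bid := fmt.Sprintf("bulk-%03d", i+1)
		msgs = append(msgs, message{"event": "bulk-header", "bid": bid, "count": len(oids)})
		for _, oid := range oids {
			msgs = append(msgs, message{"event": "download", "oid": oid})
		}
		msgs = append(msgs, message{"event": "bulk-footer", "bid": bid})
	}
	return append(msgs, message{"event": "terminate"})
}

func main() {
	enc := json.NewEncoder(os.Stdout)
	for _, m := range sequence([][]string{{"oid1", "oid2"}, {"oid3"}}) {
		_ = enc.Encode(m) // newline-delimited JSON, one event per line
	}
}
```

Two bulks are sent back to back here, and only then is the connection closed with the terminate message.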
This commit reworks the proposal to align its terminology with Git LFS terms and to address previous review comments. What is done:
1. The proposal text is updated, introducing a naming change (bulk -> batch) and aligning this feature with the LFS Batch API.
2. A custom transfer protocol version is introduced, and the protocol description documentation is updated to reflect it.
3. The batch (bulk) protocol message field naming is corrected to avoid redefining the meaning of the size and oid fields across different messages.
4. The LFS Batch API documentation is updated to reflect the proposed changes supporting server negotiation.
Hey @chrisd8088 ! Thank you for your careful review! I submitted a new version of the proposal that addresses the review comments. To keep it simple for you to review the fresh version, here are the main outlines:
Not all the comments are addressed for now; I will leave comments to keep the review simple. Best regards, Alexander Bogomoletz.
Thanks very much again for all the time and effort you've put into this proposal!
You've raised some very interesting points about how the Git LFS Batch API might also be included in this new design. However, that would increase the scope quite significantly, and given how many other services, products, and projects implement the Batch API, I think we might be better served if we can keep this proposal confined to just the custom transfer adapter.
I've tried to organize my thoughts into a few sections below. I hope they are helpful! We really appreciate how much thought and care you've put into this work.
Initial Proposal Document
One request I'd make is that the first PR we merge on this topic contain just a single file in the docs/proposals directory. That could certainly be a fresh PR, so you can keep experimenting in this one!
The initial proposal document should describe the rationale for the changes, as you've already done in the custom-transfers-bulking.md file. I'd also include some JSON examples to illustrate the specific API changes that are proposed.
We can then merge that file sooner rather than later (since it's just a proposal) and make revisions and updates to it in subsequent PRs, instead of having to pack all those conversations into just this one PR.
If it's useful, please don't hesitate to include a diagram or two! I think GitHub should be able to render MermaidJS diagrams from a Markdown file. For instance, it helps me to think about the relationships involved in a custom transfer like this:
```mermaid
flowchart LR
    client
    api["Batch API server"]
    custom["custom transfer agent"]
    server["custom transfer server"]
    client --> api
    client --> custom
    custom --> server
```
Terminology
I very much appreciate how you've tried to keep our terminology consistent. My one caution at this point is that we try to avoid using the same term in too many contexts.
For instance, I think we have to accept that the existing Batch API only has one "mode", which is "batch mode". I don't think we want to suggest that the Batch API can be used to negotiate a non-batch mode, whether we call that "basic" or anything else.
I also think we want to leave the term "concurrency" out of this proposal as much as possible, since it really just implies parallelism, and we already support that feature, albeit with some limitations and issues (and see the next section below on that point). On reflection, I shouldn't even have suggested the term "parallel transfers" back in my very first comments on this PR.
I've been mulling over some possible terms for the last couple of weeks, and it seems to me that the core change you'd like to introduce is the ability for the Git LFS client to request that a custom transfer agent fetch or push a set or bundle of objects at once, rather than fetching or pushing objects individually.
So one possible term rather than "batch" or "bulk" is "bundle", or, if that doesn't suit, perhaps "set" or "group" or "pack".
The new version of the custom transfer protocol's initialization message might then include, along with a "version": 2.0 field, a "bundleSize" field whose default value is assumed to be 1 for the legacy version of the protocol. A newer Git LFS client would then send an initialization message like:

```json
{ "event": "init", "version": 2.0,
  "operation": "download", "remote": "origin",
  "concurrent": true, "concurrenttransfers": 3,
  "bundleSize": 100 }
```

The custom transfer agent would then expect to receive messages from the Git LFS client about sets or "bundles" of Git LFS objects, with a maximum of 100 objects per bundle. (There could of course be fewer objects in a bundle than the maximum, if there weren't 100 objects left to be transferred, or retries needed to occur, etc.)
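The bundling behavior amounts to chunking the pending objects into groups of at most `bundleSize`; here is a minimal sketch (the `bundle` helper is a hypothetical name, not existing git-lfs code):

```go
package main

import "fmt"

// bundle splits pending object IDs into groups of at most max items; a bundle
// may be smaller than the maximum when fewer objects remain. A max below 1 is
// treated as 1, matching the legacy one-object-at-a-time behavior.
func bundle(oids []string, max int) [][]string {
	if max < 1 {
		max = 1
	}
	var out [][]string
	for len(oids) > 0 {
		n := max
		if len(oids) < n {
			n = len(oids)
		}
		out = append(out, oids[:n])
		oids = oids[n:]
	}
	return out
}

func main() {
	// Five pending objects with a bundleSize of 2 yield bundles of 2, 2, and 1.
	fmt.Println(bundle([]string{"a", "b", "c", "d", "e"}, 2))
}
```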
Existing Implementation Issues
I thought I would take the opportunity to describe what I believe is a bit of an oversight in our existing implementation with regards to concurrency.
To be clear, I don't think we need to discuss this issue in the proposal, since that's about transferring bundles or sets of objects instead of single objects, rather than concurrency or parallelism. (After all, even if concurrency is disabled via configuration options, a custom transfer agent could still transfer objects in bundles, so these are separate concepts.)
Nevertheless, I wanted to mention this issue because it may have occurred to you as well.
As you know, the lfs.concurrentTransfers option specifies the number of custom adapter processes that we start (unless the lfs.customTransfer.<name>.concurrent flag is set to false, in which case we only start one process). We also start one internal goroutine (i.e., thread) to communicate with each process.
This all occurs within the Begin() method of the adapterbase structure, where we loop as many times as is specified by the lfs.concurrentTransfers option, or the value with which we've overridden it. This loop calls the WorkerStarting() method of the customAdapter structure because we've set that structure's `transferImpl` element to point back to itself.
The WorkerStarting() method, in addition to starting one custom transfer agent process, then sends an initialization message to the process.
That all makes sense, but where I think we have an oversight is that the initialization message includes concurrent and concurrenttransfers elements, and we've documented that we include the latter specifically in case "the transfer process wants to implement its own concurrency and wants to respect this setting".
But there's no way for the process to "respect" the concurrenttransfers setting since we've already started that many separate processes, and each one just gets a copy of the same initialization message.
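To make the oversight concrete, here is a highly simplified sketch of that startup shape; `startWorkers` is a hypothetical stand-in, not the actual Begin()/WorkerStarting() code:

```go
package main

import "fmt"

// startWorkers mimics the startup shape described above: one agent process is
// "started" per configured concurrent transfer, and every agent receives an
// identical init message carrying the same concurrency settings, so no single
// agent can meaningfully act on them itself.
func startWorkers(concurrentTransfers int) []string {
	initMsg := fmt.Sprintf(`{"event":"init","concurrent":true,"concurrenttransfers":%d}`, concurrentTransfers)
	msgs := make([]string, 0, concurrentTransfers)
	for i := 0; i < concurrentTransfers; i++ {
		// Each spawned agent process gets a copy of the same message.
		msgs = append(msgs, initMsg)
	}
	return msgs
}

func main() {
	for _, m := range startWorkers(3) {
		fmt.Println(m)
	}
}
```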
Arguably, the Git LFS client should either start a single process, pass it the concurrency settings, and let it figure out how many child processes to spawn, or start multiple processes per the concurrency settings but then not pass those settings on because there's no purpose to doing so.
However, we've had this design and implementation for many years, so we can't really reverse course now. That said, perhaps we can find a way to improve the situation as part of the "version": 2.0 revisions to the custom transfer protocol.
Other Issues
As a final note, I think your comments about the idle timeout and the lack of retry handling in the existing custom transfer adapter are worth noting in the proposal document as well, if only as concerns to be addressed later.
* `supported_transfer_concurrency_modes` - Optional Array of String identifiers for
  transfer concurrency modes that the client supports. If omitted, the server MUST
  assume that only the `basic` transfer concurrency mode is supported.
* `preferred_transfer_concurrency_mode` - Optional String identifier for the transfer
  concurrency mode that the client prefers. If omitted, the server MUST assume the
  `basic` transfer concurrency mode is preferred.
If we're amending and extending the Batch API, I think we have to consider how these modes would apply (or not apply) to existing transfer adapters, especially the HTTP and SSH adapters.
It's worth remembering here that while custom transfer adapters are a feature of this project, they are labelled as "experimental" in at least one list in our documentation, and even if we consider that to be outdated, I think we can assume that a large percentage of the other projects which implement the Batch API don't also support custom transfer adapters.
The Batch API, on the other hand, is going to be common to a lot of other projects and many public and private services, so before we revise the API in any way, we should be very clear about those changes. For instance, what does it mean for a client to advertise to an HTTP-based Git LFS server that it supports a "bulk" or "batch" mode? Does it expect multiple Git LFS objects to be packed together in some format and retrieved via a single HTTP request?
Hey git lfs team, hope you're doing great!
This PR introduces a new Bulk Transfer Protocol mechanism that complements the existing custom transfer mechanism by allowing multiple files to be processed simultaneously in bulk. This feature allows different transfer planning strategies (compared to the current sequential one-artifact-per-process strategy in custom transfers) for new transfer implementations.
This is important for repositories with many LFS objects of varying file sizes. The new mechanism should allow more advanced transfer techniques, like data deduplication across multiple files or multi-file packing, to be implemented more easily than with custom transfers.
PR contains protocol specification in docs/bulk-transfers.md and the draft implementation in bulk.go.
Also the test infrastructure:
I am not very experienced with Go programming, so I used AI assistance while developing this code. I am not entirely sure about the quality of this pull request, though I tried my best to follow the patterns I saw in custom.go.
Best regards, Alexander Bogomolets.