Rework backup & restore #1760
sadiqkhoja left a comment:
Haven't finished the review, but here are my current comments/questions
I tried running this PR locally and here are my observations:
but I got the following message at the end: So I assumed all had worked out fine. I started my server and couldn't log in. When I checked the DB, all of the tables were empty; the tables had actually been created by knex migrations on server start. My understanding is that my local database was in terrible shape when I took the backup, and now the restore is causing constraint failures. My concern is around the messaging: in those cases we shouldn't claim success.
```javascript
    stdio: ['pipe', 'pipe', 'inherit'],
  },
);
pipeline(reconstitutedInput, spawned.stdin);
```
should .catch be added here?
If anything goes south with either of these two processes, we'll already hear about it, because the process exit statuses are checked.
without `.catch` (this PR) when the target database is missing:

```
% node lib/bin/restore.js ~/jubiliantApril13
restoreBackupFromRestoreStream
psql: error: connection to server at "localhost" (::1), port 5432 failed: FATAL: database "jubilant" does not exist
node:internal/process/promises:394
    triggerUncaughtException(err, true /* fromPromise */);
    ^

Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:87:19) {
  errno: -32,
  code: 'EPIPE',
  syscall: 'write'
}

Node.js v24.13.0
```

with `.catch(e => console.log('error occurred'))` (modifying the PR):

```
% node lib/bin/restore.js ~/jubiliantApril13
psql: error: connection to server at "localhost" (::1), port 5432 failed: FATAL: database "jubilant" does not exist
error occurred
Error: process exited with code: 2, spawnfile: psql, args: psql,--no-password,--no-psqlrc,--quiet,--echo-errors
    at awaitSpawnee (/Users/sadiq/Repos/central-backend/lib/util/process.js:8:17)
    at process.processTicksAndRejections (node:internal/process/task_queues:103:5)
    at async Promise.all (index 1)
    at async restoreFromEncryptedPgDumpStream (/Users/sadiq/Repos/central-backend/lib/bin/restore.js:46:3)
    at async main (/Users/sadiq/Repos/central-backend/lib/bin/restore.js:79:5) {
  exitcode: 2
}
```
@brontolosone, regression testing for v2026.1 will be starting soon, so this PR probably needs to get merged soon if it's to make it into the release. I'm thinking that it'd also be OK to ship with .2.
@brontolosone and I touched base today and talked a little bit about this. I advocated for writing the backup to disk and then sending it over the wire, to avoid any case where the backup might fail/truncate but the response code would be 200. We also agreed to get this into .2.
This comment contains 3 claims.

First claim: "…"

Right off the bat: this is expected; HTTP is not transactional this way. Responses may be interrupted for any reason (server crashes, someone pulls the plug on the router, ...). This goes for the whole API, not just the /backup endpoint. As I currently understand it, ideally we'd abort the TCP connection with an RST, which should result in a network error from the clientside perspective, and then the client can take appropriate action. But it seems Node can't actually be made to do that.

That said, on to the specifics of how you made …

Side note: this makes it impossible to start the server the docker-way or the …

Anyway. I can't repro exactly what you did, for want of detail, but at least when I paraphrase your act of sabotage, this has the intended effect of making …, which leaves us an interrupted stream, evidenced by:

```
curl -H 'Accept: application/json, text/plain, */*' -H 'Authorization: Bearer blabla' -X POST http://127.0.0.1:8383/v1/backup | openssl enc -d -chacha20 -pbkdf2 -pass pass:'' | pg_restore -f /dev/null
```

pg_restore exits nonzero and reports …

So, my claim is that in order to verify one's backups, one needs to run something similar to the above, for which I intend to write something in the docs. For the specific DB permissions sabotage from your scenario, and some similar scenarios, we could detect early whether the backup has a chance of succeeding. For instance, we could do a test …

There's one amendment that would slightly improve this situation (but not without downsides, listed below), which I discussed with @lognaturel last week: if we would spool the response to disk first, we can serve it out with a …
All these were downsides/problems in the previous backup implementation, too. But note that that implementation didn't have the …
Second claim from the comment: "I created a (broken) backup that upon restore, broke my database"

How you end up with a broken backup, and the extent to which we can/should do anything about that, has been discussed here. I cannot reproduce this. If I take my broken backup created here and restore it, I get: …

And since everything's wrapped in a transaction, everything done up to the point where the error occurs is rolled back.

Can you revisit your repro?
I'm currently looking into, in the event of process errors, making chunked transport encoding not send the finalizing 0-length chunk header, to signal a broken stream. That's probably what we want as it would give an in-http client-detectable signal that an error has occurred. |
See 120a812, which makes pg_dump errors manifest as an error in the node-stream, resulting in a prematurely ended chunked-transport-encoded http stream, which is client-detectable; eg …

I think this is a reasonable middle ground? "don't use the backup if your http library tells you the download was incomplete"? I can put it in the docs.

@lognaturel would you still like to have a mutex on concurrent backup runs, even though we're not using disk space? It still takes server CPU resources of course, but so does everything else.
Claim 3:
I can't repro that 😿
Sadly, nginx as a reverse proxy un-chunks it, and apparently isn't alarmed by the short read. I'm going to have a look to see if that can be amended.
easily solved by adding …
Even if you prep the whole file serverside beforehand, while streaming it out, the stream might still break halfway. Chunked transfer encoding is the answer. And once you have that, then why entertain the downsides of using temp storage — just stream all the way. |
No, let's not. |
Related: the documentation amendments and additions that go with this PR are in getodk/docs#2052. They may give the reviewer yet more context :-)
In my case, the owner of … Error: …

So two things went wrong: 1) I was using a user with insufficient privilege 2) a different pg_dump version (maybe it's okay to use pg_dump 17 🤷‍♂️). I agree with your side note about the docker-way, so both of these issues won't be happening in the real world, or if they do then we don't really need to support them. I am sharing these details in case you are curious about them. For me, none of this is so important. My concern was around informing clients about the failure, and the explanation you provided around HTTP is satisfactory to me.

Now, the solution you have proposed of sending a null chunk to signal a broken stream is perfect, but I couldn't verify it locally; maybe I'm doing something wrong (should we set up a call?):

Curl never terminates in case of backend error

In the nginx logs, I see: …

nginx version: …

central-frontend/nginx-conf/main.conf diff: …

Curl version: …

Repro of the second issue reported earlier:
output.log

Repro of the third issue (`% node lib/bin/restore.js ~/freshbkOnMasterWithPP.zip '123123'`): if I put a wrong passphrase then I see …

Upon further digging, it turned out that the original backup was created using pg_dump 17.5, and when I ran …
Issue 1: during backup, any errors should be clientside detectable

We seem to consider this solved now then! You say:
❗ Note that it works the opposite way: to signal a broken stream, you end it without sending a null chunk. To see it in action you could also take a look at this test of getodk/central#1737.

Issue 2: not all errors result in non-success status, or at least, we find a success message printed even while we also find errors printed

I couldn't repro getting a success message under error conditions, but I did change things around to explicitly catch a stream error occurring "in node" when things go sideways (eg, in pg_restore when its target DB doesn't exist, or restoring an archive made with pg_dump18 using pg_dump14). The crux of it, to my understanding: with these Node streams, it's not really actually …

Anyhow, with some amendments, instead of just letting things run their course and counting on the error messages being noted, we now complain more loudly and visibly to the user, as not every user might monitor the exit status (or the chatter on stderr) when they run …

Unless there's still some unmonitored stream somewhere, I think you should now be seeing either the success message or an ominous "DATABASE RESTORE FAILED" message...
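The reporting pattern described above can be sketched roughly as follows. The function name is hypothetical and not taken from `restore.js`: one top-level catch that prints an unmissable failure banner and yields a nonzero exit code, instead of trusting the user to notice stderr chatter.

```javascript
// Hypothetical sketch of top-level failure reporting: run the restore,
// then translate the outcome into one clear message and an exit code.
const reportRestoreOutcome = async (restore) => {
  try {
    await restore();
    console.log('Restore completed.');
    return 0;
  } catch (err) {
    // One loud, unmissable message plus a nonzero exit code; any stderr
    // output from psql/pg_restore has already been printed above it.
    console.error(`DATABASE RESTORE FAILED: ${err.message}`);
    return 1;
  }
};
```

A CLI entry point would then do something like `process.exitCode = await reportRestoreOutcome(...)`.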
- Retire zip-based DB backup format
- Replace with end-to-end streamable format (getodk/central#1646)
- For that new format, use a revised restore procedure (getodk/central#1646)
Co-authored-by: Sadiq Khoja <sadiqkhoja@gmail.com>
…a prematurely ended chunked-transport-encoded stream, which are client-detectable
Closes getodk/central#1646
Closes getodk/central#1639
Related: getodk/central#1736
Related: getodk/docs#2052
Implements a new backup format, and a new restore routine, per the proposal of getodk/central#1646 .
Notable:
- … the `backup-restore.sh` e2e test. So the restore from that format through `lib/bin/restore.js` is not automatically tested anymore. I tested it manually 🤷‍♂️
- In `lib/bin/restore.js`, content sniffing is used to distinguish between the old and new formats.
- `restore.js` was writing to stdout yet actually was expected to return a task result or something to then be processed, or something; it all didn't make sense anymore and got in the way of actually effecting a process exit code, so I reorganized a bit there too. Throwing "Problem"s also didn't make sense to me here, nor does printing json `{ "success": true }` to a terminal user; it felt like too many layers of old paint.
- I looked into the `pg_dump` binary format and it seems that sections are prefixed with sizes (thus pg_restore would notice a short read), and confirmed this with an experiment in which I tested lopping off all possible truncations (1…file length) from a test pgdump to see if there is a way for pg_restore to accept it as valid even though it's truncated. There doesn't seem to be, which is nice for a backup format 👍, and very nice for us since it means the backups are verifiable, all the more important because they're streamed out and users may have half-a-downloaded-dump upon, for instance, a network error. The docs will contain info on how to verify backups.

What has been done to verify that this works as intended?
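As an illustration of the content-sniffing idea, here is a sketch; this is not the PR's actual detection code, and it assumes the sniffed bytes are read from the start of the (decrypted) input: legacy backups were zip archives, which begin with the bytes `PK\x03\x04`, while pg_dump's custom-format archives begin with the magic string `PGDMP`, so the first few bytes are enough to tell the formats apart.

```javascript
// Sketch only: classify a backup by its leading bytes. The zip and pg_dump
// custom-format magic numbers are standard; how restore.js actually sniffs
// the stream is not shown here.
const ZIP_MAGIC = Buffer.from([0x50, 0x4b, 0x03, 0x04]); // "PK\x03\x04"
const PGDMP_MAGIC = Buffer.from('PGDMP');

const sniffBackupFormat = (head) => {
  if (head.subarray(0, PGDMP_MAGIC.length).equals(PGDMP_MAGIC)) return 'pg_dump-custom';
  if (head.subarray(0, ZIP_MAGIC.length).equals(ZIP_MAGIC)) return 'legacy-zip';
  return 'unknown';
};
```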
Tests
Why is this the best possible solution? Were any other approaches considered?
See getodk/central#1646
How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?
Someone could be counting on, or have automated mitigations for, the legacy idiosyncratic (baffling and dangerous, under certain conditions) restore procedure, see getodk/central#1646 . So someone out there may be surprised that the restored database content actually ends up in the configured database rather than the database as recorded internally in the dump (which only incidentally could be the same as the name of the configured database).
Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.
Not the API docs, but tracking desire for user-facing documentation updates here.
Before submitting this PR, please make sure you have:
- run `make test` and confirmed all checks still pass, or witnessed GitHub completing all checks with success