Large set of files to transfer to client on other machine causes failing clients #33

Description

@krichter722

After setting up a Coda server and connecting clients on both the server machine and a second machine (two machines on my LAN in total), I'm seeing the following behaviour:

  • Transferring data from ZFS to /coda/[realm] with rsync, the transfer rate is good (40 MB/s) for the first few GB of data. Then the transfer stalls: the next file is only transferred after ~10 minutes, sometimes the stall lasts for hours, and sometimes there is no progress overnight. venus on the server machine has
12:43:36 fatal error -- Recov_LoadRDS: heap mismatch (0x50000000, d0488000) vs (0x50000000, 2d0488000)
Assertion failed: 0, file "venusrecov.cc", line 519
***BackTrace***
/usr/sbin/venus(coda_assert+0x76)[0x562f976a5a66]
/usr/sbin/venus(_Z5chokePKciS0_z+0xc8)[0x562f97664428]
/usr/sbin/venus(_Z9RecovInitv+0x335)[0x562f976617f5]
/usr/sbin/venus(main+0x332)[0x562f976305d2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fc1739433f1]
/usr/sbin/venus(_start+0x2a)[0x562f976328fa]
Sleeping forever.  You may use gdb to attach to process 16261.

in the logs (the process has no backtrace in gdb).

  • venus randomly crashes on both machines due to
[42079.451522] coda: venus_pioctl: Venus returns: -22 for (00000001.ffffffff.fffffffc.00000000)
[42083.097729] coda: Unexpected interruption.
[42083.097735] coda: venus_pioctl: Venus returns: -4 for (00000004.01000001.00000001.00000001)

(dmesg displays coda: Venus dead, not sending upcall)

  • venus on the server machine furthermore crashes due to
13:21:29 fatal error -- fsobj::dir_Create: (.o6MxU0L3M1pnfonh4qr63fXobG5b2KtU,K04npr-Ldt841.Pfm1is, 2.7f000000.fffffffe.80f4f) Create failed 27!
13:21:30 RecovTerminate: dirty shutdown (1 uncommitted transactions)
Assertion failed: 0, file "fso_dir.cc", line 98
***BackTrace***
venus(coda_assert+0x76)[0x560019991a66]
venus(_Z5chokePKciS0_z+0xc8)[0x560019950428]
venus(_ZN5fsobj10dir_CreateEPKcP8VenusFid+0x12b)[0x56001993ae2b]
venus(_ZN5fsobj11LocalCreateEjPS_Pcjt+0x23)[0x5600199365e3]
venus(_ZN5fsobj18DisconnectedCreateEjjPPS_Pctii+0x291)[0x560019936951]
venus(_ZN5fsobj6CreateEPcPPS_jti+0x50)[0x560019936a00]
venus(_ZN5vproc6createEP11venus_cnodePcP10coda_vattriiS1_+0x2af)[0x56001997404f]
venus(_ZN6worker4mainEv+0x91d)[0x56001991dedd]
venus(_Z13VprocPreamblePv+0xbe)[0x56001996e0ae]
/usr/lib/coda/liblwp.so.2(+0x5d7c)[0x7f38193b2d7c]
/lib/x86_64-linux-gnu/libc.so.6(+0x357f0)[0x7f381876f7f0]
/lib/x86_64-linux-gnu/libc.so.6(sigsuspend+0x16)[0x7f381876fb26]
/usr/lib/coda/liblwp.so.2(lwp_makecontext+0x124)[0x7f38193b2f04]

which is not the same incident, given the time gap, but is probably related. I/O operations with rsync on the server side then take forever (> 30 minutes without progress and without any noticeable I/O in iotop).

  • On the client machine I'm getting I/O errors and
[ W(13) : 0000 : 15:37:27 ] fsobj::TryToCover: vdb::Get(#@.Trash) failed (110)
[ W(13) : 0000 : 15:37:27 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.7>)
[ W(13) : 0000 : 15:37:27 ] Allowing access to stale status! (key = <1.ff000001.1.1>)
[ W(13) : 0000 : 15:37:27 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.8>)
[ W(13) : 0000 : 15:37:27 ] fsobj::TryToCover: vdb::Get(#@.Trash-1000) failed (110)
[ W(13) : 0000 : 15:37:27 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.8>)
[ W(13) : 0000 : 15:37:46 ] Allowing access to stale status! (key = <1.ff000001.1.1>)
[ W(13) : 0000 : 15:37:46 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.7>)
[ W(13) : 0000 : 15:37:46 ] fsobj::TryToCover: vdb::Get(#@.Trash) failed (110)
[ W(13) : 0000 : 15:37:46 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.7>)
[ W(13) : 0000 : 15:37:46 ] Allowing access to stale status! (key = <1.ff000001.1.1>)
[ W(13) : 0000 : 15:37:46 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.8>)
[ W(13) : 0000 : 15:37:46 ] fsobj::TryToCover: vdb::Get(#@.Trash-1000) failed (110)
[ W(13) : 0000 : 15:37:46 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.8>)
[ W(13) : 0000 : 15:37:47 ] Allowing access to stale status! (key = <1.ff000001.1.1>)
[ W(13) : 0000 : 15:37:47 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.7>)
[ W(13) : 0000 : 15:37:47 ] fsobj::TryToCover: vdb::Get(#@.Trash) failed (110)
[ W(13) : 0000 : 15:37:47 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.7>)
[ W(13) : 0000 : 15:37:47 ] Allowing access to stale status! (key = <1.ff000001.1.1>)
[ W(13) : 0000 : 15:37:47 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.8>)
[ W(13) : 0000 : 15:37:47 ] fsobj::TryToCover: vdb::Get(#@.Trash-1000) failed (110)
[ W(13) : 0000 : 15:37:47 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.8>)
[ W(13) : 0000 : 15:37:47 ] Cachefile::SetLength 4096

in venus.log for some files only.

  • As soon as venus.log contains ***LWP (0x55dc733053c0): Select returns error: 4, the installation seems to be impossible to recover. I got into this state the last two times I tried to get Coda running, and I'm in it again now.
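
Several of the messages above carry bare numeric codes (-22, -4, 27, 110, 4). A minimal sketch for decoding them, assuming they are plain Linux errno values (the kernel module reports them negated); Coda may map some codes differently, so treat the names as a hint rather than a diagnosis. The `CODES` table and the `describe` helper are made up for illustration.

```python
import errno
import os

# Numeric codes copied from the messages above; the keys are only display labels.
CODES = {
    "venus_pioctl: Venus returns: -22": 22,
    "venus_pioctl: Venus returns: -4": 4,
    "fsobj::dir_Create: Create failed 27": 27,
    "vdb::Get(...) failed (110)": 110,
    "Select returns error: 4": 4,
}


def describe(code: int) -> str:
    """Return 'NAME (description)' for an errno value on the current platform."""
    name = errno.errorcode.get(code, "unknown")
    return f"{name} ({os.strerror(code)})"


for label, code in CODES.items():
    print(f"{label}: {describe(code)}")
```

On Linux these correspond to EINVAL (22), EINTR (4), EFBIG (27) and ETIMEDOUT (110). If the assumption holds, the 110 on the client would fit a server that has stopped answering, and the 27 would point at a size limit being hit during the create.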

These issues might be separate or connected. I'll split them into separate reports if you can give me a criterion for separating them. It's just very hard to understand what's going on when crashes are caused by terse, hard-to-understand assertion failures.
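
As an example of reading one of these assertion messages more closely: the two heap lengths in the Recov_LoadRDS line differ only above bit 31. The sketch below checks that arithmetic, assuming both lengths are printed in hexadecimal (the second one has no 0x prefix in the log). Which value is the recorded one and which is the expected one is not clear from the message, so the variable names are neutral placeholders.

```python
# Values copied from the "heap mismatch" message; both are assumed to be hex.
first_len = 0xD0488000
second_len = 0x2D0488000

MASK_32 = 0xFFFFFFFF  # keeps only the low 32 bits

# True if the second length, cut down to 32 bits, equals the first one.
low_bits_match = (second_len & MASK_32) == first_len
difference = second_len - first_len

print(f"low 32 bits match: {low_bits_match}")
print(f"difference: {difference:#x} ({difference / 2**30:.0f} GiB)")
```

The low 32 bits match and the difference is exactly 0x200000000 (8 GiB). That would be consistent with a length being truncated to 32 bits somewhere, but it is only an observation about the numbers, not a confirmed cause.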

I noticed that venus is started with a delay of some seconds. This is unrelated to the coda-client systemd unit, because that unit is stopped. The delayed start might interfere with a venus -init, which sometimes restores responsiveness of the client after a reboot but causes data loss.

Experienced with 6.11.2-1+ubuntu16.10 on Ubuntu 16.10 amd64.
