Large set of files to transfer to client on other machine causes failing clients

After setting up a Coda server and client connections on both the server machine and the client machine (two machines on my LAN), I'm seeing the following behaviour:

  * When transferring data from ZFS to `/coda/[realm]` with `rsync`, the transfer rate is good (40 MB/s) for the first few GB of data. Then the transfer stalls: the next file is only transferred after ~10 minutes, sometimes the stall lasts for hours, and sometimes there is no progress overnight. `venus` on the server machine has

```
12:43:36 fatal error -- Recov_LoadRDS: heap mismatch (0x50000000, d0488000) vs (0x50000000, 2d0488000)
Assertion failed: 0, file "venusrecov.cc", line 519
***BackTrace***
/usr/sbin/venus(coda_assert+0x76)[0x562f976a5a66]
/usr/sbin/venus(_Z5chokePKciS0_z+0xc8)[0x562f97664428]
/usr/sbin/venus(_Z9RecovInitv+0x335)[0x562f976617f5]
/usr/sbin/venus(main+0x332)[0x562f976305d2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fc1739433f1]
/usr/sbin/venus(_start+0x2a)[0x562f976328fa]
Sleeping forever.  You may use gdb to attach to process 16261.
```

in the logs (the process has no backtrace in `gdb`).
  * `venus` randomly crashes on both machines due to 

```
[42079.451522] coda: venus_pioctl: Venus returns: -22 for (00000001.ffffffff.fffffffc.00000000)
[42083.097729] coda: Unexpected interruption.
[42083.097735] coda: venus_pioctl: Venus returns: -4 for (00000004.01000001.00000001.00000001)
```

(`dmesg` displays `coda: Venus dead, not sending upcall`)

  * `venus` on the server machine furthermore crashes due to

```
13:21:29 fatal error -- fsobj::dir_Create: (.o6MxU0L3M1pnfonh4qr63fXobG5b2KtU,K04npr-Ldt841.Pfm1is, 2.7f000000.fffffffe.80f4f) Create failed 27!
13:21:30 RecovTerminate: dirty shutdown (1 uncommitted transactions)
Assertion failed: 0, file "fso_dir.cc", line 98
***BackTrace***
venus(coda_assert+0x76)[0x560019991a66]
venus(_Z5chokePKciS0_z+0xc8)[0x560019950428]
venus(_ZN5fsobj10dir_CreateEPKcP8VenusFid+0x12b)[0x56001993ae2b]
venus(_ZN5fsobj11LocalCreateEjPS_Pcjt+0x23)[0x5600199365e3]
venus(_ZN5fsobj18DisconnectedCreateEjjPPS_Pctii+0x291)[0x560019936951]
venus(_ZN5fsobj6CreateEPcPPS_jti+0x50)[0x560019936a00]
venus(_ZN5vproc6createEP11venus_cnodePcP10coda_vattriiS1_+0x2af)[0x56001997404f]
venus(_ZN6worker4mainEv+0x91d)[0x56001991dedd]
venus(_Z13VprocPreamblePv+0xbe)[0x56001996e0ae]
/usr/lib/coda/liblwp.so.2(+0x5d7c)[0x7f38193b2d7c]
/lib/x86_64-linux-gnu/libc.so.6(+0x357f0)[0x7f381876f7f0]
/lib/x86_64-linux-gnu/libc.so.6(sigsuspend+0x16)[0x7f381876fb26]
/usr/lib/coda/liblwp.so.2(lwp_makecontext+0x124)[0x7f38193b2f04]
```

which is not the same incident, given the difference in time, but is probably related. I/O operations with `rsync` on the server side then take forever (> 30 minutes without progress and without any noticeable I/O in `iotop`).
  * On the client machine I'm getting I/O errors and

```
[ W(13) : 0000 : 15:37:27 ] fsobj::TryToCover: vdb::Get(#@.Trash) failed (110)
[ W(13) : 0000 : 15:37:27 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.7>)
[ W(13) : 0000 : 15:37:27 ] Allowing access to stale status! (key = <1.ff000001.1.1>)
[ W(13) : 0000 : 15:37:27 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.8>)
[ W(13) : 0000 : 15:37:27 ] fsobj::TryToCover: vdb::Get(#@.Trash-1000) failed (110)
[ W(13) : 0000 : 15:37:27 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.8>)
[ W(13) : 0000 : 15:37:46 ] Allowing access to stale status! (key = <1.ff000001.1.1>)
[ W(13) : 0000 : 15:37:46 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.7>)
[ W(13) : 0000 : 15:37:46 ] fsobj::TryToCover: vdb::Get(#@.Trash) failed (110)
[ W(13) : 0000 : 15:37:46 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.7>)
[ W(13) : 0000 : 15:37:46 ] Allowing access to stale status! (key = <1.ff000001.1.1>)
[ W(13) : 0000 : 15:37:46 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.8>)
[ W(13) : 0000 : 15:37:46 ] fsobj::TryToCover: vdb::Get(#@.Trash-1000) failed (110)
[ W(13) : 0000 : 15:37:46 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.8>)
[ W(13) : 0000 : 15:37:47 ] Allowing access to stale status! (key = <1.ff000001.1.1>)
[ W(13) : 0000 : 15:37:47 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.7>)
[ W(13) : 0000 : 15:37:47 ] fsobj::TryToCover: vdb::Get(#@.Trash) failed (110)
[ W(13) : 0000 : 15:37:47 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.7>)
[ W(13) : 0000 : 15:37:47 ] Allowing access to stale status! (key = <1.ff000001.1.1>)
[ W(13) : 0000 : 15:37:47 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.8>)
[ W(13) : 0000 : 15:37:47 ] fsobj::TryToCover: vdb::Get(#@.Trash-1000) failed (110)
[ W(13) : 0000 : 15:37:47 ] Allowing access to stale status! (key = <1.ff000001.fffffffc.8>)
[ W(13) : 0000 : 15:37:47 ] Cachefile::SetLength 4096
```

in `venus.log` for some files only.
  * As soon as `venus.log` contains `***LWP (0x55dc733053c0): Select returns error: 4`, the installation seems to be impossible to recover. I got into this state the last two times I tried to get Coda running, and I'm in it again now.
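
The numeric codes in the excerpts above (`-22`, `-4`, `27`, `110`, and the `4` in the `Select returns error` line) may be easier to compare once they are named. The following is a minimal sketch, assuming that `venus` and the `coda` kernel module report plain Linux errno values, which this report does not confirm. The `decode` helper and its patterns are hypothetical and only cover the message shapes quoted here.

```python
import re

# Linux errno names for the codes that appear in the quoted logs.
# Assumption: the logged numbers are plain Linux errno values.
LINUX_ERRNO = {
    4: "EINTR (interrupted system call)",
    22: "EINVAL (invalid argument)",
    27: "EFBIG (file too large)",
    110: "ETIMEDOUT (connection timed out)",
}

# One pattern per message shape quoted in this report (hypothetical helper).
PATTERNS = [
    re.compile(r"Venus returns: -(\d+)"),
    re.compile(r"Create failed (\d+)!"),
    re.compile(r"failed \((\d+)\)"),
    re.compile(r"Select returns error: (\d+)"),
]


def decode(line):
    """Return (code, name) for the first error code found in a log line, or None."""
    for pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            code = int(match.group(1))
            return code, LINUX_ERRNO.get(code, "unknown")
    return None


if __name__ == "__main__":
    samples = [
        "coda: venus_pioctl: Venus returns: -22 for (00000001.ffffffff.fffffffc.00000000)",
        "coda: venus_pioctl: Venus returns: -4 for (00000004.01000001.00000001.00000001)",
        "fsobj::TryToCover: vdb::Get(#@.Trash) failed (110)",
        "***LWP (0x55dc733053c0): Select returns error: 4",
    ]
    for sample in samples:
        print(decode(sample))
```

Under that assumption, the `110` entries on the client would read as timeouts and the `-4` and `Select returns error: 4` entries as interrupted system calls, which might help decide which of the symptoms belong together.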

These issues might be separate or connected. I'll split them into different reports if you tell me which criteria to use for separating them. It's just very hard to understand what's going on when crashes happen due to terse, difficult-to-understand assertion failures.
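
As an example of how little the assertion messages explain on their own, the two value pairs in the `Recov_LoadRDS: heap mismatch` line differ only in their second number. The sketch below is an observation about those two logged numbers, not a diagnosis. It assumes the second number in each pair is a heap length in bytes, which the log does not state, and the names `first_len`, `second_len`, and `low32` are hypothetical.

```python
# Second values of the two pairs in the "heap mismatch" message, as logged (hex).
# Assumption: these are heap lengths in bytes; the log does not label them.
first_len = 0xD0488000
second_len = 0x2D0488000

GIB = 2**30


def low32(value):
    """Keep only the low 32 bits of an integer."""
    return value & 0xFFFFFFFF


if __name__ == "__main__":
    # The larger value reduced to 32 bits equals the smaller one.
    print(hex(low32(second_len)))            # 0xd0488000
    print(low32(second_len) == first_len)    # True
    # The difference between the two values is exactly 8 GiB.
    print((second_len - first_len) // GIB)   # 8
```

If that reading is right, one of the two values could be a 32-bit truncation of the other, but I can't tell from the log alone which side is the stored value and which is the expected one.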

I noticed that `venus` is started with a delay of a few seconds. This start is unrelated to the `coda-client` `systemd` unit, because that unit is stopped. It might interfere with a `venus -init`, which sometimes restores responsiveness of the client after a reboot but causes data loss.

Experienced with 6.11.2-1+ubuntu16.10 on Ubuntu 16.10 amd64.
