Feature/server outage screen #3199
@MaartenD There are offline features in the app, so the "Lichess is down" indicator should be less invasive in the UI. How do we distinguish between the server actually being down and the player simply being offline or experiencing network issues?
|
@HaonRekcef I missed that one. I will do my research and let you know.
|
@HaonRekcef Is the over-the-board game an offline feature? Are there more?
|
@MaartenD there are multiple. You can disable the network on your device and see which buttons are interactable and not greyed out. |
|
@HaonRekcef What about this version? This is in flight mode. It looks better to me.
Better.messaging.to.communicate.server.outage_ws_with_offline.mp4
I will upload a version showing what happens when the websocket connection isn't working later today or tomorrow. I need to figure some things out first.
|
**Behaviour during outage**

Below is what is working so far. It's still a work in progress, but before I continue I would like some answers about my approach (see Question below).

Video when the websocket isn't available:
Better.messaging.to.communicate.server.outage_ws_with_offline_wsgone.mp4

A new provider in lib/src/network/lichess_online.dart combines both checks:

Currently applied to play_menu.dart, quick_game_matrix.dart and create_game_widget.dart. Other places in the codebase still use onlineStatusProvider directly; these would benefit from the same treatment.

Question: to fully implement this feature, I would like to know whether you agree with my approach. Additionally, would you prefer lichessOnlineProvider to live in connectivity.dart alongside onlineStatusProvider rather than in a separate file?

Tests added / updated
|
Hi @MaartenD, thanks for the work on this!

    enum ConnectionStatus {
      online,
      networkDown,
      serverDown,
    }

To answer your question: I am not the final authority, but I would say it doesn't matter much as long as the code is well written and works. Personally, I would tend towards putting it into the same file.
|
Hi @HaonRekcef, thanks for your reply and the great suggestion regarding the ConnectionStatus enum. I was also leaning towards putting everything in the same file, so I will take that route.
|
Reason for change
New approach: ConnectionStatus enum
Behaviour per status
Tests
Better.messaging.to.communicate.server.outage_ws_with_offline_II.mp4
Better.messaging.to.communicate.server.outage_ws_with_offline_wsgone_II.mp4
To be clear: given the scope of this PR, I used Claude as an AI assistant during development (final implementation).
|
@MaartenD I have not read the comments but only the PR description (which I hope you updated based on the last code). I have not read the code either, but based on the description I don't see how this can work. How do you distinguish a server outage from a network disconnection? Even if the socket is disconnected for more than 30s, that does not mean the lichess WS server is down. We certainly don't want to display a message indicating that the lichess server is down if that is not the case. And I don't see how you can know that by just monitoring the WS connection. I am pretty sure this feature cannot be implemented as is, or am I missing something? I invite you to reach out to the lichess server devs on discord to see how this is implemented in the website. |
|
@veloce I updated the PR description and pushed my latest version of this outage screen feature. I have seen the conflicts and will look into them. I would love to hear what you think about this version.
…-screen # Conflicts: # lib/l10n/app_en.arb # lib/l10n/l10n.dart # lib/l10n/l10n_af.dart # lib/l10n/l10n_ar.dart # lib/l10n/l10n_az.dart # lib/l10n/l10n_be.dart # lib/l10n/l10n_bg.dart # lib/l10n/l10n_bn.dart # lib/l10n/l10n_bs.dart # lib/l10n/l10n_ca.dart # lib/l10n/l10n_cs.dart # lib/l10n/l10n_da.dart # lib/l10n/l10n_de.dart # lib/l10n/l10n_el.dart # lib/l10n/l10n_en.dart # lib/l10n/l10n_eo.dart # lib/l10n/l10n_es.dart # lib/l10n/l10n_et.dart # lib/l10n/l10n_eu.dart # lib/l10n/l10n_fa.dart # lib/l10n/l10n_fi.dart # lib/l10n/l10n_fr.dart # lib/l10n/l10n_gl.dart # lib/l10n/l10n_gsw.dart # lib/l10n/l10n_he.dart # lib/l10n/l10n_hi.dart # lib/l10n/l10n_hr.dart # lib/l10n/l10n_hu.dart # lib/l10n/l10n_hy.dart # lib/l10n/l10n_id.dart # lib/l10n/l10n_it.dart # lib/l10n/l10n_ja.dart # lib/l10n/l10n_kk.dart # lib/l10n/l10n_ko.dart # lib/l10n/l10n_lt.dart # lib/l10n/l10n_lv.dart # lib/l10n/l10n_mk.dart # lib/l10n/l10n_nb.dart # lib/l10n/l10n_nl.dart # lib/l10n/l10n_pl.dart # lib/l10n/l10n_pt.dart # lib/l10n/l10n_ro.dart # lib/l10n/l10n_ru.dart # lib/l10n/l10n_sk.dart # lib/l10n/l10n_sl.dart # lib/l10n/l10n_sq.dart # lib/l10n/l10n_sr.dart # lib/l10n/l10n_sv.dart # lib/l10n/l10n_tr.dart # lib/l10n/l10n_uk.dart # lib/l10n/l10n_uz.dart # lib/l10n/l10n_vi.dart # lib/l10n/l10n_zh.dart # test/view/home/home_tab_screen_test.dart # translation/source/mobile.xml
veloce left a comment:
Thanks for your work on this and the detailed description.
I made a lot of comments, and some are not just about the code as I have more questions about this feature and how it is handled in the website.
    responseCode: response.statusCode,
    responseDateTime: DateTime.now(),
    );
    ref
I would not put that in the global HTTP client factory. It can be used to create clients that target URIs other than the lichess main server.
There is already a LichessClient; this logic belongs there.
    final _logger = Logger('ServerStatus');

    final serverStatusProvider = NotifierProvider<ServerStatusNotifier, bool>(ServerStatusNotifier.new);

    class ServerStatusNotifier extends Notifier<bool> {
Add doc comment to explain the purpose of this notifier.
    return true;
    }

    void _onLagChange() {
There is a flaw here: you're checking only the lag change and not whether an HTTP request previously returned an error code.
That being said, I don't think that listening to the socket is the proper thing to do at all.
Lichess being down is detected when a frontend server returns a 502/503. There is a separate websocket server that communicates with the lila instance through redis. I assume the WS server can be up when the lichess backend is down, so a connected websocket does not mean a game can be played.
Thus we should rely only on HTTP status code changes from a lichess URI. cc @ornicar and @niklasf, can you confirm?
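As a rough illustration of the HTTP-only check suggested in this comment, the sketch below looks at the status code of a response and at the host it came from. The names `kLichessHost` and `isLichessBackendDown` are hypothetical and do not exist in the app; treating exactly 502 and 503 as "down" is taken from the comment above and still awaits confirmation.

```dart
/// Hypothetical host constant; the app's real configuration may differ.
const kLichessHost = 'lichess.org';

/// Returns true when a response from the lichess host signals that the
/// backend is down (502 or 503, as described in the review comment).
///
/// Responses from any other host are ignored, so a failing third-party
/// service is never reported as a lichess outage.
bool isLichessBackendDown(Uri requestUri, int statusCode) {
  if (requestUri.host != kLichessHost) return false;
  return statusCode == 502 || statusCode == 503;
}
```

Keeping the host check next to the status check is one way to address the earlier remark that clients may target URIs other than the lichess main server.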
    }, name: 'OnlineStatusProvider');

    /// Represents the connection state of the app with respect to the lichess server.
    enum ConnectionStatus {
To remove any ambiguity, it would be better to call it LichessConnectionStatus.
    Text(context.l10n.mobileServerOutageMessage, textAlign: TextAlign.center),
    const SizedBox(height: 16),
    Text(
      context.l10n.mobileServerOutageKeepInformed,

    : null,
    title: Text(context.l10n.openingExplorer),
    enabled: isOnline,
    enabled: connectionStatus == ConnectionStatus.online,
Lichess main server may be down while the opening explorer is still available. So you should really use the regular online status provider.
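One way to picture this comment is to gate each feature on what it actually needs rather than on a single "online" flag. The sketch below is only an illustration: `Requirement` and `isFeatureEnabled` are hypothetical names, and the enum simply mirrors the three values suggested earlier in this thread.

```dart
/// Mirrors the enum suggested earlier in this conversation.
enum ConnectionStatus { online, networkDown, serverDown }

/// Hypothetical description of what a feature needs in order to work.
enum Requirement {
  /// Works fully offline.
  none,

  /// Needs a network, but not the lichess main server
  /// (the opening explorer, according to the comment above).
  network,

  /// Needs the lichess main server to be reachable.
  lichessServer,
}

/// Decides whether a feature should be interactable for a given status.
bool isFeatureEnabled(Requirement requirement, ConnectionStatus status) {
  switch (requirement) {
    case Requirement.none:
      return true;
    case Requirement.network:
      return status != ConnectionStatus.networkDown;
    case Requirement.lichessServer:
      return status == ConnectionStatus.online;
  }
}
```

With this split, a tile such as the opening explorer would stay enabled while the status is `serverDown`, which is the behaviour the comment asks for.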
    networkDown,

    /// The device is online but the lichess server is unreachable.
    serverDown,
There is no distinction between the server being in planned maintenance and the server being down; we should add it since the status code can tell us that.
    children: [
      Image.asset(logo, width: 150),
      const SizedBox(height: 16),
      Text(context.l10n.mobileServerOutageMessage, textAlign: TextAlign.center),
We should probably display a different message if the server is in maintenance mode. No need to translate it for now, especially if there is no translation already available server side.
For the maintenance mode, it would make sense for the HTTP response to contain the datetime when this maintenance is supposed to end.
Now, this would be the ideal. But I don't know whether the website makes this distinction, and if it does not we should probably do the same.
Can you please reach out to the server team and keep me informed on the 2 questions raised here? (maintenance date for 503 and whether to show a different message).
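To make the two open questions more concrete, here is a sketch of what the client side could look like if the server did distinguish maintenance from an outage. Everything in it is an assumption pending the server team's answer: that 503 means maintenance and 502 means outage, and that the end of the maintenance window would arrive in the standard HTTP `Retry-After` header (either a number of seconds or an HTTP date). `ServerIssue`, `classify` and `parseRetryAfter` are hypothetical names.

```dart
import 'dart:io' show HttpDate;

/// Hypothetical split discussed in this review; not the PR's enum.
enum ServerIssue { none, outage, maintenance }

/// Assumption: 503 signals planned maintenance and 502 an outage.
ServerIssue classify(int statusCode) => switch (statusCode) {
      503 => ServerIssue.maintenance,
      502 => ServerIssue.outage,
      _ => ServerIssue.none,
    };

/// Parses a `Retry-After` header value into the moment the server is
/// expected back, or returns null when the value is missing or invalid.
///
/// [now] is passed in so that the delay form can be resolved and tested.
DateTime? parseRetryAfter(String? value, DateTime now) {
  if (value == null) return null;
  final trimmed = value.trim();

  // Delay form: a non-negative number of seconds.
  final seconds = int.tryParse(trimmed);
  if (seconds != null) {
    return seconds >= 0 ? now.add(Duration(seconds: seconds)) : null;
  }

  // Date form: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
  try {
    return HttpDate.parse(trimmed);
  } catch (_) {
    return null;
  }
}
```

If the website makes no such distinction, the simpler option is to drop `maintenance` and show a single message for both codes.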
    <string name="challengeCreated" comment="Shown as a bottom banner when another player has been challenged.">Challenge created: You will be notified when the game starts.\nYou can access it from the home tab.</string>
    <string name="previousPage" comment="Shows the previous page, e.g. in tournament standings">Previous</string>
    <string name="orImportPgnFile" comment="Button text to import a PGN file from the device">Or import a PGN file</string>
    <string name="serverOutageMessage" comment="Shown on the home screen when the Lichess server is unreachable.">Lichess is undergoing technical difficulties. We're doing everything we can, and expect to be back up very soon.</string>
Have you checked that these translations are not already available server side?
If they are, we should use them. If they are not, I'd rather not translate the mobile part yet. See the contributing guide for an explanation of why we don't translate new strings immediately.
UPDATE: How the outage screen is now implemented (June 12th, 2026)
Summary: Outage screen (server status feature)
Goal: Distinguish between "no internet connection" (the phone has no wifi/data) and "the Lichess server is unreachable" (the phone has internet, but lichess.org is down or in maintenance), and show an appropriate message for each.
Background
While working on this feature, I asked in the Lichess-Development Discord channel how the web client detects a server-side outage versus a regular connectivity issue. revoof explained:
This confirmed that checking for HTTP 502/503 responses is the right, server-endorsed way to detect an outage, rather than relying on connectivity heuristics alone, and it is the approach used for this feature.
Two situations, two messages
No internet connection (network down)
If the phone itself has no wifi/mobile data, the existing "No internet connection" message continues to work as before. Offline features (viewing saved games, playing puzzles offline, etc.) remain available — nothing changes here.
Server down (new outage screen)
If the phone does have internet, but the Lichess server is unreachable (outage or maintenance), we show a new, friendly outage screen with:
How is this detected?
We look at the responses the server sends back to requests from the app:
As soon as the server responds normally again, or the live connection (websocket) recovers after a real interruption, the outage screen disappears automatically.
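The detection and automatic recovery described in this section can be sketched as a small, framework-free state holder. This is not the PR's notifier: `ServerStatusTracker` and its method names are hypothetical, and leaving the state unchanged on other 5xx codes is an assumption made only for this example.

```dart
/// Hypothetical stand-in for the server status logic described above.
class ServerStatusTracker {
  bool _serverDown = false;
  bool _socketInterrupted = false;

  /// True while the outage screen should be shown.
  bool get serverDown => _serverDown;

  /// Feed every HTTP status code received from the lichess server.
  void onHttpStatus(int statusCode) {
    if (statusCode == 502 || statusCode == 503) {
      // The server-endorsed outage signal.
      _serverDown = true;
    } else if (statusCode < 500) {
      // A normal response: the server is back.
      _serverDown = false;
    }
    // Other 5xx codes leave the state unchanged (assumption).
  }

  /// Call when the websocket drops.
  void onSocketDisconnected() => _socketInterrupted = true;

  /// Call when the websocket connects. The outage state is cleared only
  /// if there was a real interruption beforehand.
  void onSocketConnected() {
    if (_socketInterrupted) {
      _socketInterrupted = false;
      _serverDown = false;
    }
  }
}
```

Clearing the outage state on a websocket reconnect follows the description above; given the review remark that the websocket server can be up while the backend is down, an HTTP-only variant would simply drop the two socket methods.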
Pull-to-refresh
On the outage screen, the user can pull-to-refresh to manually check whether the server is back up — without causing any extra network traffic beyond what the app normally does.
Respecting offline capabilities
This feature follows the same approach already used elsewhere in the app: whatever works offline keeps working, and anything that requires an online connection becomes non-clickable during a server outage (for example, the "Players/friends" and "Challenges" buttons in the top app bar). This avoids sending the user to a screen that wouldn't be able to load anyway, and the confusing error messages that would result from that.
Tests
Automated tests were added/updated for this feature, including checks that:
Known limitation
This implementation does not yet cover the case where only the websocket connection is unavailable (while regular HTTP requests still succeed). In that scenario, the user currently gets no specific message about it. This is something I would like to discuss further before deciding on the right approach.
For transparency: Claude helped me with the design and implementation of this feature.
First video: mobile data / Wi-Fi unavailable.
outage-no-data-connection.mp4
Second video: how the outage screen behaves.
I tested this by stopping and starting the lila-1 service of my local Docker container. I needed to refresh a few times at the end of the video.
outage-backend-down.mp4
Fixes #1016