...
So, why do connections "go stale"? The fact is that TCP/IP, and the Internet built on it, is designed to work with disruption. Network outages, re-routes, and other hiccups occur all the time. The longer a connection stays open, the higher the likelihood that such a hiccup will impact it. Unfortunately, some use cases require long-running connections that are one-way in nature, such as an SSE stream. And, as indicated above, in this scenario the client has to be responsible for determining the health of the connection, because it is the side receiving data. The server has no way of knowing whether the client actually received that data, and therefore can't really determine if there is an issue on the connection.
The recommendation, in this case, is for the client to restart the connection as soon as it determines no events are actually coming through. This is the real value of the POKE events. We expect them every 5 seconds, so theoretically, not receiving any POKE events (or DATA events) for more than 5 seconds should be alarming. But don't move too quickly. Sometimes things happen and a POKE is delayed, though it is not likely ever missing. So, wait 30 seconds or a minute. If no POKE comes in during that time, it's probably time to kill the client side of the connection and try again. The longer this wait is, though, the longer the server might be trying to send DATA events up the connection without the client receiving them. This can mean dropped events. As such, if the client determines it needs to shut down the connection for this reason, it should also delete the stream and create a new one. Unfortunately, the stream creation process is only granular at the day level, so the client would then need to re-process events for the entire day up to the point of termination. And, because even a single "hung" connection can lose DATA events, you must terminate all connections before recreating the stream. Lost events stay lost; only a new stream or a replay stream can be used to recover them. And even though a replay stream is available, recreating the stream is highly recommended unless the org is so high-volume that it would need resources beyond its maximum of 20 connections to handle normal volume along with the replay.
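The staleness check described above can be sketched as a small client-side watchdog. This is only an illustration, not part of any official client: the class name `StaleStreamWatchdog`, its methods, and the constants are hypothetical, and the 30-second threshold is simply the lower end of the "30 seconds or a minute" suggested above.

```python
import time
from typing import Callable

# From the text: a POKE is expected roughly every 5 seconds.
POKE_INTERVAL_SECONDS = 5.0
# From the text: wait 30 seconds to a minute before giving up (assumed default).
STALE_AFTER_SECONDS = 30.0


class StaleStreamWatchdog:
    """Tracks how long a stream connection has been silent (hypothetical helper)."""

    def __init__(
        self,
        stale_after: float = STALE_AFTER_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._stale_after = stale_after
        self._clock = clock
        # Start counting from the moment the connection is opened.
        self._last_event_at = clock()

    def record_event(self) -> None:
        """Call for every POKE or DATA event read from the connection."""
        self._last_event_at = self._clock()

    def seconds_silent(self) -> float:
        """Seconds elapsed since the last POKE or DATA event."""
        return self._clock() - self._last_event_at

    def is_stale(self) -> bool:
        """True once the silence exceeds the configured threshold."""
        return self.seconds_silent() > self._stale_after
```

In a real client, `record_event()` would be called from the loop that reads the SSE stream, and `is_stale()` would be checked from a timer or a read timeout. Once it returns true, the client would follow the steps described above: terminate all connections, delete the stream, and create a new one. Those calls are left out here because they depend on the specific API in use.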