SSE connections are, by their nature, one-way: the server sends and the client receives. The client has no way to acknowledge the events, or the connection itself, and confirm to the server that things are OK, so the server isn't listening for that. What this ultimately means is that the client is responsible for managing the connection. There are two recommendations, which ideally should be used in conjunction with each other.

The first recommendation is to force a disconnect after a set number of consecutive POKEs. Consecutive POKEs mean that no DATA is actively coming through; it takes a large volume of data for a connection never to see consecutive POKEs. You can determine what makes sense for your organization, but 5 is a good starting point. 5 POKEs is about 25 seconds. If there has been no data in 25 seconds, disconnect, let some data build up, wait a few seconds, and reconnect. If there are, for example, 5 more consecutive POKEs before any DATA events arrive, disconnect again, this time waiting longer before reconnecting. Continue this pattern, increasing the reconnect wait (up to a maximum), until DATA events arrive again. Whenever a DATA event arrives, the counters and timers reset. This pattern is recommended because it doesn't leave a connection open unnecessarily with no actionable data, which is good for "connection hygiene" (see recommendation 2 below) and is considered good security practice. Only leave connections open that have value.
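The counting and back-off pattern described above can be sketched as a small state tracker. This is only an illustration, not Validic client code: the class name, the 5-second starting wait, the 300-second ceiling, and the choice to double the wait after each idle disconnect are all assumptions; the text only says to wait longer each time, up to a maximum.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PokeBackoff:
    """Counts consecutive POKEs and decides when to disconnect and for how long."""

    poke_limit: int = 5        # consecutive POKEs before disconnecting (about 25 s)
    base_wait: float = 5.0     # assumed first reconnect wait, in seconds
    max_wait: float = 300.0    # assumed ceiling on the reconnect wait, in seconds
    consecutive_pokes: int = 0
    idle_disconnects: int = 0  # disconnects since the last DATA event

    def on_data(self) -> None:
        """A DATA event resets both the POKE counter and the wait timer."""
        self.consecutive_pokes = 0
        self.idle_disconnects = 0

    def on_poke(self) -> Optional[float]:
        """Return seconds to wait before reconnecting, or None to stay connected."""
        self.consecutive_pokes += 1
        if self.consecutive_pokes < self.poke_limit:
            return None
        # Limit reached: disconnect, and wait longer than last time (assumed doubling).
        self.consecutive_pokes = 0
        wait = min(self.base_wait * 2 ** self.idle_disconnects, self.max_wait)
        self.idle_disconnects += 1
        return wait
```

A client loop would call `on_poke()` for each POKE and, when it returns a number, close the connection, sleep for that many seconds, and reconnect; any DATA event goes to `on_data()`.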

Validic is unable to determine what data the client has ingested, and whether a connection is being maintained. Due to that, maintaining a connection to the streaming API for continuous data ingestion is the client's responsibility.

The first recommendation is that if you don’t expect data for an extended period of time, i.e., during maintenance, testing, or expected periods of inactivity, you should disconnect your stream and reconnect when the period of inactivity is over. Validic will maintain a customer's checkpoint in a stream for 7 days after a disconnect. If the reconnection happens during that 7-day window, there will be no data loss.
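A client that disconnects deliberately may want to record when it did so and check the 7-day window before relying on the checkpoint. The helper below is a hypothetical sketch; the function name is invented, and treating exactly 7 days as already expired is a conservative assumption, not documented behavior.

```python
from datetime import datetime, timedelta, timezone

CHECKPOINT_RETENTION = timedelta(days=7)  # retention window stated above


def checkpoint_still_valid(disconnected_at: datetime, now: datetime) -> bool:
    """True if reconnecting at `now` falls inside the 7-day checkpoint window."""
    # Strict comparison: exactly 7 days is treated as expired (assumption).
    return now - disconnected_at < CHECKPOINT_RETENTION


disconnected = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
within = checkpoint_still_valid(disconnected, disconnected + timedelta(days=6))
expired = checkpoint_still_valid(disconnected, disconnected + timedelta(days=8))
```

If the check fails, the checkpoint can no longer be relied on to prevent data loss on reconnect.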

The second recommendation concerns the scenario where a long-running connection to a stream goes “stale”. So, why do connections "go stale"? The fact is that TCP/IP, and the Internet built on it, are designed to work with disruption. Network outages, reroutes, and other hiccups occur all the time. The longer a connection stays open, the higher the likelihood that such a hiccup will impact it. Unfortunately, some use cases require long-running connections that are one-way in nature, such as an SSE stream. In this scenario, the client has to be responsible for determining the health of the connection, because it is the side receiving data. Validic has no way of knowing if the client actually received the data, and therefore can't really determine if there is an issue on the connection.

This is where the POKE events come in handy. Every five seconds, when no data events or connection events are coming through the stream, you will see a POKE event in your stream.
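Because a POKE arrives every five seconds on an otherwise idle stream, complete silence is itself a signal. One way to track it is a small watchdog that records the time of the last event of any kind. The sketch below is illustrative only: the class name and the injectable clock are assumptions made for testability, and the 60-second default follows the 1-minute threshold used elsewhere on this page.

```python
import time
from typing import Callable


class StalenessWatchdog:
    """Flags a stream as stale when no event of any kind arrives within `timeout`."""

    def __init__(self, timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self._clock = clock
        self._last_event = clock()  # start counting from creation time

    def record_event(self) -> None:
        """Call for every POKE, data, or connection event received."""
        self._last_event = self._clock()

    def is_stale(self) -> bool:
        """True once more than `timeout` seconds have passed without any event."""
        return self._clock() - self._last_event > self.timeout
```

The read loop would call `record_event()` for each event, while a timer or a read timeout periodically checks `is_stale()`.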

The client should be aware when no data, connection, or POKE events are coming through. This is the real value of the POKE events. We expect them every 5 seconds, so theoretically, not receiving any POKE events (or DATA events) for more than 5 seconds should be alarming. But don't move too quickly. Sometimes things happen and a POKE is delayed, though it is unlikely ever to be missing. So, wait 30 seconds to 1 minute. If no POKE, data, or connection events come in during that time, you have a stale stream, and it's probably time to kill the client side of the connection and try again. The longer this wait is, the longer the server might be trying to send DATA events over the connection without the client receiving them. This can mean dropped events. As such, when the client determines it needs to shut down the connection for this reason, it should also delete the stream and create a new one. Unfortunately, the stream creation process is only granular at the day level, so the client must then re-process events for the entire day up to the point of termination. And, because even a single "hung" connection can lose DATA events, you must terminate all connections before recreating the stream. Lost events are lost; only a new stream or a replay stream can be used to recover them. Even though a replay stream is available, recreating the stream is highly recommended unless the org is so high-volume that it needs resources beyond its maximum of 3 connections to handle normal volume along with the replay.

In this scenario, the recommendation would be to take the following action:

  1. Disconnect the stale stream

  2. Delete the stale stream

  3. Create a new stream with the same resource_filter and event_type_filter as the previous stream. Set the start_date for today. That will ensure you replay the day’s events to make sure you don’t drop any data.

  4. Connect to the new stream

    1. Make sure you have logic in place to recognize what is already written to your DB, what is a new event, and what is an update event to ensure no duplicate data and that all updates are captured. 

    2. Be aware that when you use a start_date on a stream, behind the scenes the stream will run through all the data in the last 30 days. You should expect to see a long period of POKE events until the stream reaches the day you entered as the start_date on the stream. This is expected behavior and shouldn’t be a cause for concern.

  5. Sit back and rely on the logic you wrote that noticed the ‘stale’ stream previously to catch any scenario where no POKE, data, or connection events come through for 1 minute. If it occurs again, then repeat the steps above.
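The steps above can be sketched as follows. Everything here is hypothetical scaffolding rather than Validic's API: the `StreamSpec` type, the four callables standing in for the real disconnect, delete, create, and connect calls, the ISO format for start_date, and the dictionary standing in for your DB are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class StreamSpec:
    resource_filter: Tuple[str, ...]
    event_type_filter: Tuple[str, ...]
    start_date: str  # assumed ISO format, e.g. "2024-05-06"


def recover_stale_stream(old: StreamSpec, today: date,
                         disconnect: Callable[[], None],
                         delete: Callable[[], None],
                         create: Callable[[StreamSpec], str],
                         connect: Callable[[str], None]) -> StreamSpec:
    """Run steps 1-4: disconnect, delete, recreate with today's start_date, connect."""
    disconnect()                                          # step 1
    delete()                                              # step 2
    new_spec = StreamSpec(old.resource_filter,            # step 3: same filters,
                          old.event_type_filter,          # start_date set to today
                          today.isoformat())
    stream_id = create(new_spec)
    connect(stream_id)                                    # step 4
    return new_spec


def apply_event(db: Dict[str, dict], record_id: str, record: dict) -> str:
    """Classify a replayed event so duplicates are skipped and updates are kept."""
    if record_id not in db:
        db[record_id] = record
        return "new"
    if db[record_id] == record:
        return "duplicate"  # already written; nothing to do
    db[record_id] = record
    return "update"
```

Passing the API calls in as functions keeps the ordering logic testable without a network connection; in a real client they would wrap whatever HTTP calls you use to manage streams.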