Hi Sebastian,
The aim behind "stop a bot after X consecutive failures within a time frame of Y minutes" was to stop a bot completely so that erroneous event alerts do not end up with our clients. We are implementing a few bots of our own, and as these bots are new (read: not production-tested), we want to make sure they won't trouble our clients if (when) there are problems in our implementation.
As said, the error handling document you mentioned in your earlier message covers this.
Cheers, Mika
----- Original Message -----
From: "Sebastian Wagner" <wagner@cert.at>
To: "Mika Silander" <mika.silander@csc.fi>
Cc: "intelmq-dev" <intelmq-dev@lists.cert.at>
Sent: Tuesday, 16 February, 2021 13:17:39
Subject: Re: [IntelMQ-dev] Bot behaviour in case of unrecoverable errors
Dear Mika,
On 2/16/21 11:58 AM, Mika Silander wrote:
Thanks for answering. I've been busy with other things, so no problem with the delayed answer. As for my question on how to react when a bot "dies", I see it should be rephrased as "how to react to exceptions in a bot?". The error handling URL below suggests that I could set the parameter error_procedure (plus error_max_retries and error_retry_delay), and that this should cover what is needed, especially if a restart afterwards requires manual operation.
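For illustration, the retry-then-stop behaviour described above can be sketched in a few lines of Python. This is a minimal sketch, not IntelMQ's actual implementation: only the three parameter names come from the message, while the values, the "stop"/"pass" choices, the function process_with_policy and the exception BotStopped are assumptions made for the example.

```python
import time

# Hypothetical settings; the key names come from the message above,
# the values are illustrative assumptions, not recommendations.
SETTINGS = {
    "error_procedure": "stop",   # assumed choices: "stop" or "pass"
    "error_max_retries": 3,      # retries after the first failed attempt
    "error_retry_delay": 0,      # seconds; 0 so the sketch runs instantly
}


class BotStopped(Exception):
    """Raised by this sketch when the policy decides the bot must stop."""


def process_with_policy(process, message, settings=SETTINGS, sleep=time.sleep):
    """Call process(message), retrying on exceptions, then stop or skip.

    Returns True if the message was processed and False if it was skipped.
    Raises BotStopped when all attempts fail and the procedure is "stop".
    """
    attempts = settings["error_max_retries"] + 1  # first try plus retries
    for attempt in range(attempts):
        try:
            process(message)
            return True
        except Exception:
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                sleep(settings["error_retry_delay"])
    if settings["error_procedure"] == "stop":
        # The bot stays down until someone restarts it manually.
        raise BotStopped(f"giving up after {attempts} attempts")
    return False  # "pass": skip this message and carry on
```

With "stop", a persistently failing bot halts after the configured retries and needs a manual restart, which is what keeps erroneous events from being passed on downstream.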
What is your aim? If you tell me how you want IntelMQ to behave, I can suggest specific settings. The defaults should already be sane, aside from the fact that stopped components are not automatically restarted (but that is the same behavior as for systemd/... services).
And I can discard the (elaborate) option of making the bot always analyze its own log entries (to discover repetitive failures/exceptions) at startup.
Yes, that's not necessary. If you want to keep an eye on the logs and errors yourself, take a look at the logcheck rule set:
https://github.com/certtools/intelmq/tree/develop/contrib/logcheck
best regards Sebastian