Bot behaviour in case of unrecoverable errors


Mika Silander

2 Feb 2021

12:14 p.m.

Hi,

Trying to assess what safeguards are sufficient: what happens when a bot has some internal failure and it "dies"? Will IntelMQ restart the bot automatically, or will it be up to the IntelMQ admin to restart it manually? And if automatic restarts are the norm, how could one stop the bot from processing new incoming messages if, say, X consecutive failures like these have happened within the last 5 minutes? By writing some log entries at bot startup and then making the bot itself analyze the log at every restart?

I'm trying to make sure a burst of erroneous/malformed events is not accidentally forwarded by a malfunctioning or partially functioning bot.

Cheers, Mika


Sebastian Wagner

12 Feb 2021

8:45 p.m.

Dear Mika,

Sorry for the late response. I have seen the mail, but postponed answering to later and then I forgot...

On 2/2/21 1:14 PM, Mika Silander wrote:

...

Trying to assess what safeguards are sufficient: what happens when a bot has some internal failure and it "dies"?

IntelMQ has internal error handling, so a raised exception, e.g. in the bot's process() method, does not lead to the bot dying. Documentation on this can be found at https://intelmq.readthedocs.io/en/latest/user/configuration-management.html#...

Please let us know if information is missing there so we can improve it.

...

Will intelmq restart the bot automatically or will it be up to the admin of intelmq to manually restart it?

Currently there is no automatic restart by default. As of now IntelMQ has no watcher/supervising daemon of its own, but we have

- integration into supervisord: https://intelmq.readthedocs.io/en/latest/user/configuration-management.html#...
- a script to generate systemd service files for bots: https://github.com/certtools/intelmq/tree/develop/contrib/systemd (and, as I am reminded just now, that is really badly documented)

...

And if automatic restarts are the norm, how could one stop the bot from processing new incoming messages if, say, X consecutive failures like these have happened within the last 5 minutes?

The error handling takes care of that. By default, the bot tries to process a message up to three times, then gives up on it, dumps it to disk for further inspection by the administrator, and continues with the next message. The erroneous message is removed from the queue.

For parsers you can reduce the parameter error_max_retries, as they don't depend on external resources and temporary failures can't happen. For experts which make external lookups, retries are perfectly fine.
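The behaviour described above can be modelled roughly as follows. This is a simplified sketch, not IntelMQ's actual implementation: `process_with_retries`, `handler` and `parse` are hypothetical names, the in-memory `dumped` list stands in for the dump written to disk, and `max_attempts` stands for the "up to three times" mentioned above.

```python
from typing import Callable, Iterable, List, Tuple


def process_with_retries(
    messages: Iterable[str],
    handler: Callable[[str], str],
    max_attempts: int = 3,
) -> Tuple[List[str], List[str]]:
    """Try each message up to max_attempts times; set aside those that keep failing."""
    forwarded: List[str] = []
    dumped: List[str] = []  # stands in for the on-disk dump (assumption)
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                forwarded.append(handler(message))
                break  # success: move on to the next message
            except Exception:
                if attempt == max_attempts:
                    # Give up on this message only; it is never forwarded.
                    dumped.append(message)
    return forwarded, dumped


def parse(raw: str) -> str:
    """Toy parser (hypothetical) that rejects anything that is not an integer."""
    return str(int(raw))


forwarded, dumped = process_with_retries(["1", "oops", "3"], parse)
print(forwarded)  # ['1', '3']
print(dumped)     # ['oops']
```

In this model the malformed message is set aside after the last failed attempt and the following message is still processed, which matches the description above.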

For more information on the dumping functionality and how to process these dumps, see https://intelmq.readthedocs.io/en/latest/user/configuration-management.html#...

...

By writing some log entries at bot startup and then making the bot itself analyze the log at every restart?

I'm trying to make sure a burst of erroneous/malformed events is not accidentally forwarded by a malfunctioning or partially functioning bot.

That won't happen, except if you explicitly configure IntelMQ to do so.

Hope that helps. If it doesn't - don't dare to ask :)

best regards Sebastian

-- // Sebastian Wagner wagner@cert.at - T: +43 1 5056416 7201 // CERT Austria - https://www.cert.at/ // Eine Initiative der nic.at GmbH - https://www.nic.at/ // Firmenbuchnummer 172568b, LG Salzburg

Mika Silander

16 Feb 2021

10:58 a.m.

Hi Sebastian,

Thanks for answering. I've been busy with other things, so no problem with the delayed answer. As for my question on how to react when a bot "dies", I see it should be rephrased as "how to react to exceptions in a bot?". The error handling URL you sent suggests that I could set the parameter error_procedure (+ error_max_retries + error_retry_delay), and that should cover what is needed, especially if a restart after this requires manual operation.

And I can discard the (elaborate) option of making the bot always analyze its own log entries (to discover repetitive failures/exceptions) at startup.

Cheers, Mika


Sebastian Wagner

11:17 a.m.

Dear Mika,

On 2/16/21 11:58 AM, Mika Silander wrote:

...

Thanks for answering. I've been busy with other things, so no problem with the delayed answer. As for my question on how to react when a bot "dies", I see it should be rephrased as "how to react to exceptions in a bot?". The error handling URL you sent suggests that I could set the parameter error_procedure (+ error_max_retries + error_retry_delay), and that should cover what is needed, especially if a restart after this requires manual operation.

What is your aim? If you tell me how you want IntelMQ to behave, I can suggest specific settings. The defaults should already be sane, aside from the fact that stopped components are not automatically restarted (but that's the same behavior as for systemd/... services).

...

And I can discard the (elaborate) option of making the bot always analyze its own log entries (to discover repetitive failures/exceptions) at startup.

Yes, that's not necessary. If you want to keep an eye on the logs and errors yourself, take a look at the logcheck rule set:

https://github.com/certtools/intelmq/tree/develop/contrib/logcheck

best regards Sebastian


Mika Silander

11:35 a.m.

Hi Sebastian,

The aim with this "stop a bot after X consecutive failures within a time frame of Y minutes" was to stop a bot completely in order to keep erroneous event alerts from reaching our clients. We are implementing a few bots of our own, and as these bots are new (read: not production tested), we want to make sure they won't trouble our clients if (when) there are problems in our implementation.

As said, the error handling document you mentioned in your earlier message covers this.

Cheers, Mika


Sebastian Wagner

11:59 a.m.

Hi,

On 2/16/21 12:35 PM, Mika Silander wrote:

...

The aim with this "stop a bot after X consecutive failures within a time frame of Y minutes" was the need to stop a bot completely

The error handling is not time-aware. With error_procedure = stop and error_max_retries = X-1 you can cover the first part. But the bot will only stop if the error occurs X times in a row.
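A rough model of that combination, as a sketch only: `run_until_stop` and `FlakyHandler` are hypothetical names, not IntelMQ APIs, and the retry count is interpreted as described above (X-1 retries, so X failed attempts in a row on the same message stop the bot).

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def run_until_stop(
    messages: Iterable[str],
    handler: Callable[[str], str],
    error_max_retries: int,
) -> Tuple[List[str], Optional[str]]:
    """Process messages; stop once one message has failed error_max_retries + 1 times in a row."""
    processed: List[str] = []
    for message in messages:
        failures = 0
        while True:
            try:
                processed.append(handler(message))
                break  # success resets the count for the next message
            except Exception:
                failures += 1
                if failures > error_max_retries:
                    # Models "stop": nothing after this message is touched.
                    return processed, message
    return processed, None


class FlakyHandler:
    """Toy handler (hypothetical) that fails a fixed number of times per message."""

    def __init__(self, failures_before_success: Dict[str, int]) -> None:
        self.remaining = dict(failures_before_success)

    def __call__(self, message: str) -> str:
        if self.remaining.get(message, 0) > 0:
            self.remaining[message] -= 1
            raise RuntimeError("simulated failure")
        return message.upper()


# X = 3 consecutive failures should stop the bot, so error_max_retries = X - 1 = 2.
print(run_until_stop(["a", "b", "c"], FlakyHandler({"b": 2}), error_max_retries=2))
# (['A', 'B', 'C'], None)
print(run_until_stop(["a", "b", "c"], FlakyHandler({"b": 3}), error_max_retries=2))
# (['A'], 'b')
```

Two failures followed by a success do not stop the modelled bot; only the third failure in a row does, and the message after it is never processed.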

Hope that helps Sebastian


Mika Silander

12:43 p.m.

Thanks for pointing out it doesn't take time into account but rather the number of failures. Still, I think we can use this instead of time.
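Should a time-based limit be needed after all, it would have to be implemented separately, since the error handling discussed here only counts failures. A minimal sketch of such a check, assuming it lives in the custom bot's own code: `FailureWindow` is a hypothetical helper, not part of IntelMQ, and the timestamps are passed in explicitly so the example is deterministic (in real use they could come from `time.monotonic()`).

```python
from collections import deque
from typing import Deque


class FailureWindow:
    """Report when max_failures consecutive failures fall within window_seconds."""

    def __init__(self, max_failures: int, window_seconds: float) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures: Deque[float] = deque()

    def record_success(self) -> None:
        # A success breaks the run of consecutive failures.
        self._failures.clear()

    def record_failure(self, now: float) -> bool:
        """Record a failure at time `now`; return True if the bot should stop."""
        self._failures.append(now)
        # Drop failures that are older than the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures


window = FailureWindow(max_failures=3, window_seconds=300.0)
# Failures spread over more than five minutes do not trip the limit.
spread = [window.record_failure(t) for t in (0.0, 200.0, 400.0)]
window.record_success()
# Three failures within 20 seconds do.
burst = [window.record_failure(t) for t in (1000.0, 1010.0, 1020.0)]
print(spread)  # [False, False, False]
print(burst)   # [False, False, True]
```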

Cheers, Mika



intelmq-dev@lists.cert.at
