[IntelMQ-dev] STATS IN INTELMQ?

Mon Nov 27 16:23:36 CET 2023

Hi Mika,

Thanks for the explanation of your monitoring. I think we monitor mostly 
similar things, but there is a small misunderstanding: the things in the 
second paragraph of my message do not come from the statistics in the 
Redis DB, but mostly from intelmqctl :)

Here is some additional explanation that I think may be useful:

re: number of events & botnet size: this is exactly my case, with more 
than 100 bots in use (and growing).

re: size of input queues: in my case, it helps us with e.g. detecting 
stuck output bots (we have a few different ones) and occasional 
bottlenecks in the botnet. Of course, it also alerts when the given 
maximum is reached.
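
A minimal sketch of such a maximum-size check, not the actual script in 
use here. The queue names, the threshold and the function name are made 
up for illustration; how the sizes are obtained (for example from 
`intelmqctl list queues` or directly from Redis) is left out on purpose.

```python
from typing import Dict, List


def overfull_queues(sizes: Dict[str, int], max_size: int) -> List[str]:
    """Return the sorted names of queues that have reached max_size."""
    return sorted(name for name, size in sizes.items() if size >= max_size)


# Hypothetical snapshot of input queue sizes (names are assumptions).
snapshot = {"file-output-queue": 12500, "deduplicator-expert-queue": 3}

# Queues listed here would trigger an alert in the monitoring system.
alerts = overfull_queues(snapshot, max_size=10000)
```

A queue that stays in this list over several checks hints at a stuck 
output bot rather than a short burst of events.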

re: collectors producing data: this is also a little different in my 
case, I have e.g. multiple collectors talking to only one parser.

re: logging level per bot: this is easily possible, just set the 
'logging_level' in the bot's config.
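
To illustrate the idea of a per-bot override, here is a small sketch 
that resolves the level a bot would end up with. It assumes the runtime 
configuration has already been loaded into a dict with a "global" 
section and a "parameters" section per bot; that layout, the bot ids, 
the "INFO" fallback and the helper name are assumptions for the example, 
not taken from this thread.

```python
from typing import Any, Dict


def effective_logging_level(runtime: Dict[str, Any], bot_id: str) -> str:
    """Per-bot 'logging_level' wins; otherwise fall back to the global one."""
    bot_params = runtime.get(bot_id, {}).get("parameters", {})
    global_params = runtime.get("global", {})
    # "INFO" as the last fallback is an assumption made for this sketch.
    return bot_params.get(
        "logging_level", global_params.get("logging_level", "INFO")
    )


# Hypothetical configuration: one bot overrides the level, one does not.
runtime = {
    "global": {"logging_level": "INFO"},
    "noisy-parser": {"parameters": {"logging_level": "WARNING"}},
    "quiet-collector": {"parameters": {}},
}

parser_level = effective_logging_level(runtime, "noisy-parser")
collector_level = effective_logging_level(runtime, "quiet-collector")
```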

It looks like the operational monitoring can also vary between 
deployments, depending on local needs. There is no one-size-fits-all 
approach ;)

Best regards

// Kamil Mańkowski <mankowski at cert.at> - T: +43 676 898 298 7204
// CERT Austria - https://www.cert.at/
// CERT.at GmbH, FB-Nr. 561772k, HG Wien

On 11/27/23 15:42, Mika Silander wrote:
> Hi Kamil,
> 
>   Thanks for elaborating on your monitoring setup and needs. Here's a description of ours:
> 
> - number of events: from our event logger expert bot; we can duplicate it on the paths that need to be monitored
>    - our botnet is small so it is not an issue, but for a bigger one the number of extra bots due to this could be big
> 
> - number of errors: from the logs for those bots that do report them to their log files
>    - the weakness for us here is one can't rely on uniform/consistent reporting of errors from one bot to another (depends on implementation of each, right?)
>    - and, we do check all log files for exceptions, but an exception is usually an indication there could be a severe implementation problem in the bot itself,
>    - and, we monitor presence of event dump files
> 
> - number of bot runs / run frequency can be pulled from log files assuming the log level is sufficiently talkative
> 
> - we don't monitor the size of the input queues since the number neither indicates a problem nor tells whether the processing of events progresses.
>    One should check whether individual events "move forward" in the bot chain to answer that. At most I'd monitor that the number does not grow past some allowed maximum,
>    and the same information can probably be pulled from redis (or the message queue used) using their own tools or some wrapper scripts(?)
> 
> - whether bots are running: we use "intelmqctl status" and just analyse every bot's status line. Is this too naive an approach?
> 
> - whether collectors are producing new data: we can detect this indirectly by analysing the logs of the parsers, i.e. do events get forwarded sufficiently often?
>    - this requires of course we have parsers that write something useful to the logs when they receive a message from a collector and forward events.
> 
>   All in all, I think we aggregate almost the same monitoring data, although we have to ensure sufficiently detailed information is written to the logs ... so we also monitor that the log level remains at INFO or chattier :-). It would be handy to be able to set the log level per bot to reduce the logging where applicable ... have to check if this can be done.
> 
> Br, Mika
> 
> ----- Original Message -----
> From: "Kamil Mankowski via IntelMQ-dev" <intelmq-dev at lists.cert.at>
> To: "intelmq-dev" <intelmq-dev at lists.cert.at>
> Sent: Monday, 27 November, 2023 15:48:39
> Subject: Re: [IntelMQ-dev] STATS IN INTELMQ?
> 
> I collect the number of events processed, overall and separately for
> each destination path, the number of errors, bot runs, and the size of
> the input queue. The last one is not visible in the logs, the number of
> processed events shows up only with some (larger) delay, and the
> separation by target path only in debug mode (as far as I can remember;
> although this is rarely useful).
> 
> I also check whether the bots are running, whether the collectors are
> producing new data, and the error rate, as well as the execution of
> periodic jobs (db updates etc). It's not perfect, but it works pretty
> well :)
> 
> Best regards
> 
> // Kamil Mańkowski <mankowski at cert.at> - T: +43 676 898 298 7204
> // CERT Austria - https://www.cert.at/
> // CERT.at GmbH, FB-Nr. 561772k, HG Wien
> 
> On 11/27/23 14:31, Mika Silander wrote:
>> Hi Kamil,
>>
>>    Thanks for the clarification. I've also noticed there are some hurdles and problems when relying entirely on logs for monitoring. Just out of curiosity, what are the operational monitoring needs you would tackle with statistics collection in intelmq (and can't be covered using normal logs)? I think we have managed so far with using the logs as the only source of information for monitoring, and it would be interesting to compare and see if we've possibly forgotten some monitoring target/issue.
>>
>> Br, Mika
>>
>> ----- Original Message -----
>> From: "Kamil Mankowski via IntelMQ-dev" <intelmq-dev at lists.cert.at>
>> To: "intelmq-dev" <intelmq-dev at lists.cert.at>
>> Sent: Monday, 27 November, 2023 14:59:34
>> Subject: Re: [IntelMQ-dev] STATS IN INTELMQ?
>>
>> Hi Mika,
>>
>> Thanks for the comment. I totally agree that the statistics needed are
>> definitely different for different teams.
>>
>>    > I would vote for removing the initial statistics functionality from
>> intelmq and keep statistics collection completely separate.
>>
>> I can't agree with that, because there are two different types of
>> statistics - as you said, the monitoring and the business statistics. I
>> think I used the word "statistics", but what I meant with my second
>> point is purely operational monitoring, and I'd like to keep the
>> built-in functionality operation-oriented. Monitoring through logs is
>> also possible, although it has some drawbacks.
>>
>> However, there have been some ideas to introduce hook capabilities. If
>> we want to change the monitoring (e.g. integrate with OpenTelemetry)
>> and/or allow more statistics directly from IntelMQ, I'd go in the
>> direction of optional support via hooks, without extending the base
>> source code.
>>
>> Best regards
>>
>> // Kamil Mańkowski <mankowski at cert.at> - T: +43 676 898 298 7204
>> // CERT Austria - https://www.cert.at/
>> // CERT.at GmbH, FB-Nr. 561772k, HG Wien
>>
>> On 11/27/23 13:41, Mika Silander wrote:
>>> Hi Kamil & all,
>>>
>>>     From my experience it looks like collecting statistics varies wildly based on each team's particular needs. In our case, the built-in statistics functionality does not provide all the info we need, so we wrote a simple expert bot to collect statistics on the events processed (IPs, constituency, classification.type, abuse C address etc.). It outputs the statistics to rsyslog. In principle, we can put another instance of the bot at a later stage in the bot chain to send the same information about events that have passed all sanity checks done by the expert bots in between. This way the same bot answers the questions a) how many events have we processed, b) how many of these events have been successfully passed on to our clients (=constituency). Finally, the ticketing system gives us statistics about how many of those events have been properly resolved, but that's out of scope in terms of generating statistics inside intelmq.
>>>
>>>     Don't get me wrong, but in this case I would vote for removing the initial statistics functionality from intelmq and keep statistics collection completely separate. This would also keep the intelmq code base (marginally) smaller.
>>>
>>>     I also remember a recommendation on this forum that intelmq instances should be monitored with the help of the log files created. I see monitoring and statistics collection as two orthogonal functionalities and this is the other reason why I would like to keep statistics collection independent, i.e. neither as part of the core modules of intelmq, nor as an indirect means of monitoring.
>>>
>>>     Well, this is just my view on the issue (the famous 5 cents), and, I am unaware of other use cases and requirements that could tilt the balance towards the current way of implementing statistics.
>>>
>>> Br, Mika
>>>
>>> ----- Original Message -----
>>> From: "Kamil Mankowski via IntelMQ-dev" <intelmq-dev at lists.cert.at>
>>> Cc: "intelmq-dev" <intelmq-dev at lists.cert.at>
>>> Sent: Monday, 27 November, 2023 13:23:56
>>> Subject: Re: [IntelMQ-dev] STATS IN INTELMQ?
>>>
>>> Hi all,
>>>
>>> I think there is currently no ongoing work on improved statistics
>>> directly in IntelMQ, but if you have development capacity to extend
>>> the current state, I think it could be useful.
>>>
>>> However, I can say what is currently available, and how I use/plan to
>>> use it:
>>>
>>> 1) for final events, we use - as mentioned by Aaron - a database. We are
>>> going to use Timescale DB, as described in:
>>> https://docs.intelmq.org/develop/admin/database/postgresql/#using-eventdb-with-timescale-db
>>> Currently we have some set of scripts generating stats, and we plan to
>>> move fully to TimescaleDB+Grafana.
>>>
>>> 2) for monitoring the ongoing work, there are basic stats exposed in
>>> Redis database 3 (see changelog:
>>> https://docs.intelmq.org/develop/changelog/?h=statistics#configurations).
>>> I think this feature isn't well documented. It's not perfect, but I use
>>> it to keep an eye on the botnet & failures, using custom scripts to
>>> integrate it with CheckMK monitoring and alert on trouble.
>>>
>>> Best regards
>>>
>>> // Kamil Mańkowski <mankowski at cert.at> - T: +43 676 898 298 7204
>>> // CERT Austria - https://www.cert.at/
>>> // CERT.at GmbH, FB-Nr. 561772k, HG Wien
>>>
>>> On 11/27/23 11:51, L. Aaron Kaplan wrote:
>>>> Hi,
>>>>
>>>>
>>>> Most users of intelmq use an "eventsDB" as an output. From there, it is usually quite doable to do stats on top of events.
>>>>
>>>> I did an initial version for CERT.at back then which is still here: https://github.com/certtools/stats-portal
>>>> You can build on top of this if it suits you.
>>>>
>>>> Hope it helps,
>>>> Aaron.
>>>>      
>>>>
>>>>> On 27.11.2023, at 11:37, Homma, L.J. (Luitzen) via IntelMQ-dev <intelmq-dev at lists.cert.at> wrote:
>>>>>
>>>>> Dear IntelMQ Developers & Users,
>>>>>      
>>>>> We are curious if there are any plans on the roadmap to incorporate statistical features into IntelMQ. About 1.5 years ago, we participated in an online session where it was mentioned that there were some early plans to integrate stats into IntelMQ. As far as we could find, there have not been any steps in this direction. Are we correct?
>>>>>      
>>>>> Currently, we are working in our experimental environment to develop stats based on a Prometheus bot, using Prometheus as a time-series database, and utilizing Grafana for dashboarding and visualization. Are there more members of the community working on this? Our goal is to gain better insights into the input, filtering, and output of our IntelMQ pipeline(s). We hope to hear from others about their thoughts on this.
>>>>>      
>>>>>      
>>>>> Kind regards,
>>>>>      
>>>>> Luitzen Homma
>>>>>
>>>>> This message may contain information that is not intended for you. If you
>>>>> are not the addressee or if this message was sent to you by mistake, you
>>>>> are requested to inform the sender and delete the message.
>>>>> The State accepts no liability for damage of any kind resulting from the
>>>>> risks inherent in the electronic transmission of messages.
>>>>> _______________________________________________
>>>>> IntelMQ-dev mailing list
>>>>> https://lists.cert.at/cgi-bin/mailman/listinfo/intelmq-dev
>>>>> https://intelmq.readthedocs.io/