[Intelmq-dev] Proposal (Request For Comments) - IntelMQ with Run Modes & Process Management

Fri Apr 21 21:00:11 CEST 2017

Tomás, Sebastian

Thanks for sharing the proposals.
I would like to share some insights from running intelmq with roughly 70
feeds. I have frequently run into these problems and tried to solve them on
my own.
I have submitted PR #953 [1] based on my naive attempts. The script in it
converts collectors to systemd services. It is not production ready, but it
is still helpful.

There are some concerns about whether systemd is the right solution. I
believe it is. Some aspects of systemd are appealing and helpful. Running
the bots as the intelmq user is a breeze with the User and Group
directives. However, one of the biggest gains comes from the
RandomizedDelaySec directive. Let me explain why and how it helps. My VM
has about 3.5G RAM and I am running about 70 collectors + parsers and a
couple of experts. Every collector has its own interval, which is generally
one of 1 hr, 4 hrs, 6 hrs, 12 hrs or 24 hrs. When the top of the hour
approaches, all the collectors that are due start at once, and since
collectors keep the collected data as a single message in memory, the
machine runs out of memory (OOM), as some feeds have large datasets. With
RandomizedDelaySec, systemd spreads the executions over a period, thus
preventing this sudden rush for memory. This was very helpful.
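
To make the RandomizedDelaySec point concrete, here is a minimal Python
sketch that renders a service/timer unit pair for one collector. This is not
the script from PR #953. The function name, the unit descriptions, the hourly
schedule, the 15 minute delay and the command passed as exec_start are all
assumptions for illustration. User=, Group= and RandomizedDelaySec= are the
directives mentioned above; OnCalendar= is added here only so the timer has
a schedule.

```python
def render_units(bot_id, exec_start, on_calendar="hourly",
                 randomized_delay_sec="15min",
                 user="intelmq", group="intelmq"):
    """Return (service_text, timer_text) for one collector bot.

    exec_start is whatever command runs the collector once and exits;
    it is supplied by the caller and is an assumption of this sketch.
    """
    service = "\n".join([
        "[Unit]",
        f"Description=IntelMQ collector {bot_id}",
        "",
        "[Service]",
        "Type=oneshot",          # run once per timer activation, then exit
        f"User={user}",          # drop privileges to the intelmq user
        f"Group={group}",
        f"ExecStart={exec_start}",
        "",
    ])
    timer = "\n".join([
        "[Unit]",
        f"Description=Schedule for IntelMQ collector {bot_id}",
        "",
        "[Timer]",
        f"OnCalendar={on_calendar}",
        # Each activation is delayed by a random amount between 0 and this
        # value, so collectors sharing a schedule do not start together.
        f"RandomizedDelaySec={randomized_delay_sec}",
        "",
        "[Install]",
        "WantedBy=timers.target",
        "",
    ])
    return service, timer


if __name__ == "__main__":
    # "/path/to/run-collector-once" is a placeholder, not a real command.
    svc, tmr = render_units("example-collector",
                            "/path/to/run-collector-once example-collector")
    print(svc)
    print(tmr)
```

Written to two files with the same base name (one ending in .service, one in
.timer), the timer activates the matching service by default.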

I understand that I am about to expand the discussion here, however I feel
it is a connected issue. There should be a way to prevent running multiple
instances of a bot with the same id. As I see it, collectors and parsers,
though different, are tightly coupled. There is no point in keeping the
parser running in memory while the collector is not running and the parser
queue is empty. If you go through my commits on the PR, you will see that I
tried to do this by finding directly connected bots which have a single
input and a single output in the chain. The idea is, for each collector, to
find all the bots which are directly connected, i.e. have a single input
and a single output, starting from the collector. All these bots can be
treated as a single unit, because they run after the collector, not
necessarily after one another. The collector is then run from a systemd
timer and service, and after the collector has finished, all these bots are
started. However, I discovered that multiple instances of a bot could end
up running, creating problems.
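
The chain lookup described above can be sketched roughly as follows. This is
a simplified illustration, not the code from the PR. It assumes the pipeline
is available as a dict mapping each bot id to a "source-queue" and a list of
"destination-queues"; the function name linear_chain and the bot ids in the
example are made up. A bot joins the chain only if the queue in front of it
has exactly one writer and one reader, and the bot itself has exactly one
destination queue, which is my literal reading of "single input and single
output".

```python
def linear_chain(pipeline, collector_id):
    """Return the bots that follow collector_id through 1:1 links only."""
    # Index every queue by the bots that read from it and write to it.
    readers, writers = {}, {}
    for bot_id, conf in pipeline.items():
        source = conf.get("source-queue")
        if source:
            readers.setdefault(source, []).append(bot_id)
        for destination in conf.get("destination-queues", []):
            writers.setdefault(destination, []).append(bot_id)

    chain = []
    current = collector_id
    while True:
        destinations = pipeline.get(current, {}).get("destination-queues", [])
        if len(destinations) != 1:
            break  # current bot fans out or ends the pipeline
        queue = destinations[0]
        if len(writers.get(queue, [])) != 1 or len(readers.get(queue, [])) != 1:
            break  # queue is shared with other bots
        candidate = readers[queue][0]
        if candidate == collector_id or candidate in chain:
            break  # guard against loops in the configuration
        if len(pipeline[candidate].get("destination-queues", [])) != 1:
            break  # candidate does not have a single output
        chain.append(candidate)
        current = candidate
    return chain


EXAMPLE_PIPELINE = {
    "feed-a-collector": {"destination-queues": ["feed-a-parser-queue"]},
    "feed-a-parser": {"source-queue": "feed-a-parser-queue",
                      "destination-queues": ["dedup-expert-queue"]},
    "feed-b-collector": {"destination-queues": ["feed-b-parser-queue"]},
    "feed-b-parser": {"source-queue": "feed-b-parser-queue",
                      "destination-queues": ["dedup-expert-queue"]},
    "dedup-expert": {"source-queue": "dedup-expert-queue",
                     "destination-queues": ["file-output-queue"]},
    "file-output": {"source-queue": "file-output-queue"},
}

if __name__ == "__main__":
    print(linear_chain(EXAMPLE_PIPELINE, "feed-a-collector"))
```

In this example the chain for feed-a-collector stops at the parser, because
the deduplicator's queue is written to by both parsers.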

Another thing which might be worth discussing: collectors should have a
flag to save the collected input to a file, and the parser could then pick
its input from either the queue or the file. This will help in cases where
the input size is relatively large, e.g. blueliv or alienvault (when
subscribed to a lot of pulses; reminds me I need to submit a PR for this
enhancement). Maybe some enhancements to the fileinput/fileoutput bots can
do that, I haven't really explored it, however an integrated approach would
be much better, imo.
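
A rough sketch of what such a flag could look like, under stated
assumptions: messages are plain dicts, the size threshold and the spool
directory are configurable, and the "raw_file" key is a hypothetical field
invented for this example (it is not an existing intelmq field). Small
payloads travel in the message as before; large ones are written to disk and
only the path goes through the queue.

```python
import os
import tempfile


def pack_report(raw, spool_dir, threshold=1_000_000):
    """Collector side: keep small payloads inline, spill large ones to disk."""
    if len(raw) <= threshold:
        return {"raw": raw}
    # mkstemp creates a uniquely named file, so concurrent runs do not clash.
    fd, path = tempfile.mkstemp(dir=spool_dir, suffix=".report")
    with os.fdopen(fd, "wb") as handle:
        handle.write(raw)
    return {"raw_file": path, "size": len(raw)}


def unpack_report(message, delete=True):
    """Parser side: read the payload from the message or from the file."""
    if "raw" in message:
        return message["raw"]
    path = message["raw_file"]
    with open(path, "rb") as handle:
        data = handle.read()
    if delete:
        os.remove(path)  # the spooled file is only needed once
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spool:
        message = pack_report(b"x" * 100, spool, threshold=10)
        print(sorted(message))
        print(len(unpack_report(message)))
```

Note that unpack_report still reads the whole file into memory; a parser that
streams the file line by line would be the next step, but that depends on the
feed format.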

The following is unrelated to the proposal at hand. However, in the
interest of creating a scalable and stable intelmq deployment, I see some
more hurdles, which I am not expounding upon here since they are not really
related to the proposal. At the same time, expanding the topic towards a
scalability discussion is worthwhile. These can of course be revisited and
discussed in detail at some later stage.
a. Replace redis as the queue with something persistent. At present redis
uses a lot of memory since it keeps the events in memory. If your feeds are
getting data frequently and you have a slow processing expert in the chain,
the queue size keeps growing and so does the redis memory usage.
b. Processing of multiple events by a single bot. This has been discussed a
lot in issues and on the mailing lists. I have an implementation using
gevent [2]. There are problems with this, but those are trade-offs I am
fine with. c & d might help to resolve these issues.
c. Events should have IDs. This will help in acknowledging the correct
message in case of multi-processing as in b.
d. Bots should be able to peek at the message count in the source queue.
This will help with b. as well as with the backoff algorithm discussed in
other places, iirc Sebastian proposed it in some github issues. This is
really simple; I had written the peek function, however I cannot locate it
as of now.
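
For point d, a peek could look roughly like this. It is a sketch, not the
function I had written. It assumes the source queue is a redis list, so its
length is available through the client's llen() call (redis-py provides
Redis.llen(name)). The FakeQueueClient class only exists so the example runs
without a redis server, and next_poll_delay is a made-up doubling policy to
show how the count could feed a backoff; it is not the algorithm proposed in
the github issues.

```python
class FakeQueueClient:
    """Stand-in for a redis client; only llen() is implemented."""

    def __init__(self, queues):
        self._queues = queues

    def llen(self, name):
        return len(self._queues.get(name, []))


def peek_count(client, queue_name):
    """Return the number of pending messages without removing any."""
    return int(client.llen(queue_name))


def next_poll_delay(previous_delay, pending, base=1.0, cap=60.0):
    """Seconds to wait before the next poll (illustrative policy only)."""
    if pending > 0:
        return 0.0  # work is waiting, process it right away
    # Queue is empty: double the previous wait, within [base, cap].
    return min(cap, max(base, previous_delay * 2))


if __name__ == "__main__":
    client = FakeQueueClient({"example-parser-queue": [b"m1", b"m2"]})
    pending = peek_count(client, "example-parser-queue")
    print(pending, next_poll_delay(4.0, pending))
```

With a real deployment the fake client would be replaced by a redis client
connected to the pipeline database.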

-N

[1] https://github.com/certtools/intelmq/pull/953
[2] https://github.com/navtej/intelmq/blob/gevent/intelmq/bots/experts/gethostbyname/expert.py

On Thu, Apr 20, 2017 at 1:52 PM, Bernhard Reiter <bernhard at intevation.de>
wrote:

> Hi Tomás,
>
> thanks for preparing a proposal!
>
> I've just completed my first reading in the last hour.
> Before sending feedback later today, a quick remark:
>
> > https://github.com/SYNchroACK/intelmq/blob/proposal/docs/proposal.md
>
> Right now I cannot access the two architecture diagrams.
> (Error message: "Error Fetching Resource").
>
> Best Regards,
> Bernhard
>
> --
> www.intevation.de/~bernhard   +49 541 33 508 3-3
> Intevation GmbH, Osnabrück, DE; Amtsgericht Osnabrück, HRB 18998
> Geschäftsführer Frank Koormann, Bernhard Reiter, Dr. Jan-Oliver Wagner
>