Tomás, Sebastian

Thanks for sharing the proposals.
I would like to share some insights from running intelmq with roughly 70 feeds. I have frequently run into these problems and tried to solve them on my own.
I have submitted PR #953 [1] based on my naive attempts. The script converts collectors to systemd services. It is not production ready, but it is still helpful.

There are some concerns about whether systemd is the right solution. I believe it is. Several aspects of systemd are appealing and helpful. Running the bots as the intelmq user is a breeze with the User and Group directives.
However, one of the biggest gains comes from the RandomizedDelaySec directive. Let me explain why and how it helps. My VM has about 3.5G RAM and I am running about 70 collectors + parsers and a couple of experts. Every collector has its own interval, which is generally one of 1, 4, 6, 12 or 24 hours. When the top of the hour approaches, all the collectors due at that time start at once. Since a collector keeps the collected data in memory as a single message and some feeds have large datasets, the machine runs out of memory (OOM). With RandomizedDelaySec, systemd spreads the executions over a period, thus preventing this sudden rush for memory. This was very helpful.
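
To make this concrete, here is a rough sketch (not the script from the PR) that renders a service/timer pair for one collector. User, Group, OnCalendar and RandomizedDelaySec are real systemd directives; the function name, the default values and the exec_start command are my own assumptions for illustration. It also assumes the collector can be started so that it collects once and exits.

```python
from textwrap import dedent


def render_units(bot_id, exec_start, interval="hourly", max_delay_sec=900,
                 user="intelmq", group="intelmq"):
    """Return (service_text, timer_text) for one collector.

    exec_start is whatever command runs the collector once; it is
    supplied by the caller because it depends on the installation.
    """
    service = dedent(f"""\
        [Unit]
        Description=IntelMQ collector {bot_id}

        [Service]
        Type=oneshot
        User={user}
        Group={group}
        ExecStart={exec_start}
        """)
    # RandomizedDelaySec delays each activation by a random amount
    # between 0 and max_delay_sec, so collectors sharing the same
    # schedule do not all start in the same second.
    timer = dedent(f"""\
        [Unit]
        Description=Timer for IntelMQ collector {bot_id}

        [Timer]
        OnCalendar={interval}
        RandomizedDelaySec={max_delay_sec}

        [Install]
        WantedBy=timers.target
        """)
    return service, timer
```

Calling render_units("my-collector", "/path/to/run-once my-collector") returns the two unit texts, which could then be written to my-collector.service and my-collector.timer. The path here is a placeholder, not a real intelmq command.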

I understand that I am about to expand the discussion here, but I feel it is a connected issue. There should be a way to prevent running multiple instances of a bot with the same id. As I see it, collectors and parsers, though different, are tightly coupled. There is no point in keeping the parser running in memory while the collector is not running and the parser queue is empty. If you go through my commits on the PR, you will see that I tried to do this by finding directly connected bots which have a single input and a single output in the chain. The idea is: for each collector, find all the bots which are directly connected, i.e. have a single input and a single output, starting from the collector. All these bots can be treated as a single unit, because they run after the collector, not necessarily after one another. Now run the collector from a systemd timer and service. After the collector has finished, start all these bots. However, I discovered that multiple instances of a bot could end up running, which creates problems.
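
The chain search described above could look roughly like the sketch below. It is a simplified illustration, not the code from the PR. It assumes the pipeline is available as a dict keyed by bot id, with "source-queue" and "destination-queues" entries per bot; the function name and the strict reading of "single input and single output" (exactly one writer into the bot's queue, exactly one destination queue) are my assumptions.

```python
def linear_chain(pipeline, collector_id):
    """Return the bots that form a 1-in/1-out chain behind a collector."""
    # Count how many bots write into each queue.
    writers = {}
    for conf in pipeline.values():
        for queue in conf.get("destination-queues", []):
            writers[queue] = writers.get(queue, 0) + 1
    # Map each source queue to the bot that reads from it.
    reader_of = {conf["source-queue"]: bot
                 for bot, conf in pipeline.items() if "source-queue" in conf}

    chain = []
    current = collector_id
    while True:
        outputs = pipeline[current].get("destination-queues", [])
        if len(outputs) != 1:
            break  # fan-out or no output: the unit ends here
        queue = outputs[0]
        nxt = reader_of.get(queue)
        if nxt is None or nxt == collector_id or nxt in chain:
            break  # dangling queue or a loop
        if writers[queue] != 1:
            break  # the next bot is shared with other feeds
        if len(pipeline[nxt].get("destination-queues", [])) != 1:
            break  # the next bot does not have a single output
        chain.append(nxt)
        current = nxt
    return chain
```

With a collector feeding one parser, and that parser feeding a deduplicator shared with other parsers, the result contains only the parser. The collector plus the returned bots are then the unit that the timer would start.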

Another thing which might be worth discussing: collectors should have a flag to save the collected input to a file, and the parser could then pick it up from either the queue or the file. This will help in cases where the input size is relatively large, e.g. blueliv or alienvault (when subscribed to a lot of pulses; reminds me that I need to submit a PR for this enhancement). Maybe some enhancements to the fileinput/fileoutput bots can do that. I haven't really explored it; however, an integrated approach would be much better, imo.
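
As a rough illustration of the flag idea, the sketch below keeps small payloads inline in the queue message and writes large ones to a spool file, passing only the path through the queue. Nothing here is existing intelmq API: the function names, the message layout and the size threshold are all hypothetical.

```python
import base64
import json
import os
import tempfile


def pack_report(raw, spool_dir, max_inline_bytes=1_000_000):
    """Collector side: return a small JSON message for the queue."""
    if len(raw) <= max_inline_bytes:
        encoded = base64.b64encode(raw).decode("ascii")
        return json.dumps({"inline": encoded})
    # Large payload: write it to disk and only queue the file path.
    fd, path = tempfile.mkstemp(dir=spool_dir, suffix=".report")
    with os.fdopen(fd, "wb") as handle:
        handle.write(raw)
    return json.dumps({"file": path})


def unpack_report(message):
    """Parser side: return the raw payload from the queue or the file."""
    data = json.loads(message)
    if "inline" in data:
        return base64.b64decode(data["inline"])
    with open(data["file"], "rb") as handle:
        raw = handle.read()
    os.remove(data["file"])  # the spool file is consumed exactly once
    return raw
```

This keeps the large payload out of the queue. A real parser would probably read the spool file line by line instead of loading it in one go, which this sketch does not do.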

The following is unrelated to the proposal at hand. However, in the interest of creating a scalable and stable intelmq deployment, I see some more hurdles, which I am not expounding upon here, since they are not really related to the proposal. At the same time, expanding the topic towards a scalability discussion is worthwhile. These can of course be revisited and discussed in detail at some later stage:
a. Replace redis as the queue with something persistent. At present redis uses a lot of memory, since it keeps the events in memory. If your feeds are getting data frequently and you have a slow-processing expert in the chain, the queue size keeps growing and so does the redis memory usage.
b. Processing of multiple events by a single bot. This has been discussed a lot in issues and on the mailing lists. I have an implementation using gevent [2]. There are problems with it, but I am fine with those trade-offs. c & d might help to resolve these issues.
c. Events should have IDs. This will help in acknowledging the correct message when multiple events are processed at once, as in b.
d. Bots should be able to peek at the message count in the source queue. This will help with b. as well as with the backoff algorithm discussed elsewhere; iirc Sebastian proposed it in some github issues. This is really simple. I had written the peek function, however I cannot locate it right now.
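
For d., a minimal sketch of what such a peek could look like, assuming the source queue is a Redis list and the client object offers llen() as redis-py does. This is not the function I had written. The backoff rule shown (double the sleep while the queue stays empty, up to a cap) is only my assumption of what such an algorithm could be, not what was proposed in the github issues.

```python
def queue_length(client, queue_name):
    """Number of messages waiting in a Redis list used as a queue."""
    return client.llen(queue_name)


def next_sleep(pending, previous_sleep, base=1.0, cap=60.0):
    """Return how long a bot should sleep before checking the queue again.

    No sleep when messages are pending; otherwise start at `base`
    seconds and double on every empty check, up to `cap` seconds.
    """
    if pending > 0:
        return 0.0
    if previous_sleep <= 0:
        return base
    return min(previous_sleep * 2, cap)
```

A bot loop would call queue_length() on its source queue and feed the result, together with its last sleep value, into next_sleep(). The same count could also be used to decide how many events to pick up at once for b.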

-N

[1] https://github.com/certtools/intelmq/pull/953
[2] https://github.com/navtej/intelmq/blob/gevent/intelmq/bots/experts/gethostbyname/expert.py

On Thu, Apr 20, 2017 at 1:52 PM, Bernhard Reiter <bernhard@intevation.de> wrote:
Hi Tomás,

thanks for preparing a proposal!

I've just completed my first reading in the last hour.
Before sending feedback later today, a quick remark:

> https://github.com/SYNchroACK/intelmq/blob/proposal/docs/proposal.md

Right now I cannot access the two architecture diagrams.
(Error message: "Error Fetching Resource").

Best Regards,
Bernhard

--
www.intevation.de/~bernhard   +49 541 33 508 3-3
Intevation GmbH, Osnabrück, DE; Amtsgericht Osnabrück, HRB 18998
Geschäftsführer Frank Koormann, Bernhard Reiter, Dr. Jan-Oliver Wagner