Hi IntelMQ-ML,
Before opening the discussion, I think it is important to write a few notes:
1) The topic we are opening for discussion has already been discussed between me, Aaron and Sebastian in order to understand the impact of the possible changes, the effort required, the complexity, the overall perspective of how it should be implemented, etc. Each of us now has an idea and perspective about it, and it is crucial to have the community involved from now on in order to agree on the way to proceed.
2) The proposal shared here reflects my own perspective but is NOT only my work, because a lot of the structure and technical details were only possible with Sebastian's and Aaron's contributions (thank you Aaron and Sebastian), even if on some specific details they might see room to do things in another/better way. This thread will be a good place to hear everyone's perspective. :)
3) IMPORTANT: This proposal is just a proposal and does NOT mean that it will be implemented this way... it is only a basis for discussion, if it helps.
About the Proposal: --------------------------- The proposal is available at the following link and tries to be clear for readers, although it may (by mistake) hide some details that will raise questions.
Proposal: https://github.com/SYNchroACK/intelmq/blob/proposal/docs/proposal.md
There are two main reasons (I guess) why we are starting this discussion:
1) There is a need to configure bots to execute only at specific times; therefore, it seems there is a requirement to configure a bot in different run modes, in this proposal: scheduled and continuous (see the proposal for more details).
2) IntelMQ is now being used by multiple teams and requires stability during execution, etc... it seems there is a need for integration with tools like systemd.
Please, if you have time, read the proposal and write your thoughts to the mailing list, split into "What you like" and "What you don't like".
I hope I haven't forgotten to mention something important... :)
Thank you in advance, Regards
Hi Tomás,
thanks for preparing a proposal!
I've just completed my first reading in the last hour. Before sending feedback later today, a quick remark:
https://github.com/SYNchroACK/intelmq/blob/proposal/docs/proposal.md
Right now I cannot access the two architecture diagrams. (Error message: "Error Fechting Resource").
Best Regards, Bernhard
Seems github's cache is broken. The original URLs work fine for me:
https://s9.postimg.org/9s0bne4n3/intelmq-bots-management-with-systemd.png https://s11.postimg.org/lnzdjslrn/intelmq_bots_management_with_pid.png
Sebastian
Hello! :)
I uploaded the images to the repo and updated the links in the proposal. I think it is fine now. :)
Thank you for the heads-up. Cheers
Tomás, Sebastian
Thanks for sharing the proposals. I would like to share some insights from working with intelmq with roughly 70 feeds. I have frequently run into these problems and have tried to solve them on my own. I submitted PR #953 [1] based on my naive attempts. This script converts collectors to systemd services; it is not production ready, but it is still helpful.
There are some concerns about whether systemd is the right solution. I believe it is. Some aspects of systemd are appealing and helpful. Running the bots as the intelmq user is a breeze with the User and Group directives. However, one of the biggest gains is the RandomizedDelaySec directive. Let me explain why and how it helps. My VM has about 3.5G RAM and I am running about 70 collectors + parsers and a couple of experts. Every collector has its own interval, which is generally one of 1 hr, 4 hrs, 6 hrs, 12 hrs or 24 hours. When the hour comes around, all the collectors start at once, and since collectors keep the collected data as a single message in memory, the machine runs out of memory (OOM), as some feeds have large datasets. With RandomizedDelaySec, systemd spreads the execution over a period, preventing this sudden rush for memory. This was very helpful.
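For illustration, this is roughly what such a service/timer pairing could look like. The unit names, paths and the intelmqctl invocation are assumptions made for this sketch, not what PR #953 actually generates:

    # intelmq-collector@.service -- sketch only, not the output of PR #953
    [Unit]
    Description=IntelMQ collector %i (single run)

    [Service]
    Type=oneshot
    # run under the unprivileged intelmq user/group
    User=intelmq
    Group=intelmq
    # assumed invocation; adjust to however the collector is actually started
    ExecStart=/usr/local/bin/intelmqctl run %i

    # intelmq-collector@.timer -- schedules the service above
    [Unit]
    Description=Run IntelMQ collector %i periodically

    [Timer]
    OnCalendar=hourly
    # spread starts over up to 15 minutes so not all collectors fire at once
    RandomizedDelaySec=900
    Persistent=true

    [Install]
    WantedBy=timers.target

A per-collector timer could then be enabled with something like `systemctl enable --now intelmq-collector@<bot-id>.timer`.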
I understand that I am about to expand the discussion here, but I feel it is a connected issue. There should be a way to prevent running multiple instances of a bot with the same id. As I see it, collectors and parsers, though different, are tightly coupled. There is no point in keeping the parser running in memory while the collector is not running and the parser queue is empty. If you go through my commits on the PR, you will see that I tried to do this by finding directly connected bots which have a single input and a single output in the chain. The idea is: for each collector, find all the bots which are directly connected, i.e. single input and single output, starting from the collector. All these bots can be treated as a single unit, because they run after the collector, not necessarily after one another. Now run the collector from a systemd timer and service. After the collector is finished, start all these bots. However, I discovered that multiple instances of a bot could run, creating problems.
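One simple way to rule out a second instance of the same bot id would be an exclusive lock file per bot id. This is only a sketch; the lock directory and the helper name are invented for the example and are not existing intelmq code:

    # sketch: refuse to start when another process already holds the lock
    # for this bot id; lock path and helper name are invented for the example
    import fcntl
    import os
    import sys

    def acquire_single_instance_lock(bot_id, lock_dir="/var/run/intelmq"):
        # return an open lock file handle, or exit if the bot id already runs
        os.makedirs(lock_dir, exist_ok=True)
        handle = open(os.path.join(lock_dir, bot_id + ".lock"), "w")
        try:
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            sys.exit("bot %r seems to be running already, not starting twice" % bot_id)
        handle.write(str(os.getpid()))
        handle.flush()
        return handle  # keep the handle open for the lifetime of the bot process

    if __name__ == "__main__":
        lock = acquire_single_instance_lock("spamhaus-drop-collector")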
Another thing which might be worth discussing: collectors should have a flag to save the collected input to a file, and the parser could then pick it up from either the queue or the file. This would help in cases where the input size is relatively large, e.g. blueliv or alienvault (subscribed to a lot of pulses; reminds me I need to submit a PR for this enhancement). Maybe some enhancements to the fileinput/fileoutput bots can do that, I haven't really explored it, but an integrated approach would be much better, imo.
The following is unrelated to the proposal at hand; however, in the interest of creating a scalable and stable intelmq deployment, I see some more hurdles, which I am not expounding upon since they are not really related to the proposal. At the same time, expanding the topic towards a scalability discussion is worthwhile. These can of course be revisited and discussed in detail at some later stage.
a. Replace redis as the queue with something persistent. At present redis uses a lot of memory, since it keeps the events in memory. If your feeds are getting data frequently and, in the chain, you have a slow processing expert, the queue size keeps growing and so does the redis memory usage.
b. Multiple events processed by a single bot. This has been discussed a lot in issues and mailing lists. I have an implementation using gevent [2]. There are problems with this, but I am fine with those trade-offs. c. and d. might help to resolve these issues.
c. Events should have IDs. This will help in acknowledging the correct message in case of multiprocessing wrt b.
d. Bots should be able to peek at the message count in the source queue. This will help with b. as well as with the backoff algorithm discussed elsewhere (iirc Sebastian proposed it in some github issues). This is really simple; I had written the peek function but I cannot locate it as of now (see the sketch below).
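For point d., a minimal stand-in for such a peek against the default Redis pipeline back-end could look like this; the queue name and database number are assumptions that depend on the local pipeline configuration:

    # sketch for d.: peek at the number of pending messages in a source
    # queue via Redis LLEN; queue name and db number depend on the setup
    import redis

    def source_queue_length(queue_name, host="localhost", port=6379, db=2):
        return redis.Redis(host=host, port=port, db=db).llen(queue_name)

    if __name__ == "__main__":
        pending = source_queue_length("gethostbyname-1-expert-queue")
        if pending > 10000:
            print("queue is filling up, a collector could back off here")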
-N
[1] https://github.com/certtools/intelmq/pull/953 [2] https://github.com/navtej/intelmq/blob/gevent/intelmq/bots/experts/gethostby...
Hi Navtej,
On Friday, 21 April 2017 21:00:11, Navtej Singh wrote:
I would like to share some insights from working with intelmq with roughly 70 feeds. I have frequently run into these problems and tried to solve these on my own.
thanks for adding your experiences and approaches. I believe in coming up with a number of ideas, trying some and then finding a good solution, so it is good to see your approaches.
There are some concerns about whether systemd is the right solution. I believe it is. Some aspects of systemd are appealing and helpful. Running the bots as the intelmq user is a breeze with the User and Group directives. However, one of the biggest gains is the RandomizedDelaySec directive.
If we had a process manager that knows how the bots are wired, it could just queue some one-time collectors behind each other if the insertion point before the experts is already loaded. So I don't think this is coupled to systemd in particular, though RandomizedDelaySec sounds interesting for some simple use cases.
I understand that I am about to expand the discussion here, but I feel it is a connected issue. There should be a way to prevent running multiple instances of a bot with the same id. As I see it, collectors and parsers, though different, are tightly coupled.
To me this sounds like a use case that should be considered in this discussion. See my other post (a few minutes ago) where I explain why I consider this kind of "flow control" relevant to your example.
a. Replace redis as the queue with something persistent. At present redis uses a lot of memory, since it keeps the events in memory. If your feeds are getting data frequently and, in the chain, you have a slow processing expert, the queue size keeps growing and so does the redis memory usage.
I also consider this a "flow control" issue: stop inserting stuff if the downstream pipe is full, which technically could mean that redis has used up the configured memory.
b. Multiple events processed by a single bot. This has been discussed a lot in issues and mailing lists. I have an implementation using gevent [2]. There are problems with this, but I am fine with those trade-offs. c. and d. might help to resolve these issues.
Can you point me to a more elaborate outline of the problem? (I always thought that a bot can already process several events, but you mean per network event?)
c. Events should have IDs. This will help in acknowledging the correct message in case of multiprocessing wrt b.
My mental model tells me that if the information about an abuse sighting is the same, it shall be the "same" for intelmq, so an ID wouldn't help. Somehow intelmq must record the contents of the "events" and deduplicate anyway.
d. Bots should be able to peek at the message count in the source queue. This will help with b. as well as with the backoff algorithm discussed elsewhere (iirc Sebastian proposed it in some github issues). This is really simple; I had written the peek function but I cannot locate it as of now.
This sounds like the bots implementing some "flow control" themselves. From a design perspective I think the bot shall know and somehow register what it wants to do or handle; however, from my perspective the control seems feasible from an oversight process.
Best Regards, Bernhard
b. Multiple events processed by a single bot. This has been discussed a lot in issues and mailing lists. I have an implementation using gevent [2]. There are problems with this, but I am fine with those trade-offs. c. and d. might help to resolve these issues.
Can you point me to a more elaborate outline of the problem? (I always thought that a bot can already process several events, but you mean per network event?)
I meant that bots should be able to process messages in parallel. At present a single bot processes messages linearly. However, that is too slow for expert bots which query external services, e.g. gethostbyname. The throughput of such bots can be increased if the expert bot can process multiple events in parallel. My implementation of this uses gevent-based green threads.
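Roughly the shape of that approach, as a stand-alone illustration; this is not the code from [2], and resolve() as well as the sample events are stand-ins for the real bot logic:

    # stand-alone illustration of gevent-based parallel lookups in an expert;
    # resolve() and the sample events are stand-ins for the real bot logic
    from gevent import monkey
    monkey.patch_all()  # make blocking socket calls cooperative

    import socket
    from gevent.pool import Pool

    def resolve(event):
        try:
            event["destination.ip"] = socket.gethostbyname(event["destination.fqdn"])
        except socket.gaierror:
            pass
        return event

    events = [{"destination.fqdn": name}
              for name in ("a.example.org", "b.example.org", "c.example.org")]
    pool = Pool(size=20)                  # at most 20 lookups in flight
    results = pool.map(resolve, events)   # results keep the input order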
c. Events should have IDs. This will help in acknowledging the correct message in case of multiprocessing wrt b.
My mental model tells me that if the information about an abuse sighting is the same, it shall be the "same" for intelmq, so an ID wouldn't help. Somehow intelmq must record the contents of the "events" and deduplicate anyway.
This needs a bit of explanation: the current implementation of acknowledge for the redis back-end uses rpop. It indiscriminately picks the rightmost event and acknowledges it. In multiprocessing mode this is undesirable, because threads can return in a non-linear fashion. Example: let us assume that the following five hostnames are to be resolved in the -internal queue and we spawn five threads, in order, with a.in being the rightmost:
a.in goes to thread1
b.in goes to thread2
c.in goes to thread3
d.in goes to thread4
e.in goes to thread5
Now if thread2 returns first, it will end up acknowledging a.in instead of b.in, and at the end we will have e.in remaining in the -internal queue even though it was processed successfully.
If we don't want an ID, something else has to be there to acknowledge the correct message in a multiprocessing environment.
The load_balance option does not scale. I think gevent/asyncio are probably the best options at present.
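One way to acknowledge the right message without introducing IDs would be to remove the exact payload from the internal queue with LREM instead of a blind RPOP. The queue name below is illustrative, and identical payloads would still collide, which is where IDs would come in:

    # sketch: acknowledge exactly the message a thread finished with,
    # instead of RPOPing whatever happens to be rightmost; the queue name
    # is illustrative and duplicate payloads would still need an ID
    import redis

    conn = redis.Redis(db=2)
    INTERNAL_QUEUE = "gethostbyname-1-expert-internal"

    def acknowledge(raw_message):
        removed = conn.lrem(INTERNAL_QUEUE, 1, raw_message)
        if removed == 0:
            raise RuntimeError("processed message not found in internal queue")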
Hi,
Some comments:
(I'm not fully up to date on IntelMQ internals, so I might be off.)
On 21.04.2017 21:00, Navtej Singh wrote:
With RandomizedDelaySec, systemd spreads the execution over a period, preventing this sudden rush for memory. This was very helpful.
I would be wary about relying on randomization. Random numbers have the property that every now and then they are all identical.
So I'd consider that to be more of a CPU load distribution and not a fix for the RAM usage.
Another thing which might be worth discussing: collectors should have a flag to save the collected input to a file, and the parser could then pick it up from either the queue or the file. This would help in cases where the input size is relatively large, e.g. blueliv or alienvault (subscribed to a lot of pulses; reminds me I need to submit a PR for this enhancement). Maybe some enhancements to the fileinput/fileoutput bots can do that, I haven't really explored it, but an integrated approach would be much better, imo.
IMHO there are multiple issues:
a) how to pass huge amounts of data between bots
b) how to process larger data-sets
Ad a)
Yes, passing a reference to a file (filename?) instead of the content of the file is one option. It may well be that using a different message-passing backend (e.g. RabbitMQ) would also solve the issue.
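As a sketch of the file-reference option: the collector stores the raw report on disk and only hands a path to the parser. The spool directory and the "extra.report_path" field are invented for this example and are not an existing IntelMQ convention:

    # sketch: a collector stores the raw report on disk and only hands a
    # reference to the parser; field name and spool dir are invented here
    import json
    import os
    import tempfile

    SPOOL_DIR = "/var/spool/intelmq"

    def store_report(raw_bytes):
        os.makedirs(SPOOL_DIR, exist_ok=True)
        fd, path = tempfile.mkstemp(prefix="report-", dir=SPOOL_DIR)
        with os.fdopen(fd, "wb") as handle:
            handle.write(raw_bytes)
        return path

    def build_message(raw_bytes):
        # the parser would open the referenced file instead of a huge 'raw' field
        return json.dumps({"feed.name": "example-feed",
                           "extra.report_path": store_report(raw_bytes)})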
Ad b)
IMHO much more tricky is the issue of actually processing huge data-sets. Once you reach file-sizes in the GB range one needs to switch from "load everything into a data-structure in RAM, then process it" to a "load next few KB from a data-stream, process it, then get next slice".
My worry is that the current bot API cannot be easily converted to stream processing.
We need to think this through.
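To make the needed shift concrete, here is a sketch of the "load the next slice" style for a large line-based report; this is not the current bot API, just the shape a streaming interface could take:

    # sketch of stream processing for large line-based reports; not the
    # current bot API, just the shape a streaming interface could take
    import csv
    import gzip

    def iter_report_rows(path, chunk_lines=1000):
        # yield lists of parsed CSV rows without holding the whole file in RAM
        open_fn = gzip.open if path.endswith(".gz") else open
        with open_fn(path, "rt", newline="") as handle:
            chunk = []
            for row in csv.reader(handle):
                chunk.append(row)
                if len(chunk) >= chunk_lines:
                    yield chunk
                    chunk = []
            if chunk:
                yield chunk

    # a parser loop would then look like:
    # for rows in iter_report_rows("/var/spool/intelmq/report.csv.gz"):
    #     handle_rows(rows)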
a. Replace redis as the queue with something persistent. At present redis uses a lot of memory, since it keeps the events in memory. If your feeds are getting data frequently and, in the chain, you have a slow processing expert, the queue size keeps growing and so does the redis memory usage.
Yes.
b. Multiple events processed by a single bot. This has been discussed a lot in issues and mailing lists. I have an implementation using gevent [2]. There are problems with this, but I am fine with those trade-offs. c. and d. might help to resolve these issues.
Yes, some experts would be a **lot** more efficient if they could do bulk processing.
c. Events should have IDs. This will help in acknowledging the correct message in case of multiprocessing wrt b.
Yes, but for a different reason: assume more CERTs doing IntelMQ-to-IntelMQ cross-connects. You need a way to avoid building forwarding loops. Persistent IDs can help (analogous to Message-IDs in the Usenet context).
(Btw, to continue the Usenet analogy: some sort of Path: header would also be helpful: a list of systems that this event has already passed through.)
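A sketch of that Path: idea on the event level; the field name "extra.path" and the system identifier are assumptions made for the example:

    # sketch of the Path: analogy: refuse to forward an event whose path
    # already contains our own system id; "extra.path" is an invented field
    OWN_SYSTEM_ID = "cert-example.intelmq"

    def should_forward(event):
        return OWN_SYSTEM_ID not in event.get("extra.path", [])

    def stamp(event):
        event.setdefault("extra.path", []).append(OWN_SYSTEM_ID)
        return event

    event = {"source.ip": "192.0.2.1", "extra.path": ["other-cert.intelmq"]}
    if should_forward(event):
        stamped = stamp(event)
        # ... hand 'stamped' to the output bot / the remote instance here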
otmar
Hi,
On Wednesday, 26 April 2017 17:24:03, Otmar Lendl wrote:
b) how to process larger data-sets
IMHO much more tricky is the issue of actually processing huge data-sets. Once you reach file-sizes in the GB range one needs to switch from "load everything into a data-structure in RAM, then process it" to a "load next few KB from a data-stream, process it, then get next slice".
note that there is code to split up line-based data, such as CSV, see https://github.com/certtools/intelmq/pull/680
c. Events should have IDs. This will help in acknowledging the correct message in case of multiprocessing wrt b.
Yes, but for a different reason: assume more CERTs doing IntelMQ-to-IntelMQ cross-connects. You need a way to avoid building forwarding loops. Persistent IDs can help (analogous to Message-IDs in the Usenet context).
The problem I see with this approach is that we do not have one origin of the information that could create a unique id for it. Let us say two observing systems notice the same "abuse event" on a machine somewhere and start processing it; they might create two different ids for the same event. Just checking the id later for duplicates would not help.
Or let us say a single event gets a UID, runs through two systems which process it slightly differently, and then ends up in one abuse-handling system via two different sources. It is the same UID then, but with different data details. Just rejecting the second incoming report on the basis of the UID would throw additional info away and does not seem to be enough.
This is why I still think that one system should have a (working code) definition of when it considers two events to be equal, and then apply it to each report for deduplication and prevention of forwarding loops.
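Such a working definition could be as small as a fingerprint over an agreed subset of fields; which fields belong in that subset is the actual policy decision, and the set below is only an example:

    # sketch: "two events are equal" expressed as a fingerprint over a
    # chosen subset of fields; the chosen fields are only an example
    import hashlib
    import json

    EQUALITY_FIELDS = ("source.ip", "classification.type", "time.source")

    def fingerprint(event):
        subset = {key: event.get(key) for key in EQUALITY_FIELDS}
        return hashlib.sha256(json.dumps(subset, sort_keys=True).encode()).hexdigest()

    seen = set()

    def is_duplicate(event):
        digest = fingerprint(event)
        if digest in seen:
            return True
        seen.add(digest)
        return False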
(Btw, to continue the Usenet analogy: some sort of Path: header would also be helpful: a list of systems that this event has already passed through.)
Email and news headers cannot be trusted much; what would we gain from that info?
Best, Bernhard
On 04/26/2017 05:24 PM, Otmar Lendl wrote:
Ad b) IMHO much more tricky is the issue of actually processing huge data-sets. Once you reach file-sizes in the GB range one needs to switch from "load everything into a data-structure in RAM, then process it" to a "load next few KB from a data-stream, process it, then get next slice".
My worry is that the current bot API cannot be easily converted to stream processing.
We need to think this through.
The ParserBot[1] uses generators (i.e. it processes one line after another), except for one detail: the Base64 decoding of `raw` - IMHO we should get rid of that anyway, it just blows up the size. Redis can handle the data without base64 just fine.
All parsers derived from ParserBot only override single methods; they all work in the same way. Note that not all parsers we have are converted to the ParserBot class yet, but that's nothing spectacular.
Sebastian
[1]: https://github.com/certtools/intelmq/blob/1.0.0.dev6/intelmq/lib/bot.py#L453
Dear Tomás, dear Intelmqers,
On Thursday, 23 March 2017 20:00:05, Tomás Lima wrote:
Proposal: https://github.com/SYNchroACK/intelmq/blob/proposal/docs/proposal.md
thanks to you, Sebastian and Aaron for working out a proposal and starting a discussion.
There are two main reasons (I guess) why we are starting this discussion:
- There is a need to configure bots to execute only at specific times; therefore, it seems there is a requirement to configure a bot in different run modes, in this proposal: scheduled and continuous (see the proposal for more details).
I agree that there are periodic tasks to be done. Examples include:
* periodic renewal of support data, e.g. importing abuse data from RIPE
* fetching feeds after a specific time of the day
Proposing several run modes like you did (startup, scheduled, one-shot, continuous) is one way of solving this need.
There are other options that should be considered as well. An alternative would be: each bot gets helper functions that allow it to be woken up on certain events, and the event types include a sleep until an alarm goes off. The time of the alarm can be set by the bot itself. This way only one run mode is necessary, as a bot can decide whether it still wants to configure something (like "on_boot"), run, or be woken up later at a certain time.
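Roughly how that could look from a bot author's point of view; the helper names are invented for the illustration and nothing like this exists in intelmq yet:

    # sketch of the single-run-mode idea: the bot sets its own next alarm;
    # the helper names are invented, nothing like this exists in intelmq yet
    import datetime
    import time

    class SelfSchedulingBot:
        def next_wakeup(self):
            # the bot decides its own schedule, here: the next full hour
            now = datetime.datetime.now()
            return (now.replace(minute=0, second=0, microsecond=0)
                    + datetime.timedelta(hours=1))

        def sleep_until(self, when):
            delay = (when - datetime.datetime.now()).total_seconds()
            if delay > 0:
                time.sleep(delay)

        def run(self):
            while True:
                self.process()                         # fetch and enqueue one batch
                self.sleep_until(self.next_wakeup())   # then set its own alarm

        def process(self):
            print("collecting at", datetime.datetime.now())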
- IntelMQ is now being used by multiple teams and requires stability during execution, etc... it seems there is a need for integration with tools like systemd.
Again I agree with the requirement that intelmq shall run stably, and that it increasingly needs to. The reason is that usage is moving more towards production, and a setup will be less likely to get care from administrators who happen to be intelmq developers at the same time.
However, again I'm unsure if systemd is a good solution. If a solution is implemented, I think it should be the only way to run intelmq, to reduce complexity and maintenance costs in the end. Using systemd for running bots thus means adding a hard dependency on systemd, and thereby restricting intelmq to GNU/Linux systems that come with systemd. It would exclude some operating systems like the NetBSD/FreeBSD/OpenBSD group, or even more exotic ones like Illumos. The restriction to GNU/Linux systems with systemd is something I could live with.
My second concern regarding systemd is more important: systemd is a general process manager, while for intelmq we may need a process manager that implements tactics specific to abuse handling and intelmq's design. For example, regarding the question: what shall be done if we get congestion in intelmq? Should the starting of scheduled bot runs be queued up or dropped? Which ones are more important than others? My gut feeling is that we will need some sort of flow control, and that systemd will be unable to manage that without additional stunts.
Please, if you have time, read the proposal and write your thoughts to the mailing list, split into "What you like" and "What you don't like".
(Just doing so; sorry that it took a few days. I was pressed for time to get the web frontend for an IntelMQ setup we call intelmq-cb-mailgen released in a beta version, and then had a few days off. So here we go:)
In the == intelmqctl principles section:
1) The first two seem to collide: "in the background" and "interactive" mode do not go together. Overall it is okay with me to be able to start processes in the foreground, be it for diagnostic purposes or interaction. I guess the first one should be rephrased to say "by default bots are started in the background" or "unless a different option is given".
I'm less sure that everything shall be a "bot", though. (BTW: the phrasing "bot" still feels odd; I'm not sure what it really means in comparison to other daemons or processes.)
2) What does "provide the best log messages" mean? Each diagnostic message added to the code will be added in good faith anyway, and using the highest log level in all circumstances seems undesirable.
3) Principle of checking configuration and trying to repair it: If intelmqctl can repair/clean a configuration, why should it stop and ask for a rerun? It seems to have enough information to continue, so it could continue right away.
== more thoughts/questions
My suggestion is to come up with language that indicates whether a bot is "enabled", in the sense that it shall run according to its spec, or "disabled", and where "is running" or "is not running" means whether it actually holds the program pointer or not. Right now we have three states:
* shall be autostarted on reboot or not (called "enabled")
* shall be running always or be woken up at certain times (called "started")
* has active control of a program pointer (in one thread or process), not called anything yet.
The --oneshot flag seems to be there for diagnostic purposes, but shouldn't it be a run mode then? Up to this point it seems we have four run modes. :) (This somehow does not match what I see under intelmqctl start 2.3+2.4, where --oneshot is given to all starts. It would mean that scheduled bots would only process one message??)
The "reload" action is unclear to me; why make things more complicated? From my perspective, doing more than "restart" would only be helpful if there are resources to be saved (startup costs) or data to be kept from being lost. However, intelmq shall never lose data anyway because of its message-bus design, and I cannot imagine resources that are worth adding the complexity of a reload.
So after reading the proposal for the first time, I believe I have a hard time evaluating it because it does not explain clearly enough what kind of problems it is going to solve. Such an explanation should include examples, to be able to judge whether a solution would be a good one.
(In addition, potential alternatives are not discussed, though this is less important than stating the problems that shall be solved.)
Personally I lack in-depth experience with systemd myself, so I cannot judge its abilities. I tend to like designs that I can understand and implement myself; systemd is mainly a black box for me, so I do not know what kind of advantages or disadvantages it brings for intelmq, and they are not explained so far.
My intention is to help you and all of us build a great system; this is why I'm trying to be clear about what I do not understand. You have asked for my opinion, here it is. I hope it is helpful.
Best Regards, Bernhard
Hi,
On 04/21/2017 10:48 AM, Bernhard Reiter wrote:
thanks to you, Sebastian and Aaron for working out a proposal
I actually have another small proposal here for comparison: https://github.com/wagner-certat/intelmq/blob/proposal/docs/proposal.md It's also an outcome of the same discussion, with some differences and simplifications. But it tackles fewer issues.
I try to give some answers where possible. Tomas, please correct me if I am wrong.
There are other options that should be considered as well. An alternative would be: each bot gets helper functions that allow it to be woken up on certain events, and the event types include a sleep until an alarm goes off. The time of the alarm can be set by the bot itself. This way only one run mode is necessary, as a bot can decide whether it still wants to configure something (like "on_boot"), run, or be woken up later at a certain time.
The idea is to keep intelmq simple (which has been one of the design goals since the beginning of this project) and to use existing and well-known tools instead of implementing our own bunch of bugs.
However again I'm unsure if systemd is a good solution.
Personally, I want to keep the PID-based approach and encourage developers to provide support for supervisord etc. Making systemd a hard-requirement is not *my* intention. However, my idea and intention is to not implement the process management ourselves, but maybe we can't avoid it because of:
My second concern regarding systemd is more important: systemd is a general process manager, while for intelmq we may need a process manager that implements tactics specific to abuse handling and intelmq's design. For example, regarding the question: what shall be done if we get congestion in intelmq? Should the starting of scheduled bot runs be queued up or dropped? Which ones are more important than others? My gut feeling is that we will need some sort of flow control, and that systemd will be unable to manage that without additional stunts.
Flow control is definitely an issue and a big topic we should discuss in depth. We (certat) do not need flow control currently but maybe you do?
"in the background" and "interactive" mode does not go together. Overall it is okay to be able to start processes in the foreground to me, may it be for diagnostic purposes or interaction. I guess the first should should be rephrased to say "by default bots are started in the background" or "unless a different option is given".
I think that's the meaning, yes. Basically the current behavior of `intelmqctl start/run`, just renamed.
The "reload" action is unclear to me, why making things more complicated? From my perspective doing more than "restart"
Does "reload" more than "restart"? AFAIU, they are performing the same checks. The only difference is, that restart stops/starts the running continuous bots, and reload sends sighup to those.
Sebastian
Moin,
On Friday, 21 April 2017 12:26:12, Sebastian Wagner wrote:
https://github.com/wagner-certat/intelmq/blob/proposal/docs/proposal.md It's also an outcome of the same discussion, with some differences and simplifications. But it tackles fewer issues.
what issues are we trying to solve? My feeling has grown that it would help us if we pulled together the problems we are trying to solve and wrote them down, even if it is just bullet points or keywords.
The idea is to keep intelmq simple (which has been one of the design goals since the beginning of this project) and to use existing and well-known tools instead of implementing our own bunch of bugs.
I agree with this design goal. As simple as possible, but not simpler than necessary.
If we want intelmq to be simple, my strong recommendation is:
a) Implement (and thus support) only one process management solution. So if a proposal including systemd is considered the leading solution after the discussion, we should implement it and remove the other process management approaches.
b) Reduce the number of run modes as much as we can, as each one adds complexity in thinking and coding. If all processes shall be "bots", then the bots themselves should decide how often they run; it should be within their code. So my idea, briefly outlined above, may actually be the simpler solution. (There may be other ideas as well.)
c) Remove the "reload" option. So far I think the potential benefits are outweighed by the cost.
However again I'm unsure if systemd is a good solution.
Personally, I want to keep the PID-based approach and encourage developers to provide support for supervisord etc. Making systemd a hard-requirement is not *my* intention. However, my idea and intention is to not implement the process management ourselves, but maybe we can't avoid it because of:
As for a) above: we should only have to maintain and think through one solution. I think it is okay to make systemd a hard requirement if that is the leading implementation idea after scrutinising a number of ideas.
I really like using other components, but not at all costs. The component to be used must fit quite well; otherwise the break-even point is easily reached, where learning, adapting and adopting a different component is more work than rolling your own.
My second concern regarding systemd is more important: systemd is a general process manager, while for intelmq we may need a process manager that implements tactics specific to abuse handling and intelmq's design. For example, regarding the question: what shall be done if we get congestion in intelmq? Should the starting of scheduled bot runs be queued up or dropped? Which ones are more important than others? My gut feeling is that we will need some sort of flow control, and that systemd will be unable to manage that without additional stunts.
Flow control is definitely an issue and a big topic we should discuss in depth. We (certat) do not need flow control currently but maybe you do?
What I mean by flow control is that we take the relations between the bots into account and implement strategies and tactics based on intelmq-specific information. In Navtej's Friday post you can see how he makes use of this information and proposes improved solutions to steer the flow within intelmq. To me it feels like an intelmq process manager could do this much better, because it already knows how the pipes are wired together.
Sooner or later I guess intelmq will need this kind of "flow control" to be able to fulfill its promise of providing a fast and fully automatable system. So it may become interesting to you at certat as well. :)
In our test runs we did see a number of congestions, partly due to other defects or suboptimal configurations, but observing how the system recovers from these situations, it could certainly do much better.
The "reload" action is unclear to me, why making things more complicated? From my perspective doing more than "restart"
Does "reload" more than "restart"? AFAIU, they are performing the same checks. The only difference is, that restart stops/starts the running continuous bots, and reload sends sighup to those.
If it does not do more, get rid of it. (I thought it aimed at doing more, but then bots would need to be prepared to flush some of their data structures while running. It is much simpler for bot writers to just handle stop and start.)
Best Regards, Bernhard
Bernhard Reiter bernhard@intevation.de writes:
On Friday, 21 April 2017 12:26:12, Sebastian Wagner wrote:
Just my 2¢ on some specific point coming up:
[...]
If we want intelmq to be simple, my strong recommendation is: a) Implement (and thus support) only one process management solution. So if a proposal including systemd is considered the leading solution after the discussion, we should implement it and remove the other process management approaches.
Even if systemd turns out to be a good choice, I'd vote against making it a hard dependency. The reasons are very much the same ones Bernhard stated himself in an earlier post: it would make IntelMQ a Linux-only product, which would be a shame given its overall open and portable design.
[...]
Flow control is definitely an issue and a big topic we should discuss in depth. We (certat) do not need flow control currently but maybe you do?
What I mean by flow control is that we take the relations between the bots into account and implement strategies and tactics based on intelmq-specific information. In Navtej's Friday post you can see how he makes use of this information and proposes improved solutions to steer the flow within intelmq. To me it feels like an intelmq process manager could do this much better, because it already knows how the pipes are wired together.
Sooner or later I guess intelmq will need this kind of "flow control" to be able to fulfill its promise of providing a fast and fully automatable system. So it may become interesting to you at certat as well. :)
This might or might not be true; currently the problems we are observing are quite fundamental and don't need overly clever solutions. I'd like to point to the proposal Bernhard Herzog made in issue 709 last year: https://github.com/certtools/intelmq/issues/709 It outlines a rather simple solution to much of the resource problem, and demonstrates how to build solutions that don't depend on a higher-level service with in-depth knowledge of the bots' interactions.
[...]
Does "reload" more than "restart"? AFAIU, they are performing the same checks. The only difference is, that restart stops/starts the running continuous bots, and reload sends sighup to those.
If it does not do more, get rid of it. (I thought it aims for doing more, but then bots would need to be prepared to flush some of their datastructure while running. It is much simpler for bot writers to just write for stop and start.)
Ack.
sascha
Hi,
On Thursday, 27 April 2017 14:16:33, Sascha Wilde wrote:
Even if systemd turns out to be a good choice, I'd vote against making it a hard dependency. The reasons are very much the same ones Bernhard stated himself in an earlier post: it would make IntelMQ a Linux-only product, which would be a shame given its overall open and portable design.
keeping it more portable only makes sense if there are people trying to run and helping to maintain intelmq on a non-GNU/Linux platform. Otherwise it is just extra work and code that never gets executed. And it could be re-added later if a maintainer comes along.
Otherwise I'd like to see what systemd actually brings to the table. If it does not add much, it may just be considered the less attractive solution and should not be used at all (except for starting the top-level intelmq system, whatever that is).
[..]
Sooner or later I guess intelmq will need this kind of "flow control" to be able to fulfill its promise of providing a fast and fully automatable system. So it may become interesting to you at certat as well. :)
currently the problems we are observing are quite fundamental and don't need overly clever solutions. I'd like to point to the proposal Bernhard Herzog made in issue 709 last year: https://github.com/certtools/intelmq/issues/709 It outlines a rather simple solution to much of the resource problem, and demonstrates how to build solutions that don't depend on a higher-level service with in-depth knowledge of the bots' interactions.
Thanks for the reminder about Bernhard Herzog's proposal. This again shows me that we need an overview of the problems we want to solve before we can evaluate the different ways of solving them.
Regards, Bernhard