Dear IntelMQ developers and users,
below are a couple of ideas on how to (hopefully) make the configuration of IntelMQ easier. Feel free to give feedback, voice concerns, or simply ask if something is unclear. We plan to evaluate the feedback in two weeks (after the Christmas holidays).
# IntelMQ Configuration Handling (IntelMQ Enhancement Proposal 01)
## Format
### JSON
At the moment, the configuration format of IntelMQ is JSON[^1]. It is parsed using the Python json library, which is part of the Python Standard Library. The downside of JSON is that it is hard for humans to read and write, and it cannot contain comments.
[^1]: https://docs.python.org/3/library/index.html
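The comment limitation can be checked directly with the standard library parser. The following is a minimal sketch; the setting name and the `//` comment style are only illustrative assumptions, not taken from an actual IntelMQ configuration.

```python
import json

# A plain JSON document parses into ordinary Python objects.
valid = '{"http_proxy": "http://myproxy.tld:80"}'
parsed = json.loads(valid)

# The same document with a comment line (hypothetical "//" style)
# is rejected by the json module.
commented = """{
    // proxy used by all collectors
    "http_proxy": "http://myproxy.tld:80"
}"""

try:
    json.loads(commented)
    comment_accepted = True
except json.JSONDecodeError:
    comment_accepted = False
```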
### YAML
There is a proposal[^2] to use YAML as the default configuration format. YAML is considerably more readable for humans and supports single-line comments. There are two Python YAML libraries: PyYAML[^3] and ruamel.yaml[^4]. The former is maintained by the YAML project itself. The latter is a fork of the former that has seen much more activity over the years and has better support for the standard, although PyYAML seems to have caught up in the last few years. We do not need any edge cases, so both libraries would work for configuration files. According to this issue[^5], PyYAML does not support “editing YAML whilst maintaining comments”, which might be a deal breaker; the issue is from 2016, however, so this might have changed. On the other hand, IntelMQ does not edit configuration files at the moment. Both PyYAML and ruamel.yaml are available as packages in all relevant Linux distributions.
[^2]: https://github.com/gethvi/intelmq/blob/ideas/docs/Ideas.md#changing-configur...
[^3]: https://pyyaml.org/
[^4]: https://yaml.readthedocs.io/en/latest/
[^5]: https://github.com/yaml/pyyaml/issues/46
### INI
The Python Standard Library also ships configparser[^6], which is a “configuration language which provides a structure similar to what’s found in Microsoft Windows INI files”. The files can contain comments, there is a [DEFAULT] section that can be used for default values, and the configuration files can contain variables. One downside is that all values are strings, which means we would have to do the type parsing ourselves.
[^6]: https://docs.python.org/3/library/configparser.html
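To illustrate the string-only downside, here is a small sketch using configparser. The section name, the option names and the values are hypothetical and only serve as an example of how a bot entry might look in INI form.

```python
import configparser

# Hypothetical INI rendering of one bot; option names are made up.
INI_TEXT = """
[DEFAULT]
# comments are allowed on their own line
rate_limit = 60

[shodan1]
module = intelmq.bots.collectors.shodan.collector
enabled = true
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

section = parser["shodan1"]
raw_rate_limit = section["rate_limit"]      # inherited from [DEFAULT], still a str
rate_limit = section.getint("rate_limit")   # explicit conversion to int
enabled = section.getboolean("enabled")     # explicit conversion to bool
```

Every value needs an explicit `getint`/`getboolean` style call, so the type information would have to live in the bot code rather than in the file.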
### TOML
Tom's Obvious, Minimal Language (TOML) is another contender for the role of IntelMQ's configuration file format. It looks similar to the INI file format, but comes with various data types. It also allows comments. There is a Python library[^7] that seems to be very active. TOML is also used as the format for the proposed pyproject.toml file and by the Rust community for their package configuration files. TOML's syntax for dictionaries is hard to read and write, harder than JSON's.
[^7]: https://pypi.org/project/toml/
### Further information
* The summary on file formats in the PEP 518 proposal: https://www.python.org/dev/peps/pep-0518/#other-file-formats
* At the moment we are leaning towards YAML. Regarding the library, we would choose ruamel.yaml, because it seems to have a more active upstream and it can retain comments when it modifies a YAML file.
## Storage
This part is about the question: where do we store the configuration?
The ideas document[^8] on GitHub already proposes to remove the pipeline.conf and to specify the destination pipelines in the individual bot configuration. The declaration of the source queue can then be dropped as well, as it follows a rule anyway.
In addition to that, to make the setup of IntelMQ easier, the defaults.conf should be dropped. Default values should be set in the Bot classes or in the IntelMQ process managers; there is no need for a separate file.
Another question is whether every bot should have its own configuration file. Some users wish to be able to start a bot without having to rely on IntelMQ, but at the moment, the bot gets its configuration from IntelMQ's runtime.conf. If we want to support the request to pass individual configurations to bots, we could allow users to pass a separate configuration file to the bot (e.g. using `-c /path/to/config.$ext`). If that file is not set or does not contain the bot's id, it is ignored and IntelMQ's runtime.conf is used as usual. If it does exist, the global runtime.conf is still parsed (if it exists - it should also be possible to run a bot without a runtime.conf), but only the values that are not set in the individual configuration file are considered. This individual configuration file would also allow a bot to be run in a Docker environment without having to set any environment variables. This would probably make configuration handling easier, because configuration settings could then be stored in a file (and managed by a configuration management system) and the configuration file could contain comments.
Proposal:
* IntelMQ gets one global configuration file for all the bots, and the pipeline.conf will be removed.
* This global configuration file is `${PREFIX}/etc/intelmq/intelmq.$ext`. If it does not exist or does not define any bots, IntelMQ should exit gracefully. The file extension depends on the chosen format.
* The global configuration file contains an array of bot configurations with bot-ids as keys.
* Every bot reads the global configuration file and extracts its own settings (as usual).
* Every bot handles 0 to n `-c /path/to/configurationfile.$ext` flags, which are treated the same way as the global configuration file. The further ahead the configuration file on the command line, the stronger its content (this allows us to have multiple non-global configuration files, e.g. for multiple groups). Example:
  ```
  botcommand bot-id -c /etc/bots/botname.$ext -c /etc/bots/groups/group_foo.$ext
  ```
* Every bot also consults the environment, and the values that are set there override the values in any configuration file.
* There are also configuration files that list settings that are not bot-specific, e.g. under a reserved key `default` (the successor of the defaults.conf file) or `group:id`. Those are handled like other configuration files, but the bot does not compare its name to the key of the configuration.
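The precedence rules in this list could be sketched roughly as follows. This is only an illustration under several assumptions that the proposal does not settle: each file has already been parsed into a list of single-key mappings, "further ahead" is read as "earlier on the command line", values under `default` or `group:...` keys are weaker than bot-specific values from the same file, and merging is shallow. The function names are hypothetical.

```python
from typing import Any, Dict, List, Mapping

Settings = Dict[str, Any]
ConfigFile = List[Mapping[str, Settings]]  # parsed file: list of {key: settings}


def settings_from_file(bot_id: str, config_file: ConfigFile) -> Settings:
    """Return what one parsed configuration file contributes to a bot."""
    shared: Settings = {}
    own: Settings = {}
    for entry in config_file:
        for key, values in entry.items():
            if key == "default" or key.startswith("group:"):
                shared.update(values)  # applies regardless of the bot id
            elif key == bot_id:
                own.update(values)
    # Assumption: bot-specific values win over shared ones from the same file.
    return {**shared, **own}


def resolve(
    bot_id: str,
    global_file: ConfigFile,
    extra_files: List[ConfigFile],
    environ: Mapping[str, str],
) -> Settings:
    """Merge global file, -c files and environment, weakest source first."""
    merged = settings_from_file(bot_id, global_file)
    # Assumption: an earlier -c file is stronger, so apply files last to first.
    for config_file in reversed(extra_files):
        merged.update(settings_from_file(bot_id, config_file))
    # Environment variables override every file.
    prefix = "INTELMQ_"
    for name, value in environ.items():
        if name.startswith(prefix):
            merged[name[len(prefix):].lower()] = value
    return merged
```

Dumping the returned dictionary (for example with `json.dumps(merged, indent=2)`) would be one way to show a user which values finally apply when several files are combined.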
All the evaluated configuration formats make it possible to arrange the configuration parameters in hierarchies. To make the configuration files more readable, IntelMQ should make use of this hierarchy instead of denoting the different hierarchy levels with underscores. So instead of writing `http_proxy`, the `http` parameter would have a child parameter `proxy`. For backwards compatibility and for cases where the underscore does not imply hierarchy, the underscore notation will still work. In addition, IntelMQ should also make use of environment variables - those are still denoted using an underscore as delimiter and are prefixed with `INTELMQ`: `INTELMQ_HTTP_PROXY`.
[^8]: https://github.com/gethvi/intelmq/blob/ideas/docs/Ideas.md
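The relation between the hierarchical notation, the underscore notation and the environment variables described above might look like the following sketch. The helper names `flatten` and `from_environment` are hypothetical, and treating the flat underscore name as the common internal form is an assumption, not something the proposal prescribes.

```python
from typing import Any, Dict, Mapping


def flatten(settings: Mapping[str, Any], parent: str = "") -> Dict[str, Any]:
    """Turn {"http": {"proxy": ...}} into {"http_proxy": ...}."""
    flat: Dict[str, Any] = {}
    for key, value in settings.items():
        name = f"{parent}_{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # descend one hierarchy level
        else:
            flat[name] = value
    return flat


def from_environment(environ: Mapping[str, str], prefix: str = "INTELMQ_") -> Dict[str, str]:
    """Map INTELMQ_HTTP_PROXY to the flat setting name http_proxy."""
    return {
        name[len(prefix):].lower(): value
        for name, value in environ.items()
        if name.startswith(prefix)
    }


nested = {"http": {"proxy": "http://myproxy.tld:80"}, "rate_limit": 60}
flat = flatten(nested)
env = from_environment({"INTELMQ_HTTP_PROXY": "http://env.proxy.tld:80", "HOME": "/root"})
```

With a shared flat form, a value written as `http_proxy`, as `http: proxy:` or as `INTELMQ_HTTP_PROXY` ends up under the same name.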
### Caveats
There are configuration settings that do not really concern the bot, for example the type of process manager that should be used to run the bot. In an ideal setup, the bot should be totally indifferent as to whether it runs in a Docker container, on bare metal, in a SystemD unit file or with SupervisorD. This decision should only concern the tool managing all the bots (intelmqctl or, in the future, intelmq-api, which at the moment uses intelmqctl). Another example is the `enabled` setting. At the moment, those settings are part of the individual bot configuration, but it might make sense to move them to a management.conf configuration file which is only for managing the individual bots, not for configuring their parameters (this file would then also have, for every bot, a field that lists the configuration files the bot should consider when reading its configuration). On the other hand, this might make the configuration more complex again, now that we are trying to merge pipeline.conf and runtime.conf. We could also decide to make those configuration settings part of the global configuration file, given that the individual bots should simply ignore settings they do not know how to handle anyway.
### Overriding by command line parameters
If needed, a user can override specific bot settings using the `-p` switch (e.g. `-p redis_cache=example.com`). This should be easy to implement; in the best case it is only one line of additional code in the Bot class.
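As a rough sketch of how the `-p` switch could be parsed alongside the proposed `-c` flag, the following uses argparse from the standard library. The program name, the destination names and the `parse_override` helper are assumptions for illustration; values stay strings here, and any type conversion would have to happen in a later step.

```python
import argparse
from typing import List, Tuple


def parse_override(text: str) -> Tuple[str, str]:
    """Split a KEY=VALUE argument; the value may itself contain '='."""
    key, sep, value = text.partition("=")
    if not sep or not key:
        raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {text!r}")
    return key, value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="intelmq-bot")  # hypothetical CLI
    parser.add_argument("bot_id")
    # Both flags may be given several times.
    parser.add_argument("-c", dest="config_files", action="append", default=[])
    parser.add_argument("-p", dest="overrides", action="append",
                        type=parse_override, default=[])
    return parser


argv: List[str] = ["fop1", "-c", "/etc/intelmq/intelmq.yml",
                   "-p", "redis_cache=example.com"]
args = build_parser().parse_args(argv)
overrides = dict(args.overrides)  # applied last, on top of files and environment
```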
### Examples
A global configuration file with multiple bots, `/etc/intelmq/intelmq.yml`:
```
- shodan1:
    module: intelmq.bots.collectors.shodan.collector
- mylittlebot23:
    module: intelmq.bots.expert.asn_lookup.expert
    http:
      proxy: http://myproxy.tld:80
- fop1:
    module: intelmq.bots.outputs.file
    output:
      filename: /dev/null
```
We can run a bot with `intelmq-bot shodan1`, which is the same as `intelmq-bot shodan1 -c /etc/intelmq/intelmq.yml`.
Another configuration file with multiple bots /root/intelmq-bots-managed-by-root:
```
- shodan2:
    module: intelmq.bots.collectors.shodan.collector
- fop1:
    module: intelmq.bots.outputs.file
    output:
      filename: /var/log/fop1.log
```
We can run a bot with `intelmq-bot shodan2 -c /root/intelmq-bots-managed-by-root`. We can also run `intelmq-bot fop1 -c /root/intelmq-bots-managed-by-root`, which would then send output to `/var/log/fop1.log`.
A configuration for a group in `/etc/intelmq/collector-group.yml`:
```
- group:collectors:
    http:
      proxy: http://thirdparty.proxy.tld:9000
```
We can run a bot with `intelmq-bot mylittlebot23 -c /etc/intelmq/collector-group.yml`, which uses the third-party proxy.
## Internal handling
Every bot class defines its own settings as class variables. Every class variable has to be typed. Every class variable should be set to a reasonable default, otherwise to None. The `__init__` method of the (abstract) Bot class should load all the relevant configuration files and then overwrite the settings. If a setting is still None and its value is vital for the functionality of the bot, the bot should stop and emit a meaningful error message. For the most common types of settings, there should be Python objects to check the values. Value checking should only be done after all the configurations are merged.
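A minimal sketch of this idea follows. The names `ConfigurationError`, `required` and `FileOutputBot` are hypothetical, the already merged settings are simply passed in as a dictionary, and the value-checking objects mentioned above are left out.

```python
from typing import Any, Mapping, Optional, Tuple, get_type_hints


class ConfigurationError(Exception):
    """Raised when a vital setting is still unset after merging."""


class Bot:
    # Names of settings that must not be None once all sources are merged.
    required = ()  # type: Tuple[str, ...]

    def __init__(self, settings: Mapping[str, Any]) -> None:
        # Only typed class variables count as settings of this bot.
        known = get_type_hints(type(self))
        for name, value in settings.items():
            if name in known:
                setattr(self, name, value)
            # Settings the bot does not know are silently ignored.
        missing = [n for n in self.required if getattr(self, n, None) is None]
        if missing:
            raise ConfigurationError(
                f"{type(self).__name__}: missing required setting(s): "
                + ", ".join(missing)
            )


class FileOutputBot(Bot):
    filename: Optional[str] = None  # vital, no reasonable default
    encoding: str = "utf-8"         # reasonable default
    required = ("filename",)


bot = FileOutputBot({"filename": "/dev/null", "unknown_setting": 1})
```

Because the check runs only after the loop, it sees the merged result and not the state of any single configuration source.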
Hi Birger,
On Thursday, 10 December 2020, 13:17:45, Birger Schacht wrote:
> # IntelMQ Configuration Handling (IntelMQ Enhancement Proposal 01)
> ## Format
> The downside of JSON is that it is hard for humans to read and write, and it cannot contain comments.
JSON can contain comments if they are part of the data itself, e.g.
{ "parameter1": true, "parameter1-comment": "better to have this enabled ;)" }
or
[ { "comment": "this really should be considered 2020-12-14ber", "param1": false }, { "param2": "Bernhard", "comment": "my name in 2020" } ]
This is a good thing if the data is handled mainly by tools and frontends, because otherwise the comments are not accessible or visible there.
It is a drawback for workflows where text files are mainly worked on manually, then saved, deployed and diffed with SCM-like tools or text editors.
So the question behind this is the weight of the different use cases. Personally, I find wiring a graph easier in an editor, and IntelMQ Manager will certainly be used by a number of people for this. So JSON is a good, modern fit for IntelMQ. I also consider it okay to write with a text editor.
_If_ a different format is to be chosen because of the combined weight of the text file and out-of-band comment use cases, I advise against YAML. YAML is too complicated, which makes it hard for tools and humans to write and parse correctly. (The Python proposal lists the problems with the format.) Of all the presented options (YAML, INI, TOML), I'd go with TOML.
Best Regards, Bernhard
Hi everyone,
> So the question behind this is the weight of the different use cases. Personally, I find wiring a graph easier in an editor, and IntelMQ Manager will certainly be used by a number of people for this. So JSON is a good, modern fit for IntelMQ. I also consider it okay to write with a text editor.

I believe people using IntelMQ Manager for wiring the graph don't really need to care what kind of configuration format is used. And for those (like me) using the text editor, YAML feels superior.

> _If_ a different format is to be chosen because of the combined weight of the text file and out-of-band comment use cases, I advise against YAML. YAML is too complicated, which makes it hard for tools and humans to write and parse correctly. (The Python proposal lists the problems with the format.) Of all the presented options (YAML, INI, TOML), I'd go with TOML.

I agree that YAML is complicated and that its features have introduced security issues in the past. I would like to suggest strictyaml for consideration ( https://hitchdev.com/strictyaml/ ), which removes a lot of the complicated stuff and preserves comments across read/write operations. The project seems active. For me YAML is much easier to read and write than JSON or TOML, but this is obviously a very subjective matter.
As for the Storage proposal, I generally like it. With multiple `-c` options it might get confusing really quickly, so I would also suggest an option for printing the final configuration.
Best Regards,
Filip Pokorny CSIRT.CZ
IntelMQ-dev mailing list IntelMQ-dev@lists.cert.at https://lists.cert.at/cgi-bin/mailman/listinfo/intelmq-dev
On 15.12.2020, at 17:12, Filip Pokorny filip.pokorny@csirt.cz wrote:
> I believe people using IntelMQ Manager for wiring the graph don't really need to care what kind of configuration format is used. And for those (like me) using the text editor, YAML feels superior.
I agree here with Filip.
The point, I think, is that the current config is way too complex and does not even intrinsically allow for (natural) comments. YAML would. It would be one particular way of expressing meaning, comments and instructions to people editing config files. JSON comments as described are a bit of a crutch/work-around.
I just want to add my main point: I don't really care which config language is used (YAML, etc) as long as we
* reduce complexity (i.e. I don't want to read a very long config file)
* follow the KISS principle (keep it simple and stupid)
* can explain things there via comments.
So, YAML is just an idea. We could also go for something else.
But in the past, the very long JSON (no comments) format has been, well, a bit cumbersome.
How do other projects do their config language for large and complex configs? conf.d/ style directories are one way to address the problem. Any other ideas on how to reduce complexity?
Best, a.
Hi Filip,
thanks for sharing your viewpoint!
On Tuesday, 15 December 2020, 17:12:41, Filip Pokorny wrote:
> I believe people using IntelMQ Manager for wiring the graph don't really need to care what kind of configuration format is used.

To clarify: The problem I see with some configuration formats is the out-of-band comments, which would not be visible in IntelMQ Manager, so the Manager would be second class. And if there is meaning in the comments, they cannot be semantically diffed.
The out-of-band comments can be part of important use cases (like saving different configurations in Mercurial SCM or so). In order to suggest something, I would personally need to explore and understand these use cases in more detail.
Best Regards, Bernhard
On 15.12.2020, at 17:52, Bernhard Reiter bernhard@intevation.de wrote:
> To clarify: The problem I see with some configuration formats is the out-of-band comments, which would not be visible in IntelMQ Manager, so the Manager would be second class. And if there is meaning in the comments, they cannot be semantically diffed.

Ah! Yeah, that makes sense. I guess the whole discussion is actually a discussion of usability from two different viewpoints: command-line config file editing versus intelmq-manager.
If some form of YAML were used, would the Manager be able to pull in the comments from there (where they are structured and syntactically well defined) and display them?
Hi Bernhard,
thanks for your clarification.
I believe I understand the problem you present, but I still think the benefits outweigh it. After all, this is still an overall improvement: it makes text editing easier and allows for comments (when text editing). It doesn't improve the GUI approach via the Manager, but it doesn't break it or remove functionality either. This whole proposed change is aimed at making manual text editing easier. And as long as changing a configuration that contains comments via IntelMQ Manager preserves the comments (which should be possible), I do not see any downsides to this. Out-of-band comments would remain intact for those important use cases. I do agree that the Manager would be second class because it couldn't see the comments, but frankly this is never going to be a fair fight (CLI vs. GUI); both have their pros and cons. And the only "con" this change introduces for the Manager is that text editing gets more new "pros", which doesn't seem like a deal breaker to me.
Best Regards, Filip
Hi Filip,
thanks for the discussion, I'll answer you and Aaron in one go:
On Tuesday, 15 December 2020, 19:43:25, Filip Pokorny wrote:
> It doesn't improve GUI approach via Manager, but it doesn't break it or remove functionality either.
I think it would in a subtle way. (My aim is to point that way out first, not to recommend a decision in one or the other way.)
Usually an out-of-band comment has interesting info (otherwise it would be senseless) and the info is related to the values nearby.
So if the GUI (editor) changes the value and cannot see the out-of-band comment, but preserves it, the comment will no longer match the value and will potentially be broken. So one system will be the leading system unless all comments are in-band.
On Tuesday, 15 December 2020, 17:25:21, L. Aaron Kaplan wrote:
> But in the past, the very long JSON (no comments) format has been, well, a bit cumbersome.
Optional in-band comments could be introduced. The default formatting could be made human readable and compact.
Otherwise, the length may be a problem of putting too much into one file?

> How do other projects do their config language for large and complex configs?
I guess there is no silver bullet, each product will look at its requirements and use cases. Some put "config" data not into a "language", but consider it internal state that is managed by frontends.
Best Regards, Bernhard
Hi everyone,
thanks a lot for the input on the proposal, it's really great to get such valuable feedback! And I think it's a good sign that, at least until now, the format of the configuration file seems to be the most controversial part of the proposal ;)
I must admit I didn't really think about having access to the comments in the Manager, but it's a good point and I can see the value in that. I guess that making comments part of the data itself would be a solution that works for all the formats we listed, wouldn't it? We could simply introduce a "-comment" suffix that would work for every key. What I'm not sure about right now is how the respective Python libraries handle the ordering of keys. As Bernhard mentioned, the comment should be near the value it refers to. I guess the most common ordering would be alphabetical, so if we add a "-comment" suffix, the comment would be listed below the value. We just have to make sure comment and value are kept together when the file is updated programmatically.
One other thing that might be relevant: I think the manager should not have to deal with the file-format IntelMQ uses internally. I think the communication between API and intelmq-manager should still use JSON, and the conversion should be done by the API, but I don't think that would be a problem.
cheers, Birger
Hi all,
Thanks everybody for your valuable feedback on our proposal. If I may conclude, there were no objections to the sections "Storage" and "Internal handling". Either these proposals are overwhelmingly good or nobody dared to respond :)
Regarding the format, I see that we have differing opinions, especially on the representation of comments in the UI. However, the discussion on this topic stalled without a clear end. Let me summarize the current situation, as of IntelMQ 2.2.x + Manager with JSON as configuration format:
- tricky to edit directly because
  - JSON is a bit picky about its syntax. E.g. in a list or dictionary there must not be a comma after the last element, which is nasty when adding, removing or rearranging parameters
  - JSON and its syntax are not meant for configuration, e.g. adding the syntax elements []{} can be nasty.
- currently there is no way to add comments
  - JSON doesn't have comments by itself
  - IntelMQ + Manager don't support comments by themselves either, even as data within JSON, e.g. by using special parameter names like "parameter1-comment" as Bernhard suggested.
And we have the two use-cases of editing via IntelMQ Manager and editing as text directly. Both ways are supported and should be possible in a reasonable way. By reducing the downsides of direct editing, we could make the life of various IntelMQ users easier.
Both TOML and YAML solve the problem of the tricky-to-edit format. YAML-libraries for Python also support comments which can be /preserved/, even when the file is edited by other means (intelmqctl as well as IntelMQ Manager).
If we choose TOML, and an IntelMQ user uses comments in the file, the comments /will be gone/ if either intelmqctl or IntelMQ Manager (resp. the API) changes the file. If we choose YAML, and an IntelMQ user uses comments in the file, the comments /will not be gone/ if intelmqctl changes the file. The IntelMQ Manager needs fixes as well to preserve comments[0], and showing them in the Manager could be implemented as well.
Then we have the issue of the complexity of TOML/YAML itself, compared to each other. Bernhard noted that YAML is too complex, while Aaron and Filip didn't share the opinion - please correct me if I'm wrong. Staying with JSON means that we have no comments at all, but the user can't even attempt to add comments. The complexity of parsing and writing for the tools is relatively small, as JSON is made for machine-readability.
As far as the discussion has gone so far, we have more "consent" for YAML, and less for TOML and leaving it as is. Please speak up if you think that my summary is wrong.
best regards, Sebastian
[0] Regarding the changes in the IntelMQ Manager Frontend, we (CERT.at) desire help from other community members to implement these features.
On 12/10/20 1:17 PM, Birger Schacht wrote:
Dear IntelMQ developers and users,
below are a couple of ideas how to (hopefully) make configuration of IntelMQ easier. Feel free to give feedback, voice concerns or simply ask if there is something unclear. We plan to evaluate the feedback that emerged in two weeks (after the christmas holidays).
# IntelMQ Configuration Handling (IntelMQ Enhancement Proposal 01)
## Format
### JSON
At the moment, the configuration format of IntelMQ is JSON[^1]. It is parsed using the Python json library, which is part of the Python Standard Library. The downside of JSON is, that is is hard to read and and write for humans and it cannot contain comments.
### YAML
There is a proposal[^2] to use YAML as the default configuration format. YAML provides way better readability for humans and YAML supports single line comments. There are two Python YAML libraries out there, the one being PyYAML[^3] and the other being ruamel.yaml[^4]. The former is a project by the YAML project itself. The latter is a fork of the former and had much more activity over the years and better support of the standard. It seems that pyyaml caught up in the last few years. We don't need any edge cases, so both libraries would be good for configuration files. According to this issue[^5] pyyaml does not support “editing YAML whilst maintaining comments”, which might be a deal breaker, but this issue is from 2016, this might have changed. On the other hand, IntelMQ does not edit configuration at the moment. pyyaml and ruamel.yaml are available as package in all relevant Linux distributions.
### INI
The Python Standard Library also ships configparser[^6], which is a “configuration language which provides a structure similar to what’s found in Microsoft Windows INI files”. The files can contain comments, it comes with a [DEFAULT] section, which can be used for default values and the configuration files can contain variables. One downside is that all the configurations are Strings, which means we would have to do parsing ourself.
### TOML
Tom's Obvious, Minimal Language (TOML) is another contender for the role of IntelMQ's configuration file format. It looks similar to the INI file format, but comes with various data types. It also allows comments. There is a Python library[^7] that seems to be very active. TOML is also used as the format for the proposed pyproject.toml file and by the Rust community for their package configuration files. TOML's syntax for dictionaries is hard to read and write, harder than JSON's.
### Further information
- The summary of other file formats in PEP 518:
https://www.python.org/dev/peps/pep-0518/#other-file-formats
- At the moment we are leaning towards YAML. Regarding the library,
we would choose ruamel.yaml, because it seems to have a more active upstream and it can retain comments when it modifies a YAML file.
## Storage
This part is about the question of where we store the configuration.
The ideas document[^8] on GitHub already proposes removing the pipeline.conf and specifying the destination pipelines in the individual bot configuration instead. The declaration of the source queue can then be dropped as well, as it follows a rule anyway.
In addition to that, to make the setup of IntelMQ easier, the defaults.conf should be dropped. Default values should be set in the Bot classes or in the IntelMQ process managers, but there is no need for a separate file.
Another question is whether every bot should have its own configuration file. Some users wish to be able to start a bot without having to rely on IntelMQ, but at the moment the bot gets its configuration from IntelMQ's runtime.conf. If we want to support the request to pass individual configurations to bots, we could allow users to pass a separate configuration file to the bot (e.g. using `-c /path/to/config.$ext`). If that file is not set or does not contain the bot's id, it is ignored and IntelMQ's runtime.conf is used as usual. If it does exist, the global runtime.conf is still parsed (if it exists; it should also be possible to run a bot without a runtime.conf), but only the values that are not set in the individual configuration file are considered. This individual configuration file would also allow a bot to be run in a Docker environment without having to set any environment variables. This would probably make configuration handling easier, because configuration settings could then be stored in a file (and managed by a configuration management system) and the configuration file could contain comments.
Proposal:
- IntelMQ gets one global configuration file for all the bots and
the pipeline.conf will be removed
- This global configuration file is
`${PREFIX}/etc/intelmq/intelmq.$ext`. If it does not exist or does not define any bots, IntelMQ should exit gracefully. The file extension depends on the chosen format.
- The global configuration file contains an array of bot
configurations with bot-ids as keys.
- Every bot reads the global configuration file and extracts their
own settings (as usual).
- Every bot handles 0 to n `-c /path/to/configurationfile.$ext`
flags, which are treated the same way as the global configuration file. The further ahead the configuration file is on the command line, the stronger its content (this allows us to have multiple non-global configuration files, e.g. for multiple groups). Example:
```
> botcommand bot-id -c /etc/bots/botname.$ext -c /etc/bots/groups/group_foo.$ext
```
- Every bot also consults the environment, and the values that are
set there overwrite the values in any configuration file.
- There are also configuration files which list settings that are
not bot specific, i.e. via a reserved key `default` (successor of the defaults.conf file) or `group:id`. Those are handled like other configuration files, except that the bot does not compare its name to the key of the configuration.
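The precedence described in the list above can be sketched as a layered merge. This is only an illustration under assumptions: the function names are hypothetical, configurations are taken to be already parsed into nested dictionaries, and "further ahead on the command line" is read as "earlier `-c` flag wins".

```python
from collections.abc import Mapping


def deep_merge(base, override):
    """Return a new dict in which values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def effective_settings(global_config, extra_configs, environment_settings):
    """Merge weakest to strongest: global file, -c files, environment.

    `extra_configs` is given in command-line order; the earliest file is
    assumed to be the strongest, so the list is applied in reverse.
    """
    merged = dict(global_config)
    for config in reversed(extra_configs):
        merged = deep_merge(merged, config)
    return deep_merge(merged, environment_settings)


# Illustrative values only.
global_config = {"http": {"proxy": "http://myproxy.tld:80", "timeout": 30}}
bot_file = {"http": {"proxy": "http://bot.proxy.tld:3128"}}
group_file = {"http": {"proxy": "http://group.proxy.tld:9000"}, "rate_limit": 10}
environment = {"http": {"timeout": 5}}

result = effective_settings(global_config, [bot_file, group_file], environment)
```

Merging nested dictionaries key by key (instead of replacing whole sub-trees) is a design choice of this sketch; the proposal does not specify it.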
All the evaluated configuration formats make it possible to arrange the configuration parameters in hierarchies. To make the configuration files more readable, IntelMQ should make use of this hierarchy instead of denoting the different hierarchy levels with underscores. So instead of writing `http_proxy`, the `http` parameter would have a child parameter `proxy`. For backwards compatibility and for cases where the underscore does not imply hierarchy, the underscore notation will still work. In addition, IntelMQ should also make use of environment variables. Those are still denoted using an underscore as delimiter and are prefixed with `INTELMQ`: `INTELMQ_HTTP_PROXY`.
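A possible way to relate the hierarchical notation, the underscore notation and the environment variable names is sketched below. The helper names `flatten` and `env_overrides` are hypothetical; only the `INTELMQ` prefix and the `http_proxy` example come from the text above.

```python
def flatten(settings, prefix=""):
    """Turn {'http': {'proxy': ...}} into {'http_proxy': ...}."""
    flat = {}
    for key, value in settings.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def env_overrides(flat_names, environ, prefix="INTELMQ"):
    """Collect values for known settings from INTELMQ_* variables."""
    overrides = {}
    for name in flat_names:
        env_name = f"{prefix}_{name.upper()}"
        if env_name in environ:
            # Environment values are always strings and may need conversion.
            overrides[name] = environ[env_name]
    return overrides


flat = flatten({"http": {"proxy": "http://myproxy.tld:80"}, "rate_limit": 60})
overrides = env_overrides(flat, {"INTELMQ_HTTP_PROXY": "http://env.proxy.tld:3128"})
```

In a real bot, `os.environ` would be passed as `environ`; it is a parameter here so that the sketch can be tried without touching the process environment.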
### Caveats
There are configuration settings that do not really concern the bot, for example the type of process manager that should be used to run the bot. In an ideal setup, the bot should be totally indifferent as to whether it runs in a Docker container, on bare metal, in a SystemD unit file or with SupervisorD. This decision should only concern the tool managing all the bots (intelmqctl or, in the future, intelmq-api, which at the moment uses intelmqctl). Another example is the `enabled` setting. At the moment, those settings are part of the individual bot configuration, but it might make sense to move them to a management.conf configuration file which is only for managing the individual bots, not for configuring their parameters. For every bot, this file would then also have a field that lists the configuration files the bot should consider when reading its configuration. On the other hand, this might make the configuration more complex again, just as we are trying to merge pipeline.conf and runtime.conf. We could also decide to make those configuration settings part of the global configuration file, given that the individual bots should simply ignore settings they do not know how to handle anyway.
### Overriding by command line parameters
If needed, a user can override specific bot settings using the `-p` switch (e.g. `-p redis_cache=example.com`). This should be easy to implement; in the best-case scenario it is only one line of additional code in the Bot class.
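As a rough idea of the command-line side, here is a sketch with argparse that accepts repeated `-c` and `-p` flags. The parser layout and the helper `parse_override` are assumptions for illustration; nothing like this exists in IntelMQ yet.

```python
import argparse


def parse_override(text):
    """Split 'key=value' into a (key, value) tuple."""
    key, separator, value = text.partition("=")
    if not separator or not key:
        raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {text!r}")
    return key, value


def build_parser():
    parser = argparse.ArgumentParser(prog="intelmq-bot")
    parser.add_argument("bot_id")
    # Both flags may be given several times; argparse collects them in order.
    parser.add_argument("-c", dest="config_files", action="append",
                        default=[], metavar="FILE")
    parser.add_argument("-p", dest="overrides", action="append",
                        default=[], type=parse_override, metavar="KEY=VALUE")
    return parser


args = build_parser().parse_args(
    ["shodan1", "-c", "/etc/intelmq/intelmq.yml", "-p", "redis_cache=example.com"]
)
overrides = dict(args.overrides)
```

Values given with `-p` arrive as strings, so they would go through the same type checking as values from the environment.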
### Examples
A global configuration file with multiple bots, `/etc/intelmq/intelmq.yml`:

    - shodan1:
        module: intelmq.bots.collectors.shodan.collector
    - mylittlebot23:
        module: intelmq.bots.expert.asn_lookup.expert
        http:
          proxy: http://myproxy.tld:80
    - fop1:
        module: intelmq.bots.outputs.file
        output:
          filename: /dev/null

We can run a bot with `intelmq-bot shodan1`, which is the same as `intelmq-bot shodan1 -c /etc/intelmq/intelmq.yml`.
Another configuration file with multiple bots, `/root/intelmq-bots-managed-by-root`:

    - shodan2:
        module: intelmq.bots.collectors.shodan.collector
    - fop1:
        module: intelmq.bots.outputs.file
        output:
          filename: /var/log/fop1.log

We can run a bot with `intelmq-bot shodan2 -c /root/intelmq-bots-managed-by-root`. We can also run a bot using `intelmq-bot fop1 -c /root/intelmq-bots-managed-by-root`, which would then send output to `/var/log/fop1.log`.
A configuration for a group in `/etc/intelmq/collector-group.yml`:

    - group:collectors
        http:
          proxy: http://thirdparty.proxy.tld:9000

We can run a bot with `intelmq-bot mylittlebot23 -c /etc/intelmq/collector-group.yml`, which uses the third-party proxy.
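How a bot could pick its own settings out of such a file, including the reserved `default` key and `group:` entries, might look like the sketch below. Several points are assumptions: the parsed file is modelled as a list of single-key mappings, the `default` entry and its proxy value are invented for the example, the order default < group < bot is a guess, and since the proposal does not say how a bot learns its groups, they are passed in explicitly.

```python
def settings_for(entries, bot_id, groups=()):
    """Resolve the settings of one bot from a parsed configuration file.

    Later layers replace earlier ones per top-level key (shallow merge).
    """
    by_key = {}
    for entry in entries:
        by_key.update(entry)  # every list entry holds exactly one key
    resolved = {}
    for key in ("default", *(f"group:{name}" for name in groups), bot_id):
        resolved.update(by_key.get(key, {}))
    return resolved


# Parsed form of a hypothetical configuration file.
entries = [
    {"shodan1": {"module": "intelmq.bots.collectors.shodan.collector"}},
    {"mylittlebot23": {"module": "intelmq.bots.expert.asn_lookup.expert",
                       "http": {"proxy": "http://myproxy.tld:80"}}},
    {"default": {"http": {"proxy": "http://default.proxy.example:8080"}}},
    {"group:collectors": {"http": {"proxy": "http://thirdparty.proxy.tld:9000"}}},
]

plain = settings_for(entries, "shodan1")
grouped = settings_for(entries, "shodan1", groups=("collectors",))
own = settings_for(entries, "mylittlebot23")
```

Here `shodan1` falls back to the default proxy, picks up the group proxy when the group is named, and `mylittlebot23` keeps its own value.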
## Internal handling
Every bot class defines its own settings as class variables. Every class variable has to be typed and should be set to a reasonable default, otherwise to None. The `__init__` method of the (abstract) Bot class should load all the relevant configuration files and then overwrite the settings. If a setting is still None and its value is vital for the functionality of the bot, the bot should stop and emit a meaningful error message. For the most common types of settings, there should be Python objects to check the values. Value checking should only be done after all the configurations are merged.
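A rough sketch of what this could look like, not a finished design: `ConfigurationError`, `required_settings`, `matches` and `ExampleCollectorBot` are hypothetical names, and the constructor receives an already merged dictionary instead of loading files itself.

```python
from typing import Optional, Union, get_args, get_origin, get_type_hints


class ConfigurationError(Exception):
    """Raised when the merged configuration is unusable."""


def matches(value, expected):
    """Check a value against a plain type or an Optional/Union of plain types."""
    if get_origin(expected) is Union:
        return any(matches(value, arg) for arg in get_args(expected))
    return isinstance(value, expected)


class Bot:
    # Names of settings that must not be None after merging.
    required_settings = ()

    def __init__(self, merged_settings):
        hints = get_type_hints(type(self))
        for name, value in merged_settings.items():
            if name in hints:  # settings the bot does not know are ignored
                setattr(self, name, value)
        # Validation happens only once, after all sources have been merged.
        for name, expected in hints.items():
            value = getattr(self, name)
            if value is None:
                if name in self.required_settings:
                    raise ConfigurationError(
                        f"{type(self).__name__}: setting {name!r} must be set")
            elif not matches(value, expected):
                raise ConfigurationError(
                    f"{type(self).__name__}: setting {name!r} has invalid value {value!r}")


class ExampleCollectorBot(Bot):
    rate_limit: int = 60
    http_proxy: Optional[str] = None
    api_key: Optional[str] = None
    required_settings = ("api_key",)


bot = ExampleCollectorBot({"api_key": "secret", "rate_limit": 10, "unknown": 1})
```

Richer checks (URLs, paths, value ranges) would need dedicated validator objects; plain `isinstance` checks are only the simplest case.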
Hi Sebastian,
On Thursday, 14 January 2021 16:51:37, Sebastian Wagner wrote:
Either these proposals are overwhelmingly
to me they were overwhelmingly big, so I did not fully think them through.
YAML-libraries for Python also support comments which can be /preserved/, even when the file is edited by other means (intelmqctl as well as IntelMQ Manager).
But preserving comments without them being part of the structured data that a tool (like the manager) shows and edits means they are potentially wrong, or "broken", once the tool edits the parameters. This will make all tools second-class citizens. Most of my mails were about explaining this; no consensus is necessary here, as it follows from the nature of the data.
A potential solution to this is to make comments part of the structured data (no matter what format is chosen).
Consensus could be built on whether it is okay to live with the negative consequences of second-class tools (that may break comment and value consistency when used). Or on whether the drawbacks of an overcomplicated format are worth faster typing in most situations. :)
For both questions I don't know enough about the user base of IntelMQ to have an informed opinion. (As a person doing service on the code I'd implement what users would like most.)
Best Regards, Bernhard
On 14.01.2021 16:51:37, Sebastian Wagner wrote:
As far as the discussion has gone so far, we have more "consent" for YAML, and less for TOML and leaving it as is. Please speak up if you think that my summary is wrong.
+1 for YAML
On 18.01.2021, at 10:06, Trey Darley trey.darley@cert.be wrote:
On 14.01.2021 16:51:37, Sebastian Wagner wrote:
As far as the discussion has gone so far, we have more "consent" for YAML, and less for TOML and leaving it as is. Please speak up if you think that my summary is wrong.
+1 for YAML
+1
Hi,
On Thursday, 10 December 2020 13:17:45, Birger Schacht wrote:
This part is about the question of where we store the configuration.
Overall, I miss the use cases or problems that the proposed changes should address. Having a problem description and links to discussions that have already taken place would make it easier to comment on the proposal.
Some relevant places that describe wishes, status and suggestions:
- https://intelmq.readthedocs.io/en/latest/user/bots.html#common-parameters
- https://intelmq.readthedocs.io/en/latest/user/configuration-management.html
- https://github.com/certtools/intelmq/issues/267 (Configurations - Hierarchy configurations), closed
- https://github.com/certtools/intelmq/issues/552 (Enable separate packaging of bots by allowing addition and removals to the config)
The ideas document[^8] on GitHub already proposes to remove the pipeline.conf and specifying the destination pipelines in the individual bot configuration part. The declaration of the source queue can be dropped then as well, as it follows a rule anyway.
The idea sounds useful for decreasing the size of the configuration. (Making something easier to understand is always a use case.)
In addition to that, to make the setup of IntelMQ easier, the defaults.conf should be dropped. Default values should be set in the Bot classes respectively in the IntelMQ process managers, but there is no need for a separate file.
The default.conf seems to be used to offer a single place to change options shared by many bots (e.g. http_user_agent) at once. If options exist where a common value for a single installation and its bots is useful, the functionality has to be kept somewhere central.
I understood that the new place for this would be in a global configuration file, which contains what default.conf had. This would just be a renaming if there weren't other things in the file.
The old pipeline.conf has the wiring, which has an effect that goes beyond one bot. As it connects bots, it may be interesting to have it in one place to check for consistency.
Another question is, if every bot should have their own configuration file.
What would be the use case for this? #552 packaging does not mandate this, if general default values are in the source code of bots. (It would mandate it, if bots had to come with an example config file to be useful.)
Again, one aspect to look at is what we want to do with the configuration files. One use case is: we want to check the whole configuration for consistency. For this it makes sense that a lot is known about the configuration parameters, and to me the best way to specify this is as part of the source code of the bots, using Python code and type information. This way even more complex requirements for config values can be expressed using Python functions, and a dynamic consistency check could use this code. Thus the code for a bot's specific configuration parameters should be close to the bot itself. (And if there are parameters they share, it can be in the super class or abstract class, coming with IntelMQ (core).)
Okay, #552 would want a deinstallation method, which can be implemented against a joined configuration storage as well.
Some users wish to be able to start a bot without having to rely on IntelMQ,
Why? How can a bot with access to the IntelMQ queues be useful? I can imagine some janitor functionality, like refreshing an external datasource format from time to time, which needs parameters that the real bot also needs. Anyhow, this could be seen as not being the bot itself; it would just be shared config values.
If parsing of the central IntelMQ storage were in a library, then those assistant modules could just read the config without starting or stopping other parts of IntelMQ.
If we want to support the request to be able to pass individual configurations to bots,
Why would I run a bot that affects the IntelMQ network with different parameters? I would have to make sure to stop the bot running with the real parameters.
This individual configuration file would also allow a bot to be run in a docker environment without having to set any environment variables.
The bots would still have to access the commonly set parameters. Interlude: https://12factor.net/config believes that using ENVIRONMENT variables would be a good pattern for running application parts ("apps") in different containers. The wiring happens outside, of course. The idea is: if you need a different set of configuration, just fire up a container with it. (I am not necessarily convinced of this pattern, leading to this comment https://github.com/Intevation/intelmq-fody-backend/blob/ad7a88022bdeadf3461a... )
This would make configuration handling probably easier, because then configuration settings could be stored in a file (and managed by a configuration management system)
Several central configuration files could also be handled in an SCM. Of course, the diff for a single bot cannot be seen more easily, if it is just one file that is read.
Proposal:
- IntelMQ gets one global configuration file for all the bots and
the pipeline.conf will be removed
(Then it must have the default.conf possibilities.)
- Every bot handles 0 to n `-c /path/to/configurationfile.$ext`
flags, which are treated the same way as the global configuration file.
A complication I'd only do with a relevant use case.
- Every bot also consults the environment, and the values that are
set there overwrite the values in any configuration file.
Same here.
- There are also configuration files which list settings that are
not bot specific, i.e. via a reserved key default (successor of the defaults.conf file) or group:id, those are also handled like other configuration files, but the bot does not compare its name to the key of the configuration.
So additional default.conf files? (I guess I do not fully understand the idea.)
All the evaluated configuration formats provide the possibility to arrange the configuration parameters in hierarchies. To make the configuration files more readable
This seems to be mostly part of the format discussion. (A file per bot saves one level in the file, making a single file easier to read.)
In an ideal setup, the bot should be totally indifferent as to if it runs in a Docker container, on bare metal, in a SystemD unit file or with SupervisorD.
I agree in principle. A potential solution is: the process manager could extract all the configuration settings and export them all in environment variables. This way the central configuration files (which were existing in all proposed variants) do not have to be shipped to the container, so filesystem access would not be mandatory, only access to redis and whatever other resources a bot needs.
Thinking about this, we could make a redis configuration / control queue and then bots would only need to connect to the queue system and then request their current configuration from there. (File that idea in folder *crazy*, it is getting close to end of business here. ;) )
Overall I've observed much good thinking while reading the storage part of the proposal. The whole problem space does not really segment itself nicely in my head so far, which is a sign that things are more involved than they appear at first sight. I hope my mixture of questions and thoughts helps to make it better!
Best Regards, Bernhard
Hi.
On 1/22/21 5:26 PM, Bernhard Reiter wrote:
This part is about the question of where we store the configuration.
Overall, I miss the use cases or problems that the proposed changes should address. Having a problem description and links to discussions that have already taken place would make it easier to comment on the proposal.
Some relevant places that describe wishes, status and suggestions:
- https://intelmq.readthedocs.io/en/latest/user/bots.html#common-parameters
- https://intelmq.readthedocs.io/en/latest/user/configuration-management.html
- https://github.com/certtools/intelmq/issues/267 (Configurations - Hierarchy configurations), closed
- https://github.com/certtools/intelmq/issues/552 (Enable separate packaging of bots by allowing addition and removals to the config)
Plus:
- https://github.com/certtools/intelmq/issues/570 "configuration format"
- https://github.com/certtools/intelmq/issues/121 "Configuration Files" (closed but not implemented all ideas)
- https://github.com/certtools/intelmq/issues/1026 "Proposal: use template library for JSON configs" (not addressed by this proposal)
- https://github.com/certtools/intelmq/issues/1580 "Some parameters with default values throw AttributeError when not set"

and related to the BOTS file:
- https://github.com/certtools/intelmq/issues/440 "Installing custom Bots"
- https://github.com/certtools/intelmq/issues/1646 "Run custom bot"
- https://github.com/certtools/intelmq/issues/552 "Enable separate packaging of bots by allowing addition and removals to the config."
- https://github.com/certtools/intelmq/issues/757 "Clearly define all parameters used in a bot"
- https://github.com/certtools/intelmq/issues/668 "Very long BOTS file"
- https://github.com/certtools/intelmq/issues/644 "Errors when already configured bots gain additional options through upgrade"
- https://github.com/certtools/intelmq/issues/908 "Parameter from BOTS does'nt passed to a new bot"
But none of them directly matches the proposal, and most are addressed by the "Internal handling" section of the proposal. Our proposal is also based on the requirements collection last year and extended to match the behavior of other tools (`-c` parameter) or simply to add some handy usability tricks like setting parameters with `-p` (useful for debugging & testing). So, besides the examples given or linked in the proposal itself, there are not many more use cases.
Our intention was also to *start* a discussion with the proposal in the first place, but until now the discussion has mainly focused on one aspect. One lesson learned from this is to split proposals into smaller parts and not group them too much.
In addition to that, to make the setup of IntelMQ easier, the defaults.conf should be dropped. Default values should be set in the Bot classes respectively in the IntelMQ process managers, but there is no need for a separate file.
The default.conf seems to be used to offer a single place to change options shared by many bots (e.g. http_user_agent) at once. If options exist where a common value for a single installation and their bots is useful the functionality has to be kept somewhere central.
I understood that the new place for this would be in a global configuration file, which contains what default.conf had. This would just be a renaming if there weren't other things in the file.
It's more than renaming, it's also a cleanup. As the IntelMQ-default values go into the code, that file (or section in a file) only needs to carry those default values which are set by the administrator and differ from IntelMQ's defaults. So the default-files of most installations can be either dropped or will shrink significantly.
Another question is, if every bot should have their own configuration file.
What would be the use case for this? #552 packaging does not mandate this, if general default values are in the source code of bots. (It would mandate it, if bots had to come with an example config file to be useful.)
The question/proposal is based on a use-case identified by the requirements collection:
https://github.com/certtools/intelmq/blob/version-3.0-ideas/docs/architectur...
be on a per-program-basis (one config file per "bot"). The config
files per program shall reside in $base/etc/config.d/ and follow the common linux standards.
The proposal to use the -c parameter for this covers the use-case, but is more generic. For example it can be handy for Docker-setups as well, as described in the initial mail.
Again, one aspect to look at is what we want to do with the configuration files. One use case is: we want to check the whole configuration for consistency. For this it makes sense that a lot is known about the configuration parameters, and to me the best way to specify this is as part of the source code of the bots, using Python code and type information. This way even more complex requirements for config values can be expressed using Python functions, and a dynamic consistency check could use this code. Thus the code for a bot's specific configuration parameters should be close to the bot itself.
Definitely. We thought about using variable typing for this, but haven't done any PoCs yet. See the section "Internal handling" of the proposal.
(And if there are parameters they share, it can be in the super class or abstract class, coming with IntelMQ (core).)
For the CollectorBot and ParserBot classes, this is already the case. There's more potential, e.g. an HTTPBot class.
Some users wish to be able to start a bot without having to rely on IntelMQ,
Why? How can a bot with access to the IntelMQ queues be useful? I can imagine some janitor functionality, like refreshing an external datasource format from time to time, which needs parameters that the real bot also needs. Anyhow, this could be seen as not being the bot itself; it would just be shared config values.
I don't have more details on this use-case. But this use-case is covered by the more generic idea to have a -c parameter to load configuration files.
If we want to support the request to be able to pass individual configurations to bots,
Why would I run a bot that affects the IntelMQ network to be run with different parameters? I have to make sure to stop the bot with the real parameters.
When running bots interactively for testing and debugging, this would be very handy. It's the operator's responsibility to stop the bot after starting it with deviating parameters.
This individual configuration file would also allow a bot to be run in a docker environment without having to set any environment variables.
The bots would still have to access the commonly set parameters.
Not if the commonly set parameters are included in that file, or if IntelMQ's defaults are ok.
Interlude: https://12factor.net/config believes that using ENVIRONMENT variables would be a good pattern for running application parts ("apps") in different containers. The wiring happens outside, of course. The idea is: if you need a different set of configuration, just fire up a container with it. (I am not necessarily convinced of this pattern, leading to this comment https://github.com/Intevation/intelmq-fody-backend/blob/ad7a88022bdeadf3461a... )
This is also the best practice for Docker, leading to this part of the proposal:
- Every bot also consults the environment, and the values that are
set there overwrite the values in any configuration file.
Same here.
The primary use case here is Docker. In Docker, the best practice for passing configuration variables to containers is to use environment variables. This approach is partly used by the existing Docker image we created. For now, we have only implemented this for redis_cache_host (https://github.com/certtools/intelmq/blob/develop/intelmq/lib/bot.py#L734-L7...) as the bare minimum to be able to create the Docker image.
- There are also configuration files which list settings that are
not bot specific, i.e. via a reserved key default (successor of the defaults.conf file) or group:id, those are also handled like other configuration files, but the bot does not compare its name to the key of the configuration.
So additional default.conf files? (I guess I do not fully understand the idea.)
In order to get rid of the separate defaults.conf file, the proposal lists two solutions:
* The reserved key "default" (or similar). For example, the configuration file could look like this:
```
- shodan1:
    module: intelmq.bots.collectors.shodan.collector
- mylittlebot23:
    module: intelmq.bots.expert.asn_lookup.expert
    http:
      proxy: http://myproxy.tld:80
- default:
    http:
      proxy: http://mydefault.proxy.intern:8080
```
* The other, *additional* solution is the group defaults. The example given in the proposal is:
```
- group:collectors
    http:
      proxy: http://thirdparty.proxy.tld:9000
```
This would be a new feature and can be handy for e.g. rate_limit or error handling parameters.
In an ideal setup, the bot should be totally indifferent as to if it runs in a Docker container, on bare metal, in a SystemD unit file or with SupervisorD.
I agree in principle. A potential solution is: the process manager could extract all the configuration settings and export them all in environment variables. This way the central configuration files (which were existing in all proposed variants) do not have to be shipped to the container, so filesystem access would not be mandatory, only access to redis and whatever other resources a bot needs.
That's actually one of the possibilities for deploying every bot in a single Docker container and passing the parameters to the containers via the central orchestration component. However, this can be addressed later.
Thinking about this, we could make a redis configuration / control queue and then bots would only need to connect to the queue system and then request their current configuration from there. (File that idea in folder *crazy*, it is getting close to end of business here. ;) )
I wouldn't call it crazy, but radical.
Overall I've observed much good thinking while reading the storage part of the proposal. The whole problem space does not really segment itself nicely in my head so far, which is a sign that things are more involved than they appear at first sight. I hope my mixture of questions and thoughts helps to make it better!
Thank you for all your valuable feedback, insights and thoughts. We are very thankful for your detailed responses!
best regards Sebastian