Hello,
again a few notes based on Idea experience. :)
From: Sebastian Waldbauer waldbauer@cert.at, Date: Mar 30, 2021
## Use-cases

### Network information

IntelMQ's format currently allows for *exactly one* value per field. For example, every event can have *one* `source.ip` and *one* `source.fqdn`. In some use-cases, multiple values can be useful, for example when querying DNS information. One domain (`source.fqdn`) can point to multiple IP addresses (`source.ip`). The other way round, multiple domains pointing to the same IP address, is also very common. The use-case that first appeared was that one IP address can be part of multiple Autonomous Systems (`source.asn`).[1][2][3]
Do all source.fqdn values have to correspond with source.ip and source.asn?
Consider:
source.ip: [78.128.216.141, 2001:718:ff05:202::141, 78.128.211.46, 2001:718:1:1f:50:56ff:feee:46]
source.fqdn: [idea.cesnet.cz, www.cesnet.cz, cesnet.cz]
The relation of which IPs correspond to which FQDNs is lost here.
If it is not to be lost, you need another level of nesting/indirection. Alternatively, you can _require_ all fields to correspond and split events accordingly where they do not, which means implementing both this IEP and a variation of IEP04 (where you may face the cartesian explosion problem I mentioned in my reaction there). Something akin to:
Event 1
meta.uuid.current: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
source.ip: [78.128.216.141, 2001:718:ff05:202::141]
source.fqdn: [idea.cesnet.cz]

Event 2
meta.uuid.current: bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb
meta.uuid.parent: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
source.ip: [78.128.211.46, 2001:718:1:1f:50:56ff:feee:46]
source.fqdn: [www.cesnet.cz, cesnet.cz]
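To make the splitting idea a bit more concrete, here is a minimal Python sketch. It assumes the producer already knows which IPs and FQDNs belong together and hands them over as groups. The function name `split_into_linked_events` and the choice of making the first event the parent of all others are my assumptions for illustration, not anything IntelMQ or IEP04 defines.

```python
import uuid


def split_into_linked_events(groups):
    """Emit one event per group of values that correspond to each other.

    `groups` is a list of (ips, fqdns) pairs. The first event acts as the
    parent; every later event points to it via meta.uuid.parent.
    (Hypothetical helper, only for illustration.)
    """
    events = []
    parent_id = None
    for ips, fqdns in groups:
        event = {
            "meta.uuid.current": str(uuid.uuid4()),
            "source.ip": list(ips),
            "source.fqdn": list(fqdns),
        }
        if parent_id is None:
            # First event: remember its id so the others can refer to it.
            parent_id = event["meta.uuid.current"]
        else:
            event["meta.uuid.parent"] = parent_id
        events.append(event)
    return events


groups = [
    (["78.128.216.141", "2001:718:ff05:202::141"], ["idea.cesnet.cz"]),
    (["78.128.211.46", "2001:718:1:1f:50:56ff:feee:46"],
     ["www.cesnet.cz", "cesnet.cz"]),
]
events = split_into_linked_events(groups)
```

Within each resulting event all values correspond, and the link between the events survives only through the parent identifier.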
### Classification
...
## Format

{"classification.taxonomy": ["information-content-security", "fraud"],
 "classification.type": ["unauthorised-modification-of-information", "phishing"]}
I believe (feel free to correct me) that RSIT does not preclude using just the first-level category in cases where the second level is ambiguous or unknown, so in the two-array format you could solve it for example like:
{"classification.taxonomy": ["information-content-security", "fraud"],
 "classification.type": [null, "phishing"]}
In Idea we went for a "merged" field; here it might look like:
classification: [ "information-content-security.unauthorised-modification-of-information", "fraud.phishing" ]
or considering missing second level:
classification: [ "information-content-security", "fraud.phishing" ]
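The two representations can be converted into each other mechanically. A small Python sketch follows, assuming the taxonomy and type names themselves never contain a dot (true for the examples above) and that a missing second level is represented as `None`. The function names `merge_classification` and `split_classification` are made up for this illustration.

```python
def merge_classification(taxonomies, types):
    """Join parallel taxonomy/type lists into "taxonomy.type" strings.

    A type of None means the second level is unknown, so only the
    taxonomy is kept. (Hypothetical helper, only for illustration.)
    """
    if len(taxonomies) != len(types):
        raise ValueError("taxonomy and type lists must have equal length")
    return [
        taxonomy if type_ is None else f"{taxonomy}.{type_}"
        for taxonomy, type_ in zip(taxonomies, types)
    ]


def split_classification(merged):
    """Inverse of merge_classification; assumes no dots inside names."""
    taxonomies, types = [], []
    for item in merged:
        taxonomy, _, type_ = item.partition(".")
        taxonomies.append(taxonomy)
        types.append(type_ or None)  # empty string means no second level
    return taxonomies, types


merged = merge_classification(
    ["information-content-security", "fraud"],
    [None, "phishing"],
)
```

The merged form keeps each taxonomy together with its type, so the pairing cannot drift apart the way two parallel arrays can.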
### Optional back-conversion ("value-explosion")
One variant/option of this IEP is to create a conversion layer from the new multi-value format to the old one-value format by creating multiple events with only one value per field. Using this conversion, compatibility with external components can be kept, while the advantages only exist inside the IntelMQ core (i.e. the bots).
Examples:

{"source.ip": ["127.0.0.1", "127.0.0.2"], "source.fqdn": ["example.com"]}
->
{"source.ip": "127.0.0.1", "source.fqdn": ["example.com"]},
{"source.ip": "127.0.0.2", "source.fqdn": ["example.com"]}

{"source.ip": ["127.0.0.1", "127.0.0.2"], "source.fqdn": ["example.com", "example.org"]}
->
{"source.ip": "127.0.0.1", "source.fqdn": "example.com"},
{"source.ip": "127.0.0.1", "source.fqdn": "example.org"},
{"source.ip": "127.0.0.2", "source.fqdn": "example.com"},
{"source.ip": "127.0.0.2", "source.fqdn": "example.org"}
Ah, here goes cartesian. :) In theory this could work. In reality: don't do that. We tried. Soon somebody starts to use multiple values for scans and DDoSes, and you really do not want to grind the processing machine to a halt by creating a specific event for each combination of 200 source IPs times 150 FQDNs times 400 target IPs times 350 FQDNs times 50 ASNs, times ... The numbers get too big too fast.
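To put a number on it, here is a rough Python sketch of such a value explosion using `itertools.product`, plus a count of how many events the sizes mentioned above would produce. The function names `explode` and `exploded_count` are hypothetical; this is not how IntelMQ implements (or would implement) the conversion.

```python
from itertools import product
from math import prod


def explode(event):
    """Yield one single-valued event per combination of field values.

    Scalar fields are treated as one-element lists.
    (Hypothetical helper, only for illustration.)
    """
    keys = list(event)
    value_lists = [v if isinstance(v, list) else [v] for v in event.values()]
    for combination in product(*value_lists):
        yield dict(zip(keys, combination))


def exploded_count(event):
    """Number of events explode() would produce, without producing them."""
    return prod(len(v) if isinstance(v, list) else 1 for v in event.values())


small = {
    "source.ip": ["127.0.0.1", "127.0.0.2"],
    "source.fqdn": ["example.com", "example.org"],
}
small_events = list(explode(small))  # 4 events, as in the example above

# Field sizes from the scan/DDoS scenario: 200 * 150 * 400 * 350 * 50
big_count = prod([200, 150, 400, 350, 50])
print(big_count)  # 210000000000
```

That is 210 billion single-valued events from one multi-valued event, before any further multi-valued fields are taken into account.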
IntelMQ has followed the KISS ("keep it simple, stupid")[4] principle from its beginning. It is disputable whether multiple values break with this principle.
Depends on the use-case - you might decide against multiple values simply because the majority of IntelMQ users and use-cases do not need them, so the benefit does not outweigh the increase in complexity. We had to bite the bullet, because we have a number of our own sources of data which are inherently M:N. YMMV.
## Alternatives
An alternative to using multiple values per field is to set unique identifiers (e.g. UUID) per event and let events with the same origin have the same "parent" identifier. This way, related events can be linked and compatibility is easier. Relating the events to each other requires extra steps, but keeps the KISS principle. This approach will be described in IEP04.
There are complications here too - how long should a reader wait for possible child events? How does it know it has a complete set before processing it and/or sending it forward?
## Other IoC processing formats
For reference, we describe the formats of other IoC-processing systems similar to IntelMQ. Both formats, IDEA and n6, support multiple values in different ways. If you know of other similar formats supporting multiple values, please speak up!
As Idea is loosely based on IDMEF, I've been contacted by the Prelude SIEM guys, who are trying to do similar things at: https://www.secef.net/ I haven't had time to review their work though.
Cheers -- Pavel Kácha, CESNET