Re: [IntelMQ-dev] [IntelMQ-users] IEP03: IntelMQ Data Format - Multiple Values

31 Mar 2021


Hello,
again a few notes based on Idea experience. :)
...
From: Sebastian Waldbauer waldbauer@cert.at, Date: Mar 30, 2021
## Use-cases
### Network information
IntelMQ's format currently allows for *exactly one* value per field. For
example, every event can have *one* `source.ip` and *one* `source.fqdn`. In
some use-cases, multiple values can be useful, for example when querying DNS
information. One domain (`source.fqdn`) can point to multiple IP addresses
(`source.ip`). The other way round, multiple domains pointing to the same IP
address, is also very common. The use-case that first appeared was that one IP
address can be part of multiple Autonomous Systems (`source.asn`).[1][2][3]
Do all source.fqdn values have to correspond to source.ip and source.asn?
Consider:
   source.ip: [78.128.216.141, 2001:718:ff05:202::141, 78.128.211.46, 2001:718:1:1f:50:56ff:feee:46]
   source.fqdn: [idea.cesnet.cz, www.cesnet.cz, cesnet.cz]
The relation of which IPs correspond to which FQDNs is lost here.
If it is not to be lost, you need another level of nesting/indirection -
or you can _require_ all fields to correspond, split events accordingly
where that is not the case, and implement both this and a variation of IEP04
(where you may face the Cartesian explosion problem I mentioned in my reaction
there). Something akin to:
Event 1
   meta.uuid.current: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
   source.ip: [78.128.216.141, 2001:718:ff05:202::141]
   source.fqdn: [idea.cesnet.cz]
Event 2
   meta.uuid.current: bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb
   meta.uuid.parent: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
   source.ip: [78.128.211.46, 2001:718:1:1f:50:56ff:feee:46]
   source.fqdn: [www.cesnet.cz, cesnet.cz]
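A minimal sketch of the split shown above, assuming the caller already knows which values belong together. The function name `split_into_linked_events` and the input shape (a list of per-group dicts) are hypothetical and only for illustration; the `meta.uuid.*` field names are taken from the example, not from any released IntelMQ schema.

```python
import uuid


def split_into_linked_events(groups):
    """Turn groups of mutually corresponding values into linked events.

    The first group becomes the root event; every later event points to it
    through meta.uuid.parent, mirroring the Event 1 / Event 2 example.
    """
    events = []
    root_id = None
    for group in groups:
        event = {"meta.uuid.current": str(uuid.uuid4())}
        if root_id is None:
            root_id = event["meta.uuid.current"]
        else:
            event["meta.uuid.parent"] = root_id
        event.update(group)
        events.append(event)
    return events


# Each dict holds values that really do correspond to each other.
groups = [
    {"source.ip": ["78.128.216.141", "2001:718:ff05:202::141"],
     "source.fqdn": ["idea.cesnet.cz"]},
    {"source.ip": ["78.128.211.46", "2001:718:1:1f:50:56ff:feee:46"],
     "source.fqdn": ["www.cesnet.cz", "cesnet.cz"]},
]
events = split_into_linked_events(groups)
```

The IP-to-FQDN relation survives because each event carries only values that belong together, at the cost of the reader having to reassemble events by parent UUID.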
...
### Classification
...
...
## Format
{"classification.taxonomy": ["information-content-security", "fraud"],
 "classification.type": ["unauthorised-modification-of-information",
                         "phishing"]}
I believe (feel free to correct me) that RSIT does not preclude using just
the first-level category in cases where the second level is ambiguous or
unknown, so in the two-array format you could solve it, for example, like this:
{
      "classification.taxonomy": ["information-content-security", "fraud"],
      "classification.type": [null, "phishing"]
}
In Idea we went for a "merged" field; here it might look like:
classification: [
      "information-content-security.unauthorised-modification-of-information",
      "fraud.phishing"
   ]
or, considering a missing second level:
classification: [
      "information-content-security",
      "fraud.phishing"
   ]
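The two representations can be converted into each other mechanically. A rough sketch, assuming neither taxonomy nor type names contain a dot (the names in the examples use hyphens only); both helper names are hypothetical:

```python
def merge_classification(taxonomies, types):
    """Join parallel taxonomy/type lists into "taxonomy.type" strings.

    A missing type (None) yields just the bare taxonomy.
    """
    if len(taxonomies) != len(types):
        raise ValueError("taxonomy and type lists must have equal length")
    return [tax if typ is None else f"{tax}.{typ}"
            for tax, typ in zip(taxonomies, types)]


def split_classification(merged):
    """Inverse of merge_classification; assumes no dots inside the names."""
    taxonomies, types = [], []
    for item in merged:
        tax, _, typ = item.partition(".")
        taxonomies.append(tax)
        types.append(typ or None)  # empty string means "no second level"
    return taxonomies, types


merged = merge_classification(
    ["information-content-security", "fraud"],
    [None, "phishing"],
)
taxonomies, types = split_classification(merged)
```

The merged form cannot get out of step the way two parallel arrays can, which is one reason to prefer it.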
...
### Optional back-conversion ("value-explosion")
One variant/option of this IEP is to create a conversion layer from the new
multi-value format to the old one-value format by creating multiple events
with only one value per field. Using this conversion, compatibility with
external components can be kept, while the advantages only exist inside the
IntelMQ core (i.e. the bots).
Examples:
{"source.ip": ["127.0.0.1", "127.0.0.2"], "source.fqdn": ["example.com"]}
    -> {"source.ip": "127.0.0.1", "source.fqdn": "example.com"},
       {"source.ip": "127.0.0.2", "source.fqdn": "example.com"}
{"source.ip": ["127.0.0.1", "127.0.0.2"], "source.fqdn": ["example.com", "example.org"]}
    -> {"source.ip": "127.0.0.1", "source.fqdn": "example.com"},
       {"source.ip": "127.0.0.1", "source.fqdn": "example.org"},
       {"source.ip": "127.0.0.2", "source.fqdn": "example.com"},
       {"source.ip": "127.0.0.2", "source.fqdn": "example.org"}
Ah, here goes the Cartesian product. :) In theory this could work. In reality -
don't do that. We tried. Soon somebody starts to use multiple values for scans
and DDoSes, and you really do not want to grind the processing machine to a
halt creating a specific event for 200 source IPs times 150 FQDNs times
400 target IPs times 350 FQDNs times 50 ASNs, times ...
This gets to too-big numbers too fast.
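To put a number on it, here is a small sketch of the explosion using `itertools.product`. It assumes every field of the event holds a list; `explode` is a hypothetical helper, not an existing IntelMQ function.

```python
from itertools import product
from math import prod


def explode(event):
    """Yield one single-value event per combination of field values."""
    fields = list(event)
    for combo in product(*(event[field] for field in fields)):
        yield dict(zip(fields, combo))


small = {"source.ip": ["127.0.0.1", "127.0.0.2"],
         "source.fqdn": ["example.com", "example.org"]}
exploded = list(explode(small))  # 2 * 2 = 4 events

# The list sizes from the scan/DDoS example above.
counts = [200, 150, 400, 350, 50]
total = prod(counts)  # 210_000_000_000 events from one multi-value event
```

The number of output events is the product of the list lengths, so each additional multi-valued field multiplies the work instead of adding to it.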
...
IntelMQ followed the KISS ("keep it simple, stupid")[4] principle from its
beginning. It is debatable whether multiple values break with this principle.
Depends on the use-case - you might decide against multiple values simply
because the majority of IntelMQ users and use-cases do not need them and the
benefit does not outweigh the increase in complexity. We had to bite the
bullet, because we have a number of our own data sources which are inherently
M:N. YMMV.
...
## Alternatives
An alternative to using multiple values per field is to set unique
identifiers (e.g. UUID) per event and let events with the same origin have
the same "parent" identifier. This way, related events can be linked and
compatibility is easier. Relating the events to each other requires extra
steps, but it keeps to the KISS principle. This approach will be
described in IEP04.
There are complications here too - how long should a reader wait for possible
child events? How does it know it has a complete set before processing it
and/or sending it forward?
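One common workaround is a "quiet period" buffer, sketched below purely to illustrate the trade-off. The class `FamilyBuffer`, its parameters, and the assumption that every child carries its root's UUID in `meta.uuid.parent` are hypothetical, not something IEP04 specifies. Timestamps are passed in explicitly to keep the example deterministic.

```python
class FamilyBuffer:
    """Group events by root UUID and release a family after a quiet period."""

    def __init__(self, quiet_seconds):
        self.quiet_seconds = quiet_seconds
        self.families = {}   # root uuid -> list of events
        self.last_seen = {}  # root uuid -> time of the latest arrival

    def add(self, event, now):
        # A root event has no parent, so it is filed under its own UUID.
        root = event.get("meta.uuid.parent", event["meta.uuid.current"])
        self.families.setdefault(root, []).append(event)
        self.last_seen[root] = now

    def flush(self, now):
        """Return families that saw no new member for quiet_seconds."""
        ready = [root for root, seen in self.last_seen.items()
                 if now - seen >= self.quiet_seconds]
        released = []
        for root in ready:
            released.append(self.families.pop(root))
            del self.last_seen[root]
        return released


buf = FamilyBuffer(quiet_seconds=30)
buf.add({"meta.uuid.current": "a"}, now=0)
buf.add({"meta.uuid.current": "b", "meta.uuid.parent": "a"}, now=10)
early = buf.flush(now=20)  # nothing yet, the family is still "warm"
done = buf.flush(now=40)   # the family of two is released
```

This does not answer the question, it only picks a timeout: a child arriving after the flush is treated as a new, incomplete family, and every family is delayed by at least the quiet period.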
...
## Other IoC processing formats
For reference, we describe the formats of other IoC-processing systems similar
to IntelMQ. Both formats, IDEA and n6, support multiple values in different
ways. If you know of other similar formats supporting multiple values, please
speak up!
As Idea is loosely based on IDMEF, I've been contacted by the Prelude SIEM
guys, who are trying to do similar things at https://www.secef.net/
I haven't had time to review their work, though.
Cheers
-- Pavel Kácha, CESNET
