Data Harmonization - Fields with multiple values - IntelMQ-dev

Knight, Alexander

3 Nov 2017

6:26 a.m.

Hi All,

I am currently in the process of deciding whether ANZ should incorporate IntelMQ into its Threat Intelligence ingestion and sharing platform.

At the Deepsec conference Sebastian mentioned updating the harmonization to allow for fields with multiple values. Has this issue been progressed at all? We will require multiple values for some fields in our events, and I was considering adding this functionality (perhaps in a hacky way) to my own fork, but I would like an update on the progress on the work on the master before doing so.

Regards,
Alex Knight | ANZ | ISO | Cyber Security Engineering
Level 8, 55 Collins Street, Melbourne 3000
Phone: +61 386 545 888 | www.anz.com

Sebastian Wagner

8 Nov 2017

12:59 p.m.

New subject: [Intelmq-dev] Data Harmonization - Fields with multiple values

Hi,

On 11/03/2017 06:26 AM, Knight, Alexander wrote:

...

At the Deepsec conference Sebastian mentioned updating the harmonization to allow for fields with multiple values. Has this issue been progressed at all?

The use case was the field abuse_contact, which could be a list and then be concatenated (if necessary) with commas. Technically, it is not hard to do. In the develop branch I already have something similar (and more complex): a dictionary type named JSONDict. So the issue itself has not progressed directly, but there are some changes that should make it easier.

There are some questions popping up that need to be clarified first:

* How to define the types of the values inside the list? E.g. for the abuse_contact it has to be a list of strings/email addresses.
* How should the "API" look, or in other words: what should happen for the in and setitem operations etc.?
* When should the list be converted to a string (or maybe also a JSON list)? E.g. for postgres output the abuse_contact could either be a JSON list or a comma-separated list, depending on the table's definition, but for NoSQL databases and files it can be just the list itself.
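To make the three questions concrete, here is a minimal sketch of what a list-valued field could look like. It is not IntelMQ code: the class `TypedList`, the helper `is_email` and the method names `to_csv` and `to_json` are all hypothetical, chosen only to show one possible answer to each question (per-element validation, list-like `in` behaviour, and late conversion to a string).

```python
import json
from typing import Callable, List


def is_email(value: str) -> bool:
    """Very rough e-mail check, for illustration only (hypothetical helper)."""
    local, sep, domain = value.partition("@")
    return bool(local) and sep == "@" and "." in domain


class TypedList:
    """Hypothetical list-valued field; not part of IntelMQ."""

    def __init__(self, is_valid: Callable[[str], bool]):
        self._is_valid = is_valid
        self._items: List[str] = []

    def append(self, value: str) -> None:
        # Question 1: every element is checked against the element type.
        if not self._is_valid(value):
            raise ValueError("invalid element: %r" % (value,))
        self._items.append(value)

    def __contains__(self, value: str) -> bool:
        # Question 2: `in` tests membership of a single element.
        return value in self._items

    def __len__(self) -> int:
        return len(self._items)

    def to_csv(self) -> str:
        # Question 3a: comma-separated string, e.g. for a text column.
        return ",".join(self._items)

    def to_json(self) -> str:
        # Question 3b: JSON list, e.g. for a JSON column or a document store.
        return json.dumps(self._items)
```

In this sketch the conversion to a string happens only when an output asks for it, so outputs that can store lists natively would never need to call `to_csv`.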

And: what use cases do we have? That's good to know before thinking about how we implement that all:

...

We will require multiple values for some fields in our events,

What is in these fields? (type and/or example values) Where do you put that and how do you want to work with it (inside IntelMQ)?

I'd like to hear opinions of other users and developers too!

Sebastian

P.S.: I do have specific ideas, but don't want to bias others ;)

--
// Sebastian Wagner wagner@cert.at - T: +43 1 5056416 7201
// CERT Austria - https://www.cert.at/
// An initiative of nic.at GmbH - https://www.nic.at/
// Company register number 172568b, LG Salzburg

Knight, Alexander

9 Nov 2017

5:05 a.m.

Hi,

And: what use cases do we have?

My particular use case at the moment is to have lists of IP addresses, IP networks and possibly FQDNs.

How to define the types of the values inside the list?

The values will be those that conform to IPAddress, IPNetwork and FQDN for their respective type. The list could be represented as a vertical-bar-separated or comma-separated list within a string, or it could be a proper Python list.

How should the "API" look?

The API should function as a regular Python list. That being said, I don't imagine doing any complex operations with the list - I will have access to all the values within the parser and will be able to add them all to the event at once.

When should the list be converted to a string (or maybe also a JSON-list)?

My main usage will be outputting the events to Mongo - in that case a JSON list will work. But overall I am happy to use strings to represent the list for all outputs if it makes it easier. I can simply split the values out after receiving the event on the other end.
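The string-based fallback described above (join on output, split on the receiving end) could be sketched roughly as follows. This is an illustration using only Python's standard `ipaddress` module; the function names `join_ip_list` and `parse_ip_list` and the choice of `|` as the default separator are assumptions, not anything defined by IntelMQ.

```python
import ipaddress
from typing import List


def join_ip_list(addresses: List[str], sep: str = "|") -> str:
    """Represent a list of IP addresses as one delimited string."""
    return sep.join(addresses)


def parse_ip_list(raw: str, sep: str = "|") -> List[str]:
    """Split a delimited string back into a list of validated IP addresses."""
    result = []
    for part in raw.split(sep):
        part = part.strip()
        if not part:
            continue  # tolerate empty segments such as a trailing separator
        # ipaddress.ip_address raises ValueError for anything that is not an IP.
        result.append(str(ipaddress.ip_address(part)))
    return result
```

Validating on the receiving side keeps a malformed element from silently ending up in the reconstructed list.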

My end use case is marking up the events as indicators in STIX. One of the team's most vital sources will have many source IPs/networks/FQDNs per indicator, and thus I would like to be able to send a list of these values as one event.

Regards, Alex

Sebastian Wagner

1:37 p.m.

Hi,

On 11/09/2017 05:05 AM, Knight, Alexander wrote:

...

...
And: what use cases do we have?

My particular use case at the moment is to have lists of IP addresses, IP networks and possibly FQDNs.

As far as I know, IntelMQ was not intended to be used like that (a design decision), but to have one single source and destination per event. I hope Aaron can give more details on this. If there is more than one source, the event can be split, so you have two events, each with one source. Other formats like IDEA from Warden do have this possibility.
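Splitting could look roughly like the sketch below. It works on plain dictionaries rather than IntelMQ's own message classes, and the function `split_by_source` is hypothetical; only the idea (one output event per source value, all other fields copied) comes from the discussion.

```python
from typing import Iterator


def split_by_source(event: dict, key: str = "source.ip") -> Iterator[dict]:
    """Yield one event per value if `event[key]` is a list (hypothetical helper)."""
    values = event.get(key)
    if not isinstance(values, list):
        # Nothing to split: pass a copy of the event through unchanged.
        yield dict(event)
        return
    for value in values:
        single = dict(event)  # shallow copy keeps all other fields
        single[key] = value   # exactly one source per resulting event
        yield single
```

A parser that sees several source addresses for one indicator could call this once and send each resulting event separately.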

Sebastian

L. Aaron Kaplan

3:51 p.m.

...

On 09 Nov 2017, at 13:37, Sebastian Wagner wagner@cert.at wrote:

Hi,

On 11/09/2017 05:05 AM, Knight, Alexander wrote:

...
...
And: what use cases do we have?

My particular use case at the moment is to have lists of IP addresses, IP networks and possibly FQDNs.

As far as I know, IntelMQ was not intended to be used like that (a design decision), but to have one single source and destination per event. I hope Aaron can give more details on this.

Yes, indeed, Sebastian is correct here. To give some historical context to the discussion: when we (Tomas, me, Mauro, ...) started IntelMQ some years ago, we explicitly wanted to keep the format very simple. In addition, we intentionally wanted to be as compatible as possible with the Abusehelper format. In fact, back then we documented the format of Abusehelper, named it (for some weird non-native-English-speaker reason) the "Data harmonisation Ontology" (DHO), and wrote the first documentation on the Abusehelper wiki [1]. By now the Abusehelper DHO differs from IntelMQ's DHO :(

Part of that was the simple (KISS - keep it simple, stupid) principle of having *one event per IP address or FQDN*. Two different IPs? --> Please make two events, even though these events might be tied together via the same FQDN.

The mapping and matching (and the relationships between events) are outside the scope of the format, since they would bring in complexity.

That was the design decision back then.

The other option would have been to adopt formats such as STIX as internal formats, which seemed overkill back then.

...

If there are more than one source, the event can be split, so you have two events, each with one source.

That would be exactly the way to do it in IntelMQ.

...

Other formats like IDEA from warden do have this possibility.

indeed.

Alexander, in case you are interested in having other internal formats, that should be possible. But not trivial. Basically we do have the event/message classes [2] which could be a starting point for abstracting other internal formats such as IDEA. However, I prefer to make clear translator bots between different formats. And yes, some of them might be lossy and not be able to maintain internal structures.

The benefit that you get from KISS is that the tool stays usable for many people... it really was a design decision.

I hope I could clarify things a bit.

Best, Aaron.

[1] https://github.com/abusesa/abusehelper/wiki/Data-Harmonization-Ontology
[2] https://github.com/certtools/intelmq/blob/develop/intelmq/lib/message.py

--
// L. Aaron Kaplan kaplan@cert.at - T: +43 1 5056416 78
// CERT Austria - https://www.cert.at/
// An initiative of nic.at GmbH - http://www.nic.at/
// Company register number 172568b, LG Salzburg

Sebastian Wagner

13 Nov 2017

11:57 a.m.

There are three related things which are relevant for this discussion:

1) We need UUIDs per event to avoid loops - to be defined what "event" means in this context (https://github.com/certtools/intelmq/issues/901)
2) We need some kind of aggregation (https://github.com/certtools/intelmq/issues/751) - inside or outside of IntelMQ
3) We need some possibility to link related events which have been split because of multiple "alternative" values (more IPs per domain etc.) (e.g. https://github.com/certtools/intelmq/issues/543 https://github.com/certtools/intelmq/issues/373)

ad 1) Should the UUID be inherited for alternative values as described in 3)? IMHO no, but that requires a second UUID. Different tools which work on data collected with IntelMQ can then link these events together using the UUIDs.
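The two-UUID idea could be sketched as follows: each split event gets its own identifier, plus a second, shared identifier that ties the alternatives together. The key names `uuid` and `group_uuid` and the function `split_with_links` are purely hypothetical; the linked issues do not fix any field names, and the sketch uses plain dictionaries and Python's standard `uuid` module.

```python
import uuid
from typing import List


def split_with_links(event: dict, key: str = "source.ip") -> List[dict]:
    """Split a list-valued field into single events that share a group UUID."""
    group_id = str(uuid.uuid4())  # second UUID, common to all alternatives
    result = []
    for value in event[key]:
        single = dict(event)
        single[key] = value
        single["uuid"] = str(uuid.uuid4())  # per-event UUID, not inherited
        single["group_uuid"] = group_id     # hypothetical linking field
        result.append(single)
    return result
```

A downstream tool could then group on the shared identifier to reassemble the alternatives, while the per-event identifier stays unique.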

Once we can do 3) and have a possibility to save basic lists for e.g. abuse contacts, then the issue for Alexander is solved too (with an adapted harmonization).

Sebastian

Navtej Singh

9 Nov 2017

7:24 a.m.

Would https://github.com/certtools/intelmq/issues/877 qualify as another use case?
