Hi,

On 7/26/21 3:04 PM, Guillaume GRANJON DE LEPINEY wrote:
> I wonder if there is a simple way to use a Deduplicator bot on an optional field. Indeed, I noticed when I apply the deduplicator on an optional field that the null value must be entered in the redis because all messages (except the first one) that do not contain the field are dropped.
>
> Is there a workaround please?
>
> I could work around this problem by adding two Sieve bots at the exit of the precedent bot that would jump the Deduplicator bot if the message doesn't have the field, but I don't find that to be optimal. Thus, I am open to any proposal that could help me.

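The message-hash method ignores any non-existing key, so every message that lacks the field yields the same hash and is treated as a duplicate of the first one: https://github.com/certtools/intelmq/blob/8a8107ec6b332e710626d056b2b0446ab976775f/intelmq/lib/message.py#L404-L405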

    if filter_type == "whitelist" and key not in filter_keys:
        continue
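
To illustrate the effect, here is a minimal sketch. It is a simplified stand-in for a whitelist-style hash, not the actual IntelMQ code; the function name whitelist_hash, the separator byte and the sample events are only assumptions for the example.

```python
import hashlib


def whitelist_hash(message: dict, filter_keys: set) -> str:
    """Simplified stand-in for a whitelist-style message hash (not IntelMQ's code)."""
    digest = hashlib.sha256()
    for key, value in sorted(message.items()):
        if key not in filter_keys:
            continue  # keys outside the whitelist never reach the hash
        digest.update(str(key).encode())
        digest.update(b"\xc0")  # separator byte, chosen for illustration
        digest.update(str(value).encode())
        digest.update(b"\xc0")
    return digest.hexdigest()


# Illustrative events: only the first one carries the optional field.
with_field = {"source.ip": "192.0.2.1", "source.fqdn": "example.com"}
without_a = {"source.ip": "192.0.2.2"}
without_b = {"source.ip": "192.0.2.3"}
keys = {"source.fqdn"}

# Both events without the field hash to the digest of empty input.
print(whitelist_hash(without_a, keys) == whitelist_hash(without_b, keys))
print(whitelist_hash(without_a, keys) == hashlib.sha256().hexdigest())
```

Both comparisons print True: nothing is fed into the hash for events without the field, so they all collide on the same value.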

You could filter these messages out just before the deduplicator. I don't see a reason for two Sieve bots, though: one should be sufficient if you use paths (see https://intelmq.readthedocs.io/en/latest/user/bots.html#sieve).

(btw: If someone tackles https://github.com/certtools/intelmq/issues/1250, the simpler filter expert would also work)

If that's not viable for you, you'd need to adapt the deduplicator's code a bit, probably introducing additional parameters as well. Using Message.set_default_value is not an option either, as it would set a constant default, leading to the same behavior you have now.
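
Such an adaptation could look roughly like the sketch below. It is only a toy model under stated assumptions: an in-memory set stands in for the Redis cache, the class DedupSketch is made up for the example, and bypass_if_missing is a hypothetical parameter that does not exist in IntelMQ.

```python
import hashlib


class DedupSketch:
    """Toy deduplicator; an in-memory set stands in for the Redis cache."""

    def __init__(self, filter_keys, bypass_if_missing=True):
        self.filter_keys = set(filter_keys)
        # Hypothetical parameter, not an existing IntelMQ option.
        self.bypass_if_missing = bypass_if_missing
        self.seen = set()

    def _hash(self, message):
        # Hash only the whitelisted keys that are present in the message.
        digest = hashlib.sha256()
        for key, value in sorted(message.items()):
            if key in self.filter_keys:
                digest.update(f"{key}\x00{value}\x00".encode())
        return digest.hexdigest()

    def should_forward(self, message):
        """Return True if the message should be passed on."""
        if self.bypass_if_missing and self.filter_keys.isdisjoint(message):
            return True  # no whitelisted key present: skip deduplication
        digest = self._hash(message)
        if digest in self.seen:
            return False  # already seen: drop as duplicate
        self.seen.add(digest)
        return True


dedup = DedupSketch({"source.fqdn"})
events = [
    {"source.ip": "192.0.2.1"},
    {"source.ip": "192.0.2.2"},
    {"source.fqdn": "example.com"},
    {"source.fqdn": "example.com"},
]
print([dedup.should_forward(event) for event in events])
```

With the bypass enabled, the two events without the field pass through and only the repeated source.fqdn value is dropped. With bypass_if_missing=False the sketch shows the behavior you see now: the second event without the field is dropped.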

I hope that helps a bit.

Sebastian

-- 
// Sebastian Wagner <wagner@cert.at> - T: +43 676 898 298 7201
// CERT Austria - https://www.cert.at/
// Eine Initiative der nic.at GmbH - https://www.nic.at/
// Firmenbuchnummer 172568b, LG Salzburg