Hi,
On Wednesday, 26 April 2017 17:24:03, Otmar Lendl wrote:
b) how to process larger data-sets
IMHO much more tricky is the issue of actually processing huge data sets. Once file sizes reach the GB range, you need to switch from "load everything into a data structure in RAM, then process it" to "load the next few KB from a data stream, process it, then get the next slice".
Note that there is code to split up line-based data such as CSV; see https://github.com/certtools/intelmq/pull/680
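To illustrate the "next slice" idea for line-based input, here is a minimal sketch. It is not the code from the pull request above; the function name `iter_line_chunks` and the parameter `lines_per_chunk` are made up for this example, and it assumes the input can be read as a text stream of lines.

```python
import io
from itertools import islice


def iter_line_chunks(stream, lines_per_chunk=1000):
    """Yield lists of at most `lines_per_chunk` lines from a text stream.

    Only one chunk is held in memory at a time, so the total input size
    does not matter. (Hypothetical helper, not part of IntelMQ.)
    """
    while True:
        # islice pulls the next few lines without reading the whole stream.
        chunk = list(islice(stream, lines_per_chunk))
        if not chunk:
            return
        yield chunk


# Usage: a small in-memory stream stands in for a multi-GB file.
data = io.StringIO("a,1\nb,2\nc,3\n")
sizes = [len(chunk) for chunk in iter_line_chunks(data, lines_per_chunk=2)]
```

With a real file, `stream` would simply be the object returned by `open(path)`; each chunk can then be parsed and passed on before the next one is read.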
c. Events should have IDs. This will help in acknowledging the correct message in case of multiprocessing with respect to b.
Yes, but for a different reason: Assume more CERTs that do IntelMQ-IntelMQ cross-connects. You need a way to avoid building forwarding loops. Persistent IDs can help (analogous to Message-IDs in the Usenet context).
The problem I see with this approach is that we do not have a single origin of the information that could create a unique ID for it. Say two observing systems notice the same "abuse event" on a machine somewhere and start processing it; they might assign two different IDs to the same event. Just checking the ID later for duplicates would not help.
Or say a single event gets a UID, runs through two systems that process it slightly differently, and then ends up in one abuse system via two different sources. The UID is the same, but the data details differ. Just rejecting the second incoming report on the basis of the UID would throw away additional information, so the UID alone does not seem to be enough.
This is why I still think that a system should have a definition (as working code) of when it considers two events equal, and should then apply it to each report for deduplication and prevention of forwarding loops.
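As a rough sketch of what such a "working code" definition could look like: the example below treats two events as equal when a chosen set of fields matches, regardless of which ID they carry. The field list in `IDENTITY_FIELDS` and the function names are assumptions made for illustration only, not an agreed definition and not existing IntelMQ code.

```python
import hashlib
import json

# Assumption: these fields decide whether two events are "the same".
# A real deployment would have to choose its own list.
IDENTITY_FIELDS = ("source.ip", "classification.type", "time.source")


def event_fingerprint(event, fields=IDENTITY_FIELDS):
    """Return a stable hash over the identity fields of an event dict."""
    identity = {name: event.get(name) for name in fields}
    # sort_keys gives a canonical serialisation independent of dict order.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(events):
    """Yield only the first event seen for each fingerprint."""
    seen = set()
    for event in events:
        fingerprint = event_fingerprint(event)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        yield event


# Usage: same observation reported by two feeds with different extra data.
first = {"source.ip": "192.0.2.1", "classification.type": "malware",
         "time.source": "2017-04-26T15:24:03+00:00", "feed.name": "A"}
second = dict(first, **{"feed.name": "B", "extra": "more details"})
unique = list(deduplicate([first, second]))
```

This sketch simply drops the later report; since that would throw away the additional details mentioned above, a real system might merge the differing fields into the first event instead.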
(btw the Usenet analogy: some sort of Path: header would also be helpful: a list of Systems that this event has already passed through.)
Email and news headers cannot be trusted much; what would we gain from the info?
Best, Bernhard