On 04/26/2017 05:24 PM, Otmar Lendl wrote:
> Ad b) IMHO the much trickier issue is actually processing huge data sets. Once you reach file sizes in the GB range, one needs to switch from "load everything into a data structure in RAM, then process it" to "load the next few KB from a data stream, process it, then get the next slice".
>
> My worry is that the current bot API cannot easily be converted to stream processing.
>
> We need to think this through.
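To illustrate the two modes, here is a minimal hypothetical sketch (not IntelMQ code; handle() is a placeholder for whatever per-record work a bot would do):

    def handle(piece: str) -> None:
        """Placeholder for the actual per-record processing."""
        pass

    def process_all(path: str) -> None:
        # Mode 1: load everything into a data structure in RAM, then process.
        # Fine for small reports, breaks down once files reach the GB range.
        with open(path) as f:
            data = f.read()              # the entire file is held in memory
        for line in data.splitlines():
            handle(line)

    def process_stream(path: str, chunk_size: int = 64 * 1024) -> None:
        # Mode 2: load the next few KB from a stream, process, then repeat.
        # Memory use stays flat regardless of the file size.
        with open(path) as f:
            while True:
                chunk = f.read(chunk_size)  # only one slice in memory at a time
                if not chunk:
                    break
                handle(chunk)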
The ParserBot [1] uses generators (i.e., it processes one line after another), with one exception: the Base64 decoding of `raw`. IMHO we should get rid of that anyway, as it just blows up the size; Redis can handle the data without Base64 just fine.
All parsers derived from ParserBot override only individual methods; they all work the same way. Note that not all of our parsers have been converted to the ParserBot class yet, but that is nothing spectacular.
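For illustration, a minimal sketch of that pattern (class and method names here are hypothetical, not the actual IntelMQ API; the real ParserBot is at [1]):

    import io
    from typing import Iterator

    class StreamingParser:
        """Generator-based parsing: one line is in flight at a time."""

        def parse(self, report: str) -> Iterator[str]:
            # Yield lines lazily; note that `report` is still one big string
            # in memory -- this mirrors the Base64-decoded `raw` issue above.
            for line in io.StringIO(report):
                line = line.strip()
                if line and not line.startswith('#'):
                    yield line

        def parse_line(self, line: str) -> dict:
            # Derived parsers typically override just this one method.
            raise NotImplementedError

        def process(self, report: str) -> Iterator[dict]:
            for line in self.parse(report):
                yield self.parse_line(line)

    class ExampleCSVParser(StreamingParser):
        def parse_line(self, line: str) -> dict:
            ip, timestamp = line.split(',', 1)
            return {'source.ip': ip, 'time.source': timestamp}

So as long as the input arrives as a stream, the per-event work already happens one line at a time; the full-report decode is the remaining obstacle.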
Sebastian
[1]: https://github.com/certtools/intelmq/blob/1.0.0.dev6/intelmq/lib/bot.py#L453