On 22.07.2016 09:02, Dustin Demuth wrote:
Dear all,
currently we are facing the problem that IntelMQ cannot handle large reports (> 500 MB) when using Redis as a message queuing system. At first we thought this might be fixed in the most recent Redis versions (see [1]), but apparently this is not the case.
Dustin,
500 MB is quite a significant amount of data, and IMHO passing around chunks of that size in a RAM-based system is no longer sensible.
This does not only apply to IntelMQ: I've seen a number of code-bases fall over when they were confronted with large data-sets.
(A student once wrote a tool for us to detect CMS versions based on a list of domains. It worked fine on his test data, but once we fed it the 1M .at domains, it broke down. The "load everything into RAM" approach has limits.)
IMHO:
I guess you hit this limit on some shadowserver feeds.
A sensible approach is to add some sort of "split" option to the collector bots. Yeah, this is not nice and perfect, but I'd add some logic like
    if ( sizeof(collected data) > limit) {
        grab header-line
        foreach chunk of X data-lines {
            push headerline, chunk of data-lines into REDIS queue
        }
    } else {
        push original_data into REDIS queue
    }
to the collector bots. Also (and I haven't checked whether this is already the case), make sure you never try to hold everything in memory: download to temp files and stream-process those.
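To make the idea concrete, here is a minimal Python sketch of the split logic combined with the temp-file approach. It is not taken from the IntelMQ code base: the function name `iter_chunks`, the `lines_per_chunk` parameter and the sample data are all made up for illustration, and the actual push into the Redis queue is left to the caller. Because the input is streamed, the size check from the pseudocode is not needed; a small report simply comes out as a single chunk.

```python
import itertools
import tempfile


def iter_chunks(stream, lines_per_chunk):
    """Yield chunks of a CSV-like text stream, each prefixed with the header line.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    header = stream.readline()
    if not header:
        return  # empty report, nothing to push
    while True:
        # Read at most lines_per_chunk data lines from the stream.
        body = list(itertools.islice(stream, lines_per_chunk))
        if not body:
            break
        yield header + "".join(body)


# Stand-in for a downloaded report: write it to a temp file first, then
# stream-process the file instead of keeping the whole download in RAM.
with tempfile.TemporaryFile(mode="w+", encoding="utf-8") as tmp:
    tmp.write("ip,port\n1.1.1.1,80\n2.2.2.2,443\n3.3.3.3,22\n")
    tmp.seek(0)
    # A real bot would push each chunk into the queue here instead of
    # collecting them in a list.
    chunks = list(iter_chunks(tmp, lines_per_chunk=2))
```

Each element of `chunks` is a small, self-contained report that a parser can handle on its own, since the header line is repeated in every chunk. Note that a report consisting of a header line only yields no chunks at all in this sketch.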
If we're dealing with anything CSV-like, the only additional info that piece of code needs is whether
* some extra comment line needs to be stripped
* there is a header line
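Those two pieces of information could be passed in as plain per-feed options. The following is only a sketch under that assumption: `read_preamble`, `skip_comment_lines` and `has_header` are hypothetical names, not existing IntelMQ parameters, and the sample feed is invented.

```python
import io


def read_preamble(stream, skip_comment_lines=0, has_header=True):
    """Consume the leading lines of a CSV-like stream.

    Returns the header line ('' if the feed has none). Afterwards the
    stream is positioned at the first data line.
    """
    for _ in range(skip_comment_lines):
        stream.readline()  # drop the extra comment line(s) before the header
    return stream.readline() if has_header else ""


# Invented sample feed: one comment line, one header line, one data line.
feed = io.StringIO("# example comment line\nip,port\n1.1.1.1,80\n")
header = read_preamble(feed, skip_comment_lines=1, has_header=True)
first_data_line = feed.readline()
```

With the preamble consumed this way, whatever does the chunking only ever sees data lines and can prepend `header` to every chunk it pushes.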
otmar