Dear all,
Currently we are facing the problem that IntelMQ is not capable of handling large reports (> 500 MB) when using Redis as a message queuing system. At first we thought this might be fixed in the most recent Redis versions (see [1]), but apparently this is not the case.
Are you aware of setups of IntelMQ which use a different backend, like ZMQ?
Do you know of performance or other issues which occurred when using other queuing systems?
[1] https://github.com/certtools/intelmq/issues/547
Best Regards, Dustin
On 22 Jul 2016, at 09:02, Dustin Demuth dustin.demuth@intevation.de wrote:
> Dear all,
> Currently we are facing the problem that IntelMQ is not capable of handling large reports (> 500 MB) when using Redis as a message queuing system. At first we thought this might be fixed in the most recent Redis versions (see [1]), but apparently this is not the case.
> Are you aware of setups of IntelMQ which use a different backend, like ZMQ?
> Do you know of performance or other issues which occurred when using other queuing systems?
We originally had RabbitMQ as a message queuing system. Redis was quicker. But there is no reason why we could not additionally add ZMQ (or RabbitMQ) as a possible backend. That's why the pipeline.py module provides an abstraction.
Best, a.
On 22.07.2016 09:02, Dustin Demuth wrote:
> Dear all,
> Currently we are facing the problem that IntelMQ is not capable of handling large reports (> 500 MB) when using Redis as a message queuing system. At first we thought this might be fixed in the most recent Redis versions (see [1]), but apparently this is not the case.
Dustin,
500 MB is quite a significant amount of data and IMHO something where passing around such chunks in a RAM-based system is no longer sensible.
This does not only apply to IntelMQ: I've seen a number of code-bases fall over when they were confronted with large data-sets.
(Once a student wrote a tool for us to detect CMS versions based on a list of domains. All nice and fine for his test data, but once we put the 1M .at domains in, it broke down. The "load everything into RAM" approach has limits.)
IMHO:
I guess you hit this limit on some Shadowserver feeds.
A sensible approach is to add some sort of "split" option to the collector bots. Yeah, this is not nice and perfect, but I'd add some logic like
    if (sizeof(collected_data) > limit) {
        grab header-line
        foreach chunk of X data-lines {
            push header-line + chunk of data-lines into REDIS queue
        }
    } else {
        push original_data into REDIS queue
    }
to the collector bots. Also (and I haven't checked this), make sure you never try to hold everything in memory: download to temp files and stream-process those.
If we're dealing with anything CSV-like, the only additional info that piece of code needs is whether
- some extra comment line needs to be stripped
- there is a header line
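In Python, that split logic could be sketched roughly like this (a minimal illustration, not IntelMQ code; the function name, the line-based `chunk_size`, and the comment-line handling are all assumptions):

```python
def split_report(path, chunk_size=50000, has_header=True):
    """Yield chunks of at most chunk_size data lines, each prefixed
    with the header line so downstream parsers still work."""
    with open(path, "r", encoding="utf-8") as fh:
        first = fh.readline()
        # Skip leading comment lines (some feeds prepend them).
        while first.startswith("#"):
            first = fh.readline()
        if has_header:
            header, chunk = first, []
        else:
            header, chunk = "", [first]
        for line in fh:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                yield header + "".join(chunk)
                chunk = []
        if chunk:
            yield header + "".join(chunk)
```

For a real > 500 MB report you would point this at a downloaded temp file and push each yielded chunk into the queue as its own report, so nothing ever holds the whole file in memory.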
otmar
On 22 Jul 2016, at 22:47, Otmar Lendl lendl@cert.at wrote:
> On 22.07.2016 09:02, Dustin Demuth wrote:
>> Dear all,
>> Currently we are facing the problem that IntelMQ is not capable of handling large reports (> 500 MB) when using Redis as a message queuing system. At first we thought this might be fixed in the most recent Redis versions (see [1]), but apparently this is not the case.
> Dustin,
> 500 MB is quite a significant amount of data and IMHO something where passing around such chunks in a RAM-based system is no longer sensible.
> This does not only apply to IntelMQ: I've seen a number of code-bases fall over when they were confronted with large data-sets.
> (Once a student wrote a tool for us to detect CMS versions based on a list of domains. All nice and fine for his test data, but once we put the 1M .at domains in, it broke down. The "load everything into RAM" approach has limits.)
> IMHO:
> I guess you hit this limit on some Shadowserver feeds.
> A sensible approach is to add some sort of "split" option to the collector bots. Yeah, this is not nice and perfect, but I'd add some logic like
>     if (sizeof(collected_data) > limit) {
>         grab header-line
>         foreach chunk of X data-lines {
>             push header-line + chunk of data-lines into REDIS queue
>         }
>     } else {
>         push original_data into REDIS queue
>     }
> to the collector bots. Also (and I haven't checked this), make sure you never try to hold everything in memory: download to temp files and stream-process those.
> If we're dealing with anything CSV-like, the only additional info that piece of code needs is whether
> - some extra comment line needs to be stripped
> - there is a header line
+1 Totally agree... I think this is more sensible.
A.
Dear Otmar, thank you very much for your valuable input.
On Friday, 22 July 2016 at 22:47:05, Otmar Lendl wrote:
> A sensible approach is to add some sort of "split" option to the collector bots.
We've already discussed this option here at Intevation. The current trend is that we will do something like this.
I'm looking forward to seeing the solution we are creating right now. As of this writing I have not looked into the details. I'll report to the list when I know more.
Best Regards
Dustin
On 25 Jul 2016, at 15:02, Dustin Demuth dustin.demuth@intevation.de wrote:
> Dear Otmar, thank you very much for your valuable input.
> On Friday, 22 July 2016 at 22:47:05, Otmar Lendl wrote:
>> A sensible approach is to add some sort of "split" option to the collector bots.
> We've already discussed this option here at Intevation. The current trend is that we will do something like this.
> I'm looking forward to seeing the solution we are creating right now. As of this writing I have not looked into the details. I'll report to the list when I know more.
Another option that we discussed here is to get rid of the base64 encoding. This saves quite some space as well!
Redis would be ok with UTF-8 (actually binary stuff).
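To put a number on the base64 overhead: the encoding emits 4 output bytes for every 3 input bytes, so payloads grow by roughly a third, which matters when a report approaches Redis's 512 MB value limit. A quick stdlib check:

```python
import base64

raw = b"\x00\x01binary report data" * 1000
encoded = base64.b64encode(raw)

# base64 produces one padded 4-byte group per 3 input bytes,
# i.e. roughly 33 % overhead on top of the raw payload.
print(len(raw), len(encoded), len(encoded) / len(raw))
```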
> Another option that we discussed here is to get rid of the base64 encoding. This saves quite some space as well!
> Redis would be ok with UTF-8 (actually binary stuff).
Currently, it is not possible because the data are serialized into JSON, which does not support binary data. This change would therefore require changing the serialization format, for example to msgpack, which supports binary data, is smaller than JSON, is probably faster in Python (https://gist.github.com/justinfx/3174062), and is supported in Redis scripts and, for example, in Redis Desktop Manager too.
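Jakub's point can be illustrated with the stdlib alone (the msgpack call mentioned in the comment is from the third-party msgpack library and is not executed here):

```python
import base64
import json

payload = b"\xff\xfe raw report bytes"

# json.dumps({"raw": payload}) raises TypeError: bytes are not JSON
# serializable. That is why the current format base64-encodes first,
# paying the size overhead discussed earlier in the thread.
msg = json.dumps({"raw": base64.b64encode(payload).decode("ascii")})
restored = base64.b64decode(json.loads(msg)["raw"])

# A binary-safe serializer could store the bytes directly instead,
# e.g. msgpack.packb({"raw": payload}) with the msgpack library.
```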
Jakub
On 26.07.2016 at 15:32, L. Aaron Kaplan wrote:
> On 25 Jul 2016, at 15:02, Dustin Demuth dustin.demuth@intevation.de wrote:
>> Dear Otmar, thank you very much for your valuable input.
>> On Friday, 22 July 2016 at 22:47:05, Otmar Lendl wrote:
>>> A sensible approach is to add some sort of "split" option to the collector bots.
>> We've already discussed this option here at Intevation. The current trend is that we will do something like this.
>> I'm looking forward to seeing the solution we are creating right now. As of this writing I have not looked into the details. I'll report to the list when I know more.
> Another option that we discussed here is to get rid of the base64 encoding. This saves quite some space as well!
> Redis would be ok with UTF-8 (actually binary stuff).
On 26 Jul 2016, at 15:51, Jakub Onderka j.onderka@nbu.cz wrote:
>> Another option that we discussed here is to get rid of the base64 encoding. This saves quite some space as well!
>> Redis would be ok with UTF-8 (actually binary stuff).
> Currently, it is not possible because the data are serialized into JSON, which does not support binary data.
Sure, that's an IntelMQ-specific thing. Redis per se would accept almost anything.
> So this change would require changing the serialization format, for example to msgpack, which supports binary data, is smaller than JSON, is probably faster in Python (https://gist.github.com/justinfx/3174062), and is supported in Redis scripts and, for example, in Redis Desktop Manager too.
Please note: I was not saying we should change the IntelMQ (JSON) format. I was just saying that Sebix and I discussed that we could, theoretically, get rid of the extra base64 encoding. This would save extra space (when the 512 MB Redis key/value limit is the issue).
Best, a.
Dear All,
It seems we have a solution for this problem now.
Bernhard has created a solution to split large CSV reports into chunks [1].
To do so, the collectors (in this case the "Mail-URL-Collector", which is the only one affected for our use case) are extended with `generate_reports()` from `intelmq.lib.splitreports`.
The collector can be extended with two parameters: `chunk_size`, determining the size of each chunk (I don't know the unit yet; it seems to be bytes), and `chunk_replicate_header`, which replicates the first line of the file into every chunk.
From my short look at the code, I see that splitreports cannot process comment lines (you might have seen those starting with a # sign).
Should this be integrated?
On Monday, 25 July 2016 at 15:02:55, Dustin Demuth wrote:
> I'm looking forward to seeing the solution we are creating right now. As of this writing I have not looked into the details. I'll report to the list when I know more.
[1] https://github.com/Intevation/intelmq/tree/dev-split-csv-reports
On 29 Jul 2016, at 08:45, Dustin Demuth dustin.demuth@intevation.de wrote:
> Dear All,
> It seems we have a solution for this problem now.
> Bernhard has created a solution to split large CSV reports into chunks [1].
> To do so, the collectors (in this case the "Mail-URL-Collector", which is the only one affected for our use case) are extended with `generate_reports()` from `intelmq.lib.splitreports`.
> The collector can be extended with two parameters: `chunk_size`, determining the size of each chunk (I don't know the unit yet; it seems to be bytes), and `chunk_replicate_header`, which replicates the first line of the file into every chunk.
> From my short look at the code, I see that splitreports cannot process comment lines (you might have seen those starting with a # sign).
> Should this be integrated?
Sounds like a very good idea.
Please send a PR. Also, concerning the implementation: did you check whether the Python csv module does not already supply functions for that?
I skimmed the source code and it looks reasonable upon first inspection. I'd prefer to see the reuse of more standard libs (whenever possible).
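For what it's worth, here is a sketch of how the standard csv module could be reused for the splitting; unlike a naive line-based split, csv correctly handles quoted fields that contain embedded newlines (the helper name and the rows-per-chunk parameter are illustrative):

```python
import csv
import io

def split_csv(text, rows_per_chunk, replicate_header=True):
    """Yield CSV chunks with at most rows_per_chunk data rows each,
    optionally repeating the header row at the top of every chunk."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows.pop(0) if replicate_header and rows else None
    for i in range(0, len(rows), rows_per_chunk):
        buf = io.StringIO()
        writer = csv.writer(buf)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows[i:i + rows_per_chunk])
        yield buf.getvalue()
```

For a genuinely large report one would iterate the reader row by row instead of materializing `list(csv.reader(...))`, in line with Otmar's stream-processing advice.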
Best, a.