Dear all,
Currently we are facing the problem that IntelMQ is not capable of handling large reports (> 500 MB) when using Redis as a message queuing system. At first we thought this might be fixed in the most recent Redis versions (see [1]), but apparently this is not the case.
Are you aware of setups of IntelMQ which use a different backend, like ZMQ?
Do you know of performance or other issues which occurred when using other queuing systems?
[1] https://github.com/certtools/intelmq/issues/547
Best Regards, Dustin
On 22 Jul 2016, at 09:02, Dustin Demuth dustin.demuth@intevation.de wrote:
> Dear all,
> Currently we are facing the problem that IntelMQ is not capable of handling large reports (> 500 MB) when using Redis as a message queuing system. At first we thought this might be fixed in the most recent Redis versions (see [1]), but apparently this is not the case.
> Are you aware of setups of IntelMQ which use a different backend, like ZMQ?
> Do you know of performance or other issues which occurred when using other queuing systems?
We originally had RabbitMQ as a message queuing system. Redis was quicker. But there is no reason why we could not additionally add ZMQ (or RabbitMQ) as a possible backend. That's why the pipeline.py module provides an abstraction.
Best, a.
On 22.07.2016 09:02, Dustin Demuth wrote:
> Dear all,
> Currently we are facing the problem that IntelMQ is not capable of handling large reports (> 500 MB) when using Redis as a message queuing system. At first we thought this might be fixed in the most recent Redis versions (see [1]), but apparently this is not the case.
Dustin,
500 MB is quite a significant amount of data and IMHO something where passing around such chunks in a RAM-based system is no longer sensible.
This does not only apply to IntelMQ: I've seen a number of code-bases fall over when they were confronted with large data-sets.
(Once a student wrote a tool for us to detect CMS versions based on a list of domains. All nice and fine for his test data, but once we put the 1M .at domains in, it broke down. The "load everything into RAM" approach has limits.)
IMHO:
I guess you hit this limit on some Shadowserver feeds.
A sensible approach is to add some sort of "split" option to the collector bots. Yeah, this is not nice and perfect, but I'd add some logic like
    if (sizeof(collected_data) > limit) {
        grab header-line
        foreach chunk of X data-lines {
            push header-line + chunk of data-lines into REDIS queue
        }
    } else {
        push original_data into REDIS queue
    }
to the collector bots. Also (and I haven't checked this), make sure you never try to hold everything in memory: download to temp files and stream-process those.
If we're dealing with anything CSV-like, the only additional info that piece of code needs is whether
- some extra comment line needs to be stripped
- there is a header line
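In Python, that split logic could be sketched roughly like this (a minimal illustration, not IntelMQ code; the function name, the line-based `chunk_size`, and the comment-line handling are all assumptions):

```python
def split_report(path, chunk_size=50000, has_header=True):
    """Yield chunks of at most chunk_size data lines, each prefixed
    with the header line so downstream parsers still work."""
    with open(path, "r", encoding="utf-8") as fh:
        first = fh.readline()
        # Skip leading comment lines (some feeds prepend them).
        while first.startswith("#"):
            first = fh.readline()
        if has_header:
            header, chunk = first, []
        else:
            header, chunk = "", [first]
        for line in fh:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                yield header + "".join(chunk)
                chunk = []
        if chunk:
            yield header + "".join(chunk)
```

For a real > 500 MB report you would point this at a downloaded temp file and push each yielded chunk into the queue as its own report, so nothing ever holds the whole file in memory.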
otmar
On 22 Jul 2016, at 22:47, Otmar Lendl lendl@cert.at wrote:
> On 22.07.2016 09:02, Dustin Demuth wrote:
>> Dear all,
>> Currently we are facing the problem that IntelMQ is not capable of handling large reports (> 500 MB) when using Redis as a message queuing system. At first we thought this might be fixed in the most recent Redis versions (see [1]), but apparently this is not the case.
> Dustin,
> 500 MB is quite a significant amount of data and IMHO something where passing around such chunks in a RAM-based system is no longer sensible.
> This does not only apply to IntelMQ: I've seen a number of code-bases fall over when they were confronted with large data-sets.
> (Once a student wrote a tool for us to detect CMS versions based on a list of domains. All nice and fine for his test data, but once we put the 1M .at domains in, it broke down. The "load everything into RAM" approach has limits.)
> IMHO:
> I guess you hit this limit on some Shadowserver feeds.
> A sensible approach is to add some sort of "split" option to the collector bots. Yeah, this is not nice and perfect, but I'd add some logic like
>     if (sizeof(collected_data) > limit) {
>         grab header-line
>         foreach chunk of X data-lines {
>             push header-line + chunk of data-lines into REDIS queue
>         }
>     } else {
>         push original_data into REDIS queue
>     }
> to the collector bots. Also (and I haven't checked this), make sure you never try to hold everything in memory: download to temp files and stream-process those.
> If we're dealing with anything CSV-like, the only additional info that piece of code needs is whether
> - some extra comment line needs to be stripped
> - there is a header line
+1 Totally agree... I think this is more sensible.
A.
Dear Otmar, thank you very much for your valuable input.
On Friday, 22 July 2016 at 22:47:05, Otmar Lendl wrote:
> A sensible approach is to add some sort of "split" option to the collector bots.
We've already discussed this option here at Intevation. The current trend is that we will do something like this.
I'm looking forward to seeing the solution we are creating right now. As of this writing I have not looked into the details. I'll report to the list when I know more.
Best Regards
Dustin
On 25 Jul 2016, at 15:02, Dustin Demuth dustin.demuth@intevation.de wrote:
> Dear Otmar, thank you very much for your valuable input.
> On Friday, 22 July 2016 at 22:47:05, Otmar Lendl wrote:
>> A sensible approach is to add some sort of "split" option to the collector bots.
> We've already discussed this option here at Intevation. The current trend is that we will do something like this.
> I'm looking forward to seeing the solution we are creating right now. As of this writing I have not looked into the details. I'll report to the list when I know more.
Another option that we discussed here is to get rid of the base64 encoding. This saves quite some space as well!
Redis would be ok with UTF-8 (actually binary stuff).
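To put a number on the base64 overhead: the encoding emits 4 output bytes for every 3 input bytes, so payloads grow by roughly a third, which matters when a report approaches Redis's 512 MB value limit. A quick stdlib check:

```python
import base64

raw = b"\x00\x01binary report data" * 1000
encoded = base64.b64encode(raw)

# base64 produces one padded 4-byte group per 3 input bytes,
# i.e. roughly 33 % overhead on top of the raw payload.
print(len(raw), len(encoded), len(encoded) / len(raw))
```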
> Another option that we discussed here is to get rid of the base64 encoding. This saves quite some space as well!
> Redis would be ok with UTF-8 (actually binary stuff).
Currently, it is not possible because the data are serialized into JSON, which does not support binary data. This change would therefore require changing the serialization format, for example to msgpack, which supports binary data, is smaller than JSON, is probably faster in Python (https://gist.github.com/justinfx/3174062), and is supported in Redis scripts and, for example, in Redis Desktop Manager too.
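Jakub's point can be illustrated with the stdlib alone (the msgpack call mentioned in the comment is from the third-party msgpack library and is not executed here):

```python
import base64
import json

payload = b"\xff\xfe raw report bytes"

# json.dumps({"raw": payload}) raises TypeError: bytes are not JSON
# serializable. That is why the current format base64-encodes first,
# paying the size overhead discussed earlier in the thread.
msg = json.dumps({"raw": base64.b64encode(payload).decode("ascii")})
restored = base64.b64decode(json.loads(msg)["raw"])

# A binary-safe serializer could store the bytes directly instead,
# e.g. msgpack.packb({"raw": payload}) with the msgpack library.
```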
Jakub
On 26.07.2016 at 15:32, L. Aaron Kaplan wrote:
> On 25 Jul 2016, at 15:02, Dustin Demuth dustin.demuth@intevation.de wrote:
>> Dear Otmar, thank you very much for your valuable input.
>> On Friday, 22 July 2016 at 22:47:05, Otmar Lendl wrote:
>>> A sensible approach is to add some sort of "split" option to the collector bots.
>> We've already discussed this option here at Intevation. The current trend is that we will do something like this.
>> I'm looking forward to seeing the solution we are creating right now. As of this writing I have not looked into the details. I'll report to the list when I know more.
> Another option that we discussed here is to get rid of the base64 encoding. This saves quite some space as well!
> Redis would be ok with UTF-8 (actually binary stuff).
On 26 Jul 2016, at 15:51, Jakub Onderka j.onderka@nbu.cz wrote:
>> Another option that we discussed here is to get rid of the base64 encoding. This saves quite some space as well!
>> Redis would be ok with UTF-8 (actually binary stuff).
> Currently, it is not possible because the data are serialized into JSON, which does not support binary data.
Sure, that's an IntelMQ-specific thing. Redis per se would accept almost anything.
> So this change would require changing the serialization format, for example to msgpack, which supports binary data, is smaller than JSON, is probably faster in Python (https://gist.github.com/justinfx/3174062), and is supported in Redis scripts and, for example, in Redis Desktop Manager too.
Please note: I was not saying we should change the IntelMQ (JSON) format. I was just saying that Sebix and I discussed that we could, theoretically, get rid of the extra base64 encoding. This would save extra space (when the 512 MB Redis key/value limit is the issue).
Best, a.
Dear All,
It seems we have a solution for this problem now.
Bernhard has created a solution to split large CSV reports into chunks [1].
To do so, the collectors (in this case the "Mail-URL-Collector", which is the only one affected for our use case) are extended with `generate_reports()` from `intelmq.lib.splitreports`.
The collector can be extended with two parameters: `chunk_size`, determining the size of each chunk (I don't know the unit yet; it seems to be bytes), and `chunk_replicate_header`, which replicates the first line of the file into every chunk.
From my short look at the code, I see that splitreports cannot process comment lines (you might have seen those starting with a # sign).
Should this be integrated?
On Monday, 25 July 2016 at 15:02:55, Dustin Demuth wrote:
> I'm looking forward to seeing the solution we are creating right now. As of this writing I have not looked into the details. I'll report to the list when I know more.
[1] https://github.com/Intevation/intelmq/tree/dev-split-csv-reports
On 29 Jul 2016, at 08:45, Dustin Demuth dustin.demuth@intevation.de wrote:
> Dear All,
> It seems we have a solution for this problem now.
> Bernhard has created a solution to split large CSV reports into chunks [1].
> To do so, the collectors (in this case the "Mail-URL-Collector", which is the only one affected for our use case) are extended with `generate_reports()` from `intelmq.lib.splitreports`.
> The collector can be extended with two parameters: `chunk_size`, determining the size of each chunk (I don't know the unit yet; it seems to be bytes), and `chunk_replicate_header`, which replicates the first line of the file into every chunk.
> From my short look at the code, I see that splitreports cannot process comment lines (you might have seen those starting with a # sign).
> Should this be integrated?
Sounds like a very good idea.
Please send a PR. Also, concerning the implementation: did you check whether the Python csv module does not already supply functions for that?
I skimmed the source code and it looks reasonable upon first inspection. I'd prefer to see the reuse of more standard libs (whenever possible).
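For what it's worth, here is a sketch of how the standard csv module could be reused for the splitting; unlike a naive line-based split, csv correctly handles quoted fields that contain embedded newlines (the helper name and the rows-per-chunk parameter are illustrative):

```python
import csv
import io

def split_csv(text, rows_per_chunk, replicate_header=True):
    """Yield CSV chunks with at most rows_per_chunk data rows each,
    optionally repeating the header row at the top of every chunk."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows.pop(0) if replicate_header and rows else None
    for i in range(0, len(rows), rows_per_chunk):
        buf = io.StringIO()
        writer = csv.writer(buf)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows[i:i + rows_per_chunk])
        yield buf.getvalue()
```

For a genuinely large report one would iterate the reader row by row instead of materializing `list(csv.reader(...))`, in line with Otmar's stream-processing advice.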
Best, a.