Dear Otmar, All
thank you very much for your detailed feedback!
On Thursday, 16 June 2016 13:04:00, Otmar Lendl wrote:
I've been maintaining parsers (in a different context) for Shadowserver feeds for the last 3 years. Based on that experience a few comments:
- Don't assume that the field-names will stay constant. Be prepared to
support logic like "use 'ip' or 'srcip' for the IntelMQ 'source.ip'".
We have already seen this phenomenon; I guess the most recent change was "cc_ip". That's one of the reasons I extracted the mappings from the parser code.
For the Drone feed, I e.g. have the following mapping rules in our old system:
    # Mapping from local CSV column names to eventDB column names
    $self->{eventdb_map} = {
        asn          => "reported_asn",
        ip           => "src_ip",
        hostname     => "src_hostname",
        port         => "src_port",
        cc           => "dst_ip",
        cc_ip        => "dst_ip",
        cc_port      => "dst_port",
        cc_dns       => "dst_fqdn",
        timestamp    => "ts",
        url          => "dst_url",
        geo          => "reported_iso2cc",
        infection    => "malware",
        machine_name => "local_hostname",
        # older names
        "Timestamp"  => "ts",
        "Drone"      => "src_ip",
        "ASN"        => "reported_asn",
        "Geo"        => "reported_iso2cc",
        "Hostname"   => "src_hostname",
        "C&C"        => "dst_ip",
        "C&C DNS"    => "dst_fqdn",
        "C&C Port"   => "dst_port",
        "Infection"  => "malware",
    };
This seems to be equivalent to our mapping.
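To make the "use 'ip' or 'srcip'" idea concrete, here is a minimal sketch of an alias table that is kept separate from the parser code. It is not the actual IntelMQ implementation: the names FIELD_ALIASES and map_row are hypothetical, and apart from 'source.ip' the target field names are only assumed to match the IntelMQ naming scheme.

```python
# Hypothetical alias table: target field -> candidate feed column names,
# checked in order. Column names are taken from the discussion above;
# the target names other than 'source.ip' are assumptions.
FIELD_ALIASES = {
    "source.ip": ["ip", "srcip", "Drone"],
    "destination.ip": ["cc_ip", "cc", "C&C"],
    "source.asn": ["asn", "ASN"],
}


def map_row(row, aliases=FIELD_ALIASES):
    """Return a dict of target fields, using the first non-empty alias found."""
    event = {}
    for target, candidates in aliases.items():
        for name in candidates:
            value = row.get(name)
            if value not in (None, ""):
                event[target] = value
                break  # first matching column wins
    return event


# Current and older column names end up in the same target field.
new_style = map_row({"ip": "192.0.2.1", "cc_ip": "198.51.100.7"})
old_style = map_row({"Drone": "192.0.2.1", "ASN": "64496"})
```

With a table like this, a renamed column in the feed only needs a new entry in the list, not a change to the parser logic.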
- I see you support a fixup-function for each attribute. Yes, this is
needed but potentially not good enough. The reason is that you might need to manipulate multiple fields together, e.g. it varies by feed whether C&C URLs are transmitted as full URL or split up in proto/port/hostname/path. If you want to unify these fields, a single function per attribute will not do.
Yes, right now only one parameter is evaluated. I am aware that more complex operations might be required in the near future. I've also seen this requirement for the fqdn / url fields.
[see: https://github.com/certtools/intelmq/issues/524#issue-155435422, last point]
I'm not sure whether deducing the correct information from the feed will work as expected. Even with our limited amount of data I could already see that the information needed to calculate the correct value is not available in every case (the protocol may be missing, or HTTPS on port 80 might be possible). By calculating these values, one could make false assumptions.
Nevertheless, it seems that this approach works out for you, at least for virustracker. This is great news.
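As an illustration of the multi-field case, here is a minimal sketch of a conversion function that receives the whole row instead of a single attribute. It is only a sketch under assumptions: the function name build_destination_url and the column names 'proto' and 'path' are hypothetical, and the function deliberately returns nothing rather than guessing when the protocol is missing.

```python
def build_destination_url(row):
    """Combine several feed columns into one URL, or return None.

    Hypothetical column names: 'url', 'proto', 'cc_dns', 'cc_port', 'path'.
    """
    if row.get("url"):
        return row["url"]  # the feed already delivers the full URL

    proto = row.get("proto")
    host = row.get("cc_dns")
    if not proto or not host:
        # Do not guess: port 80 alone does not prove the scheme is http.
        return None

    port = row.get("cc_port")
    path = row.get("path") or "/"
    if not path.startswith("/"):
        path = "/" + path
    netloc = f"{host}:{port}" if port else host
    return f"{proto}://{netloc}{path}"


full = build_destination_url({"url": "http://example.com/a"})
split = build_destination_url(
    {"proto": "https", "cc_dns": "example.com", "cc_port": "8443", "path": "/gate.php"}
)
unknown = build_destination_url({"cc_dns": "example.com", "cc_port": "80"})
```

Leaving the field empty when the protocol is unknown avoids the false assumptions mentioned above, at the cost of losing some events that a heuristic could have filled in.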
HTH,
Yes, very much!
BR Dustin