Info Loss: Searching Logs: A Work In Progress

A while back, I read a blog post at the SANS Internet Storm Center (ISC) handler's diary, "There's Value In Them There Logs" that piqued my interest. I'm well aware that logs are essential for error discovery and diagnosis as well as incident forensic analysis. The systems I build consistently provide valuable data in their logs to aid such analysis. However, I've long wanted an open-source centralized log tool that could merge and manage all my log data across all my systems.

In the ISC diary, there is a good diagram of a set of tools that can cooperate to build a useful log indexing and analysis system (rather than copy and describe all the components here, please see the original blog). I initially was a bit lost in the numerous pieces involved, but with a couple of days' worth of trial and investigation, it is making sense.

At the moment, I've pulled together Logstash to read and parse logs, ElasticSearch to store the log contents & indexes, and Kibana to visualize & search the log data. Logstash and ElasticSearch need a working Java Runtime (JRE); Kibana needs Ruby.

I initially followed the Logstash tutorials to get the Logstash component working. With all its flexibility, it can be a challenge to understand what Logstash is capable of, but the tutorial assists getting the software working, and by working through the steps I was able to figure out what Logstash was doing and why.

The standalone tutorial lead down the path of running Logstash in agent and web server modes. It wasn't clear to me immediately, but Logstash uses either an embedded ElasticSearch component or a companion ElasticSearch server to manage the log index and storage. I used logs from my mail server and other systems to feed it.

After my initial standalone trial, I tried out the centralized tutorial that uses Redis as a broker between Logstash instances and ElasticSearch. It was interesting to see how this functionality worked, but the centralized approach ended up complicating the architecture and diverting my attention from my goal: visualization and search.

Aside from my diversion into the centralized tutorial, something else was bothering me: the mail server logs I used were not being deeply parsed -- the log messages were being indexed and stored, but no semantics were being applied to the data. I wanted to be able to query on sendmail queue IDs, mail senders and recipients, rejected messages, and other useful data.

Logstash incorporates the very useful grok functionality to extract content and semantics from data using tagged regular expressions. Surprisingly, I didn't find built-in recipes to work with sendmail log data, so I rolled my own in this standalone logstash configuration:

input { stdin { type => "mail"}}
filter {
grok {
type => "mail"
pattern => [
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): timeout waiting for input from %{IPORHOST:timeoutHost} .*",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): Milter ($?<milter>\S+$| add|): (?<milterMsg>.*)",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): <(?<unknownUser>\S+)>\.\.\. User unknown",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: STARTTLS=(?<starttls>\S+), ((relay=%{IPORHOST:relay}( \[%{IPORHOST:relayip}\]( $may be forged$)?)?|version=(?<version>\S+)|verify=%{DATA:verify}|cipher=(?<cipher>[^,]+)|bits=(?<bits>\S+))(, |$))*",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>NOQUEUE): connect from (%{IPORHOST:host})?( ?\[%{IPORHOST:ip}\])?( ?$may be forged$?)?",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): ((to=(?<to>[^,]+)|from=(?<from>[^,]+)|ctladdr=(?<ctladdr>[^,]+)|delay=(?<delay>(\d+\+)?\d+:\d+:\d+)|xdelay=(?<xdelay>\d+:\d+:\d+)|mailer=(?<mailer>[^,]+)|pri=(?<pri>[^,]+)|dsn=(<dsn>[^,]+)|size=(?<size>\d+)|class=(?<class>\d+)|nrcpts=(?<nrcpts>\d+)|msgid=(?<msgid>[^,]+)|proto=(?<proto>[^,]+)|daemon=(?<daemon>[^,]+)|bodytype=(?<bodytype>\S+)|relay=(%{IPORHOST:relay})?( ?\[%{IPORHOST:relayip}\])?( ?$may be forged$?)?|reject=(?<reject>.*)|stat=(?<stat>[^,]+)|ruleset=(?<ruleset>[^,]+)|arg1=(?<arg1>[^,]+))(, |$))*",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(dovecot))(?:\[%{POSINT:pid}\])?: imap-login: Login: user=<%{DATA:user}>, method=%{DATA:method}, rip=%{IPORHOST:rip}, lip=%{IPORHOST:lip}, mpid=%{INT:mpid}(, TLS)?, session=<%{DATA:session}>",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(dovecot))(?:\[%{POSINT:pid}\])?: imap$%{DATA:user}$: (?<status>Disconnected: Logged out|Disconnected for inactivity) in=%{INT:in} out=%{INT:out}",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(opendkim))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): (?<milterMsg>.*)",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(milter-greylist))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): (?<milterMsg>.*)",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(MailScanner))(?:\[%{POSINT:pid}\])?: (?<mailScannerMsg>.*)"
]
}
}
output {
stdout { debug => true debug_format => "json"}
elasticsearch { host => "127.0.0.1" }
}

Along with sendmail message parsing, I added matches for a few dovecot imap server, opendkim milter, and greylist milter messages. Note that I set type => "mail" for the input and the filter sections; as a result, ElasticSearch has the type "mail" set on the data received from this input and filter. Also, Logstash sets the index name to "logstash-YYYY.MM.DD" (where YYYY is four-digit year, MM is month, and DD is day of month) for ElasticSearch -- this can be useful to know when it comes time to query and visualize the data.

With this configuration, I've been able to parse my mail logs using:

java -jar logstash-1.1.11-flatjar.jar agent -f logstash-maillog-elasticsearch.conf < maillog.0

(Note that the ElasticSearch server was running in the background, and receiving requests from Logstash at 127.0.0.1:9200)

After this, ElasticSearch was full of tagged data. Now, I'd like to see what I have in there. I tried the HTTP access via Logstash's port 9292 but was a little underwhelmed at the spartan interface.

I installed Kibana using the simple instructions and started it up. With my browser pointed at its TCP port 5601, I adjusted its time selector at the top left of the page and had immediate access to all the data.

Now I can click down into interesting stuff. Importantly, it is fast! It looks like I may need to tweak the regexes in my Logstash filters, but now I can quickly research any issues and spot trends that bear investigation.

A concern I have is the security of these tools. There is no authentication or authorization for access to the number of TCP ports opened by each of these pieces. I'm not sure if there is a way to secure these tools, or if they need to be run in an isolated environment. So far, I'm isolating them in a private VM.

Info Loss

Wednesday, May 15, 2013

Searching Logs: A Work In Progress

No comments:

Post a Comment