Wednesday, May 15, 2013

Searching Logs: A Work In Progress

A while back, I read a blog post at the SANS Internet Storm Center (ISC) handler's diary, "There's Value In Them There Logs" that piqued my interest. I'm well aware that logs are essential for error discovery and diagnosis as well as incident forensic analysis. The systems I build consistently provide valuable data in their logs to aid such analysis. However, I've long wanted an open-source centralized log tool that could merge and manage all my log data across all my systems.

In the ISC diary, there is a good diagram of a set of tools that can cooperate to build a useful log indexing and analysis system (rather than copy and describe all the components here, please see the original blog). I initially was a bit lost in the numerous pieces involved, but with a couple of days' worth of trial and investigation, it is making sense.

At the moment, I've pulled together Logstash to read and parse logs, ElasticSearch to store the log contents & indexes, and Kibana to visualize & search the log data. Logstash and ElasticSearch need a working Java Runtime (JRE); Kibana needs Ruby.

I initially followed the Logstash tutorials to get the Logstash component working. With all its flexibility, it can be a challenge to understand what Logstash is capable of, but the tutorial assists getting the software working, and by working through the steps I was able to figure out what Logstash was doing and why.

The standalone tutorial lead down the path of running Logstash in agent and web server modes. It wasn't clear to me immediately, but Logstash uses either an embedded ElasticSearch component or a companion ElasticSearch server to manage the log index and storage. I used logs from my mail server and other systems to feed it.

After my initial standalone trial, I tried out the centralized tutorial that uses Redis as a broker between Logstash instances and ElasticSearch. It was interesting to see how this functionality worked, but the centralized approach ended up complicating the architecture and diverting my attention from my goal: visualization and search.

Aside from my diversion into the centralized tutorial, something else was bothering me: the mail server logs I used were not being deeply parsed -- the log messages were being indexed and stored, but no semantics were being applied to the data. I wanted to be able to query on sendmail queue IDs, mail senders and recipients, rejected messages, and other useful data.

Logstash incorporates the very useful grok functionality to extract content and semantics from data using tagged regular expressions. Surprisingly, I didn't find built-in recipes to work with sendmail log data, so I rolled my own in this standalone logstash configuration:

input { stdin { type => "mail"}}
filter {
  grok {
    type => "mail"
    pattern => [
    "%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): timeout waiting for input from %{IPORHOST:timeoutHost} .*",
        "%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): Milter (\(?<milter>\S+\)| add|): (?<milterMsg>.*)",
        "%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): <(?<unknownUser>\S+)>\.\.\. User unknown",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: STARTTLS=(?<starttls>\S+), ((relay=%{IPORHOST:relay}( \[%{IPORHOST:relayip}\]( \(may be forged\))?)?|version=(?<version>\S+)|verify=%{DATA:verify}|cipher=(?<cipher>[^,]+)|bits=(?<bits>\S+))(, |$))*",
        "%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>NOQUEUE): connect from (%{IPORHOST:host})?( ?\[%{IPORHOST:ip}\])?( ?\(may be forged\)?)?",
      "%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): ((to=(?<to>[^,]+)|from=(?<from>[^,]+)|ctladdr=(?<ctladdr>[^,]+)|delay=(?<delay>(\d+\+)?\d+:\d+:\d+)|xdelay=(?<xdelay>\d+:\d+:\d+)|mailer=(?<mailer>[^,]+)|pri=(?<pri>[^,]+)|dsn=(<dsn>[^,]+)|size=(?<size>\d+)|class=(?<class>\d+)|nrcpts=(?<nrcpts>\d+)|msgid=(?<msgid>[^,]+)|proto=(?<proto>[^,]+)|daemon=(?<daemon>[^,]+)|bodytype=(?<bodytype>\S+)|relay=(%{IPORHOST:relay})?( ?\[%{IPORHOST:relayip}\])?( ?\(may be forged\)?)?|reject=(?<reject>.*)|stat=(?<stat>[^,]+)|ruleset=(?<ruleset>[^,]+)|arg1=(?<arg1>[^,]+))(, |$))*",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(dovecot))(?:\[%{POSINT:pid}\])?: imap-login: Login: user=<%{DATA:user}>, method=%{DATA:method}, rip=%{IPORHOST:rip}, lip=%{IPORHOST:lip}, mpid=%{INT:mpid}(, TLS)?, session=<%{DATA:session}>",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(dovecot))(?:\[%{POSINT:pid}\])?: imap\(%{DATA:user}\): (?<status>Disconnected: Logged out|Disconnected for inactivity) in=%{INT:in} out=%{INT:out}",
    "%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(opendkim))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): (?<milterMsg>.*)",
    "%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(milter-greylist))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): (?<milterMsg>.*)",
    "%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(MailScanner))(?:\[%{POSINT:pid}\])?: (?<mailScannerMsg>.*)"
output {
  stdout { debug => true debug_format => "json"}
  elasticsearch { host => "" }
Along with sendmail message parsing, I added matches for a few dovecot imap server, opendkim milter, and greylist milter messages. Note that I set type => "mail" for the input and the filter sections; as a result, ElasticSearch has the type "mail" set on the data received from this input and filter. Also, Logstash sets the index name to "logstash-YYYY.MM.DD" (where YYYY is four-digit year, MM is month, and DD is day of month) for ElasticSearch -- this can be useful to know when it comes time to query and visualize the data.

With this configuration, I've been able to parse my mail logs using:

java -jar logstash-1.1.11-flatjar.jar agent -f logstash-maillog-elasticsearch.conf < maillog.0
(Note that the ElasticSearch server was running in the background, and receiving requests from Logstash at

After this, ElasticSearch was full of tagged data. Now, I'd like to see what I have in there. I tried the HTTP access via Logstash's port 9292 but was a little underwhelmed at the spartan interface.

I installed Kibana using the simple instructions and started it up. With my browser pointed at its TCP port 5601, I adjusted its time selector at the top left of the page and had immediate access to all the data.

Now I can click down into interesting stuff. Importantly, it is fast! It looks like I may need to tweak the regexes in my Logstash filters, but now I can quickly research any issues and spot trends that bear investigation.

A concern I have is the security of these tools. There is no authentication or authorization for access to the number of TCP ports opened by each of these pieces. I'm not sure if there is a way to secure these tools, or if they need to be run in an isolated environment. So far, I'm isolating them in a private VM.

Wednesday, April 24, 2013

Security by Labels vs. Content

Generally, authorization security (determining whether a subject has access to data) is based on labels. For example, file pathnames determine what directory a file resides under, and accordingly, what discretionary access controls are assigned to the file. Firewalls determine what packets are authorized based on IP addresses and port numbers from packet headers. Document management systems often require users to apply tags to newly-scanned documents so the documents can be protected and routed appropriately.

These labels we assign to data (filenames, port numbers, tags, etc.) need to be representative of the information contents. We often depend on users to use appropriate and correct labels so we can implement hard and fast controls on the data.

Unfortunately, labels are often indeterminate or not representative of the content. For example, an HTTPS stream to a site like GotoMyPC that actually is providing remote access to a PC screen results in complete access to any data and applications on that PC, but the contents of that HTTPS stream can't be controlled short of blocking all access to the GotoMyPC web site.

Content-aware data loss prevention systems use a variety of approaches to authorize data (in use, at rest, or in motion) based on the actual content of the data. For those who understand and accept its approach, it enables deeper understanding of information and also enables more intelligent authorization decisions. DLP also provides a backstop when other access controls fail, such as when users forget to correctly tag a document.