Info Loss

Friday, July 14, 2017

Searching for a needle in a pcap haystack with pyshark

Faced with a bit of a challenge recently: I had a large (multi-megabyte) packet capture file from Wireshark and needed to extract information from the start of each SSL/TLS session in the capture. I could have used a Wireshark display filter to find SSL/TLS packets, but then manually sifting the client hello packets out of the capture and manually copying the needed data would have taken more time than I could spare for this task.

Fortunately, we can use the pyshark Python module to access packets in a pcap file using a loop and programmatically search for data in the packets of interest. I'm using MacPorts on MacOS, but pyshark doesn't seem to available, so I used "sudo /opt/local/bin/pip install pyshark" to install the module. I already have wireshark installed, and it conveniently has a link /usr/local/bin/tshark to run the text-mode wireshark tool needed by pyshark to extract data from pcap files.

thePacketGeek wrote a helpful series of articles on using pyshark, but didn't get as deep into the details of SSL/TLS packets as I needed. So, first step was to determine how to access the data of interest in SSL/TLS client hello packets. I extracted a single representative client hello packet from the large capture file using Wireshark's "Export Specified Packets" option in the file menu into a testing pcap file, and used the interactive Python interpreter to see what was available:

$ /opt/local/bin/python2.7
Python 2.7.13 (default, Apr 25 2017, 11:00:18)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyshark
>>> cap = pyshark.FileCapture('client-hello.pcapng')
>>> dir(cap[0])
>>> ['__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__format__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_packet_string', 'captured_length', 'eth', 'frame_info', 'get_multiple_layers', 'highest_layer', 'interface_captured', 'ip', 'layers', 'length', 'number', 'pretty_print', 'sniff_time', 'sniff_timestamp', 'ssl', 'tcp', 'transport_layer']

"ssl" looks interesting:

dir(cap[0].ssl)>>> ['', 'DATA_LAYER', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__format__', '__getattr__', '__getattribute__', '__getstate__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_all_fields', '_field_prefix', '_get_all_field_lines', '_get_all_fields_with_alternates', '_get_field_or_layer_repr', '_get_field_repr', '_layer_name', '_sanitize_field_name', 'field_names', 'get', 'get_field', 'get_field_by_showname', 'get_field_value', 'handshake', 'handshake_cipher_suites_length', 'handshake_ciphersuite', 'handshake_ciphersuites', 'handshake_comp_method', 'handshake_comp_methods', 'handshake_comp_methods_length', 'handshake_extension_len', 'handshake_extension_type', 'handshake_extensions_ec_point_format', 'handshake_extensions_ec_point_formats_length', 'handshake_extensions_elliptic_curve', 'handshake_extensions_elliptic_curves', 'handshake_extensions_elliptic_curves_length', 'handshake_extensions_length', 'handshake_extensions_reneg_info_len', 'handshake_extensions_server_name', 'handshake_extensions_server_name_len', 'handshake_extensions_server_name_list_len', 'handshake_extensions_server_name_type', 'handshake_extensions_status_request_exts_len', 'handshake_extensions_status_request_responder_ids_len', 'handshake_extensions_status_request_type', 'handshake_length', 'handshake_random', 'handshake_random_time', 'handshake_session_id_length', 'handshake_sig_hash_alg', 'handshake_sig_hash_alg_len', 'handshake_sig_hash_algs', 'handshake_sig_hash_hash', 'handshake_sig_hash_sig', 'handshake_type', 'handshake_version', 'layer_name', 'pretty_print', 'raw_mode', 'record', 'record_content_type', 'record_length', 'record_version']

pyshark pulled out a large number of named elements from this packet. I'm interested in the client hello's extension where the server name indication lives, so "handshake_extensions_server_name" looks useful.

cap[0].ssl.handshake_extensions_server_name
>>> 'www.bing.com'

It worked!

Now we can use this in a python script -- since not all packets in the capture are a TLS client hello with the Server Name Indication (SNI) extension, I wrapped the code into a try block to casually pass by any packets that didn't have the data I'm looking for, and call it from a loop over all the filename(s) on the command line:

import pyshark
import sys

def process(fn):
    cap = pyshark.FileCapture(input_file=fn, keep_packets=False)
    for pkt in cap:
        try:
            print pkt.ssl.handshake_extensions_server_name
        except AttributeError:
            pass

for i in range(1, len(sys.argv)):
    process(sys.argv[i])

(My actual program is a little more complex, but this is the fundamental task.)

This takes about 8 minutes to run through the hundreds of thousands of packets in a 125MB pcapng file, but saved hours of time that would have been needed to write an equivalent C++ program.

Wednesday, May 15, 2013

Searching Logs: A Work In Progress

A while back, I read a blog post at the SANS Internet Storm Center (ISC) handler's diary, "There's Value In Them There Logs" that piqued my interest. I'm well aware that logs are essential for error discovery and diagnosis as well as incident forensic analysis. The systems I build consistently provide valuable data in their logs to aid such analysis. However, I've long wanted an open-source centralized log tool that could merge and manage all my log data across all my systems.

In the ISC diary, there is a good diagram of a set of tools that can cooperate to build a useful log indexing and analysis system (rather than copy and describe all the components here, please see the original blog). I initially was a bit lost in the numerous pieces involved, but with a couple of days' worth of trial and investigation, it is making sense.

At the moment, I've pulled together Logstash to read and parse logs, ElasticSearch to store the log contents & indexes, and Kibana to visualize & search the log data. Logstash and ElasticSearch need a working Java Runtime (JRE); Kibana needs Ruby.

I initially followed the Logstash tutorials to get the Logstash component working. With all its flexibility, it can be a challenge to understand what Logstash is capable of, but the tutorial assists getting the software working, and by working through the steps I was able to figure out what Logstash was doing and why.

The standalone tutorial lead down the path of running Logstash in agent and web server modes. It wasn't clear to me immediately, but Logstash uses either an embedded ElasticSearch component or a companion ElasticSearch server to manage the log index and storage. I used logs from my mail server and other systems to feed it.

After my initial standalone trial, I tried out the centralized tutorial that uses Redis as a broker between Logstash instances and ElasticSearch. It was interesting to see how this functionality worked, but the centralized approach ended up complicating the architecture and diverting my attention from my goal: visualization and search.

Aside from my diversion into the centralized tutorial, something else was bothering me: the mail server logs I used were not being deeply parsed -- the log messages were being indexed and stored, but no semantics were being applied to the data. I wanted to be able to query on sendmail queue IDs, mail senders and recipients, rejected messages, and other useful data.

Logstash incorporates the very useful grok functionality to extract content and semantics from data using tagged regular expressions. Surprisingly, I didn't find built-in recipes to work with sendmail log data, so I rolled my own in this standalone logstash configuration:

input { stdin { type => "mail"}}
filter {
grok {
type => "mail"
pattern => [
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): timeout waiting for input from %{IPORHOST:timeoutHost} .*",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): Milter ($?<milter>\S+$| add|): (?<milterMsg>.*)",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): <(?<unknownUser>\S+)>\.\.\. User unknown",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: STARTTLS=(?<starttls>\S+), ((relay=%{IPORHOST:relay}( \[%{IPORHOST:relayip}\]( $may be forged$)?)?|version=(?<version>\S+)|verify=%{DATA:verify}|cipher=(?<cipher>[^,]+)|bits=(?<bits>\S+))(, |$))*",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>NOQUEUE): connect from (%{IPORHOST:host})?( ?\[%{IPORHOST:ip}\])?( ?$may be forged$?)?",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(sendmail|sm-mta[^,\[]+))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): ((to=(?<to>[^,]+)|from=(?<from>[^,]+)|ctladdr=(?<ctladdr>[^,]+)|delay=(?<delay>(\d+\+)?\d+:\d+:\d+)|xdelay=(?<xdelay>\d+:\d+:\d+)|mailer=(?<mailer>[^,]+)|pri=(?<pri>[^,]+)|dsn=(<dsn>[^,]+)|size=(?<size>\d+)|class=(?<class>\d+)|nrcpts=(?<nrcpts>\d+)|msgid=(?<msgid>[^,]+)|proto=(?<proto>[^,]+)|daemon=(?<daemon>[^,]+)|bodytype=(?<bodytype>\S+)|relay=(%{IPORHOST:relay})?( ?\[%{IPORHOST:relayip}\])?( ?$may be forged$?)?|reject=(?<reject>.*)|stat=(?<stat>[^,]+)|ruleset=(?<ruleset>[^,]+)|arg1=(?<arg1>[^,]+))(, |$))*",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(dovecot))(?:\[%{POSINT:pid}\])?: imap-login: Login: user=<%{DATA:user}>, method=%{DATA:method}, rip=%{IPORHOST:rip}, lip=%{IPORHOST:lip}, mpid=%{INT:mpid}(, TLS)?, session=<%{DATA:session}>",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(dovecot))(?:\[%{POSINT:pid}\])?: imap$%{DATA:user}$: (?<status>Disconnected: Logged out|Disconnected for inactivity) in=%{INT:in} out=%{INT:out}",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(opendkim))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): (?<milterMsg>.*)",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(milter-greylist))(?:\[%{POSINT:pid}\])?: (?<qid>\S+): (?<milterMsg>.*)",
"%{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} (?<program>(MailScanner))(?:\[%{POSINT:pid}\])?: (?<mailScannerMsg>.*)"
]
}
}
output {
stdout { debug => true debug_format => "json"}
elasticsearch { host => "127.0.0.1" }
}

Along with sendmail message parsing, I added matches for a few dovecot imap server, opendkim milter, and greylist milter messages. Note that I set type => "mail" for the input and the filter sections; as a result, ElasticSearch has the type "mail" set on the data received from this input and filter. Also, Logstash sets the index name to "logstash-YYYY.MM.DD" (where YYYY is four-digit year, MM is month, and DD is day of month) for ElasticSearch -- this can be useful to know when it comes time to query and visualize the data.

With this configuration, I've been able to parse my mail logs using:

java -jar logstash-1.1.11-flatjar.jar agent -f logstash-maillog-elasticsearch.conf < maillog.0

(Note that the ElasticSearch server was running in the background, and receiving requests from Logstash at 127.0.0.1:9200)

After this, ElasticSearch was full of tagged data. Now, I'd like to see what I have in there. I tried the HTTP access via Logstash's port 9292 but was a little underwhelmed at the spartan interface.

I installed Kibana using the simple instructions and started it up. With my browser pointed at its TCP port 5601, I adjusted its time selector at the top left of the page and had immediate access to all the data.

Now I can click down into interesting stuff. Importantly, it is fast! It looks like I may need to tweak the regexes in my Logstash filters, but now I can quickly research any issues and spot trends that bear investigation.

A concern I have is the security of these tools. There is no authentication or authorization for access to the number of TCP ports opened by each of these pieces. I'm not sure if there is a way to secure these tools, or if they need to be run in an isolated environment. So far, I'm isolating them in a private VM.

Wednesday, April 24, 2013

Security by Labels vs. Content

Generally, authorization security (determining whether a subject has access to data) is based on labels. For example, file pathnames determine what directory a file resides under, and accordingly, what discretionary access controls are assigned to the file. Firewalls determine what packets are authorized based on IP addresses and port numbers from packet headers. Document management systems often require users to apply tags to newly-scanned documents so the documents can be protected and routed appropriately.

These labels we assign to data (filenames, port numbers, tags, etc.) need to be representative of the information contents. We often depend on users to use appropriate and correct labels so we can implement hard and fast controls on the data.

Unfortunately, labels are often indeterminate or not representative of the content. For example, an HTTPS stream to a site like GotoMyPC that actually is providing remote access to a PC screen results in complete access to any data and applications on that PC, but the contents of that HTTPS stream can't be controlled short of blocking all access to the GotoMyPC web site.

Content-aware data loss prevention systems use a variety of approaches to authorize data (in use, at rest, or in motion) based on the actual content of the data. For those who understand and accept its approach, it enables deeper understanding of information and also enables more intelligent authorization decisions. DLP also provides a backstop when other access controls fail, such as when users forget to correctly tag a document.

Wednesday, April 18, 2012

Perfect Security?

Many years ago, I was privileged to hear Marcus Ranum speak at a conference for our regional NSFNet member network. At the time, I was of the mindset that it was possible to have perfect security for the computer systems and networks I managed, and I was not willing to compromise security for any purpose.

For example, when my employer at the time wanted to build a way to accept credit cards via the web, I proposed an isolated database server behind multiple firewalls -- mind you, this was long before PCI-DSS! Instead of taking the perfect solution, they probably just accepted credit card numbers via email...

Anyway, I understood Marcus to say that business needs had priority, and in particular, sometimes the business (and its software and systems) has to be built in advance of the security. This did not mean that we needed to ignore or discard security, but to be cognizant of the business needs -- if there's no business, there's no need for security.

So, we need to manage risks and prepare to respond to problems rather than wait to enable business operations until known risks are eliminated.

Friday, March 23, 2012

Verizon Data Breach Report 2012

The Verizon Data Breach Report 2012 (pdf) has been released. The information security industry owes Verizon gratitude for the amount of data Verizon has been able to assemble and analyze, and for making the results publicly available.

Unsurprisingly, the total number of records breached in 2011 was quite large. The majority of the breaches were motivated by "hacktivism" rather than illicit financial gains, but Verizon points out that serious criminals are still actively stealing data.

Regardless of the motivations by attackers, 2011 was a terrible year for the number of breaches and the amount of data lost.

Wednesday, March 14, 2012

RSA Conference 2012 Post-mortem

This year, my schedule at the RSA Conference 2012 was much different than previous conferences. As a speaker, I spent quite a bit of time preparing and rehearsing my presentation, as well as talking with other presenters. Of course, audiences get a lot out of the presentations and meeting the presenters afterwards, but it's a step up to be able to meet and talk with presenters informally about the industry, security issues and solutions for customers, and the direction of technologies.

Looking back at the past year and the significant number of huge data loss events, I thought I saw that people were looking to step up their game against breaches. I liked what I heard from industry industry leaders - concepts with the potential to improve data security: 1) better communication and interaction between software development and operations, such as Josh Corman and Gene Kim's Rugged DevOps talk, 2) improving security functionality for cloud - Chris Hoff and Rich Mogul's Grilling Cloudicorns talk, and 3) improving mobile device security.

I'm looking forward to digging into these ideas further in the coming year.

Thursday, February 2, 2012

RSA Conference 2012 - Data Breaches and Web Servers: The Giant Sucking Sound

I'm scheduled to present "Data Breaches and Web Servers: The Giant Sucking Sound" at RSA Conference 2012 - session DAS-204 on Wednesday, February 29.
From the abstract:

An analysis of recent data breach events shows a large number of events occur via web servers. Barracuda, Epsilon, Citigroup, eHarmony, Sony and the State of Texas are just a few of the names in the news as a result of web data exposures. Web servers in the cloud only complicate the situation. This presentation will examine technologies and practices you can apply to help keep your name off this list.

Since I submitted the abstract several months ago, there have been several additional major breaches of web servers including Stratfor, Zappos and Care2, so the giant sucking continues.

Hope to meet you at the RSA Conference!

Guy