Friday, July 14, 2017

Searching for a needle in a pcap haystack with pyshark

I faced a bit of a challenge recently: I had a large (multi-megabyte) packet capture file from Wireshark and needed to extract information from the start of each SSL/TLS session in the capture. I could have used a Wireshark display filter to find the SSL/TLS packets, but sifting the client hello packets out of the capture and copying the needed data by hand would have taken more time than I could spare for this task.

Fortunately, we can use the pyshark Python module to loop over the packets in a pcap file and programmatically search for data in the packets of interest. I'm using MacPorts on macOS, but pyshark doesn't seem to be available there, so I used "sudo /opt/local/bin/pip install pyshark" to install the module. I already have Wireshark installed, and it conveniently provides a link, /usr/local/bin/tshark, to the text-mode Wireshark tool that pyshark needs to extract data from pcap files.
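Since pyshark is only a wrapper around tshark, it can be worth confirming that tshark is findable before opening a capture. The following is a minimal sketch of such a check, assuming Python 3 (shutil.which is not available in 2.7). find_tshark is a hypothetical helper name, and the two fallback directories are simply the locations mentioned above.

```python
import os
import shutil


def find_tshark(extra_dirs=("/usr/local/bin", "/opt/local/bin")):
    """Return a path to the tshark executable, or None if it is not found.

    extra_dirs is an assumption for illustration: directories to try
    when tshark is not on the PATH.
    """
    # First ask the PATH, the same way a shell would
    found = shutil.which("tshark")
    if found:
        return found
    # Then fall back to the explicitly listed directories
    for directory in extra_dirs:
        candidate = os.path.join(directory, "tshark")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    print(find_tshark() or "tshark not found")
```

If tshark lives somewhere unusual, recent pyshark releases appear to accept a tshark_path argument to FileCapture; treat that as an assumption to verify against the installed version.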

thePacketGeek wrote a helpful series of articles on using pyshark, but it didn't go as deep into the details of SSL/TLS packets as I needed. So the first step was to determine how to access the data of interest in SSL/TLS client hello packets. I extracted a single representative client hello packet from the large capture file into a testing pcap file using Wireshark's "Export Specified Packets" option in the File menu, and used the interactive Python interpreter to see what was available:

$ /opt/local/bin/python2.7
Python 2.7.13 (default, Apr 25 2017, 11:00:18)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyshark
>>> cap = pyshark.FileCapture('client-hello.pcapng')
>>> dir(cap[0])
['__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__format__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_packet_string', 'captured_length', 'eth', 'frame_info', 'get_multiple_layers', 'highest_layer', 'interface_captured', 'ip', 'layers', 'length', 'number', 'pretty_print', 'sniff_time', 'sniff_timestamp', 'ssl', 'tcp', 'transport_layer']


"ssl" looks interesting:

>>> dir(cap[0].ssl)
['', 'DATA_LAYER', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__format__', '__getattr__', '__getattribute__', '__getstate__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_all_fields', '_field_prefix', '_get_all_field_lines', '_get_all_fields_with_alternates', '_get_field_or_layer_repr', '_get_field_repr', '_layer_name', '_sanitize_field_name', 'field_names', 'get', 'get_field', 'get_field_by_showname', 'get_field_value', 'handshake', 'handshake_cipher_suites_length', 'handshake_ciphersuite', 'handshake_ciphersuites', 'handshake_comp_method', 'handshake_comp_methods', 'handshake_comp_methods_length', 'handshake_extension_len', 'handshake_extension_type', 'handshake_extensions_ec_point_format', 'handshake_extensions_ec_point_formats_length', 'handshake_extensions_elliptic_curve', 'handshake_extensions_elliptic_curves', 'handshake_extensions_elliptic_curves_length', 'handshake_extensions_length', 'handshake_extensions_reneg_info_len', 'handshake_extensions_server_name', 'handshake_extensions_server_name_len', 'handshake_extensions_server_name_list_len', 'handshake_extensions_server_name_type', 'handshake_extensions_status_request_exts_len', 'handshake_extensions_status_request_responder_ids_len', 'handshake_extensions_status_request_type', 'handshake_length', 'handshake_random', 'handshake_random_time', 'handshake_session_id_length', 'handshake_sig_hash_alg', 'handshake_sig_hash_alg_len', 'handshake_sig_hash_algs', 'handshake_sig_hash_hash', 'handshake_sig_hash_sig', 'handshake_type', 'handshake_version', 'layer_name', 'pretty_print', 'raw_mode', 'record', 'record_content_type', 'record_length', 'record_version']

pyshark pulled out a large number of named elements from this packet. I'm interested in the client hello's extension where the server name indication lives, so "handshake_extensions_server_name" looks useful.

>>> cap[0].ssl.handshake_extensions_server_name
'www.bing.com'


It worked!
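The dir() output mixes decoded fields with Python internals. As a rough alternative, the field_names and get_field_value attributes that show up in the listing above can be combined to show just the decoded fields with their values. dump_layer is a hypothetical helper name, and this is a sketch rather than a documented recipe.

```python
def dump_layer(layer):
    """Return sorted (name, value) pairs for the fields decoded in a layer.

    layer is expected to behave like cap[0].ssl: it must offer a
    field_names list and a get_field_value(name) method.
    """
    return [(name, layer.get_field_value(name))
            for name in sorted(layer.field_names)]


# Example use in the interpreter session (not run here):
#   for name, value in dump_layer(cap[0].ssl):
#       print(name, value)
```

This makes it easier to spot which field holds the value of interest without guessing from the attribute names alone.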

Now we can use this in a Python script. Since not all packets in the capture are TLS client hellos with the Server Name Indication (SNI) extension, I wrapped the code in a try block to skip any packets that don't have the data I'm looking for, and call it from a loop over the filename(s) given on the command line:

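import sys

import pyshark


def process(fn):
    # keep_packets=False stops pyshark from retaining every packet in memory
    cap = pyshark.FileCapture(input_file=fn, keep_packets=False)
    for pkt in cap:
        try:
            # Only client hellos carrying the SNI extension have this field
            print(pkt.ssl.handshake_extensions_server_name)
        except AttributeError:
            # No ssl layer, or no server name in this packet: skip it
            pass


# Process every capture file named on the command line
for fn in sys.argv[1:]:
    process(fn)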


(My actual program is a little more complex, but this is the fundamental task.)
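One caveat for anyone reusing this: Wireshark 3.0 renamed the SSL dissector to TLS, so with newer tshark versions the layer shows up as "tls" rather than "ssl". The sketch below is one way to tolerate both names without a try block; server_name is a hypothetical helper, not part of pyshark.

```python
def server_name(pkt):
    """Return the SNI host name from a client hello packet, or None.

    Tries the newer "tls" layer name first, then the older "ssl" name.
    """
    for layer_name in ("tls", "ssl"):
        # getattr with a default avoids the AttributeError for absent layers
        layer = getattr(pkt, layer_name, None)
        if layer is None:
            continue
        name = getattr(layer, "handshake_extensions_server_name", None)
        if name is not None:
            return str(name)
    return None
```

Inside the loop, the body then becomes a plain "name = server_name(pkt)" followed by a check for None.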

This takes about 8 minutes to run through the hundreds of thousands of packets in a 125 MB pcapng file, but it saved the hours it would have taken to write an equivalent C++ program.
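Much of that time likely goes into handing every packet to Python, including the ones with no server name. FileCapture takes a display_filter argument, so tshark can discard the uninteresting packets first. The following is a sketch only and has not been timed against the capture above; sni_display_filter and open_sni_capture are hypothetical names, and the layer prefix is "ssl" to match this capture (pass "tls" for Wireshark 3.0 and later).

```python
def sni_display_filter(layer="ssl"):
    """Build a display filter matching packets that carry an SNI host name."""
    return "{0}.handshake.extensions_server_name".format(layer)


def open_sni_capture(fn, layer="ssl"):
    """Open fn so that only packets with a server name reach Python."""
    # Imported here so sni_display_filter can be used without pyshark
    import pyshark
    return pyshark.FileCapture(
        input_file=fn,
        keep_packets=False,
        display_filter=sni_display_filter(layer),
    )
```

With this in place, the loop in process() would iterate over open_sni_capture(fn) instead, and the try block becomes a safety net rather than the main filter.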