Multi file I/O

pcap

When files grow to infinity

It has happened to me and probably to you as well: somebody hands you 20 TByte of pcaps. You start multiple T2 instances as background processes, and after 7 TByte something goes wrong and you have to start all over again. Grrrrr.

Although T2 has no problem with huge pcap files, they are a nuisance, as you will probably agree. But what if you split them up and end up with 2000 files of 10 GByte each? Don’t worry, the anteater can handle that.

Now suppose you wrote the most sophisticated and ingenious online post-processing of your flow file, and suddenly you run out of disk space. Bummer! Especially if you are only interested in a certain time span or selection of traffic, you may want to split the resulting flow files into a more manageable size.

And what if the pcaps are copied to your computer by some obscure process, and you don’t want T2 to time out when he runs out of food? He should wait for the new ones to arrive while preserving his internal state. The polling mode comes to the rescue.

Preparation

First, restore T2 into a pristine state by removing all unnecessary or older plugins from the plugin folder ~/.tranalyzer/plugins:

t2build -e -y

Are you sure you want to empty the plugin folder '/home/wurst/.tranalyzer/plugins' (y/N)? yes
Plugin folder emptied

Then compile the core (tranalyzer2) and the following plugins:

t2build tranalyzer2 basicFlow basicStats tcpStates txtSink

...
BUILD SUCCESSFUL

If you have not created separate data and results directories yet, please do so now in another bash window; this facilitates your workflow:

mkdir ~/data ~/results

The sample PCAP used in this tutorial can be downloaded here: annoloc2.pcap.

Please save it in your ~/data folder.

Now you are all set!

PCAP fragmentation

Now fragment the PCAP file with tcpdump into a sequence of 10 MB pcaps and with editcap into pcaps of 100000 packets each, so that we can test different filename formats.

mkdir ~/data/S

tcpdump -r ~/data/annoloc2.pcap -w ~/data/S/annoloc2S.pcap -C 10

ls ~/data/S

annoloc2S.pcap  annoloc2S.pcap1  annoloc2S.pcap2  annoloc2S.pcap3  annoloc2S.pcap4  annoloc2S.pcap5  annoloc2S.pcap6  annoloc2S.pcap7  annoloc2S.pcap8

mkdir ~/data/T

editcap -c 100000 ~/data/annoloc2.pcap ~/data/T/annoloc2T.pcap

ls ~/data/T

annoloc2T_00000_20020523183501.pcap  annoloc2T_00003_20020523183507.pcap  annoloc2T_00006_20020523183514.pcap  annoloc2T_00009_20020523183520.pcap  annoloc2T_00012_20020523183526.pcap
annoloc2T_00001_20020523183503.pcap  annoloc2T_00004_20020523183509.pcap  annoloc2T_00007_20020523183516.pcap  annoloc2T_00010_20020523183522.pcap
annoloc2T_00002_20020523183505.pcap  annoloc2T_00005_20020523183511.pcap  annoloc2T_00008_20020523183518.pcap  annoloc2T_00011_20020523183524.pcap

Now you are ready for some kung-fu reading.

Read from several defined pcaps in a row

Assume you have a lot of files that are not conveniently numbered as in our case, but named, e.g., in time sequence over months and years. Then you can use the -R option, which takes a file containing a list of pcaps.

t2 -R PCAPLIST -w outputfile

It processes all the pcap files listed in PCAPLIST. T2 keeps its internal state during the file change, thus all pcaps are treated as one large pcap.

The processing order is defined by the position of the filenames in the text file, so no sequential numbering is necessary. However, absolute paths have to be specified. To generate the PCAPLIST you may use the commands below.

ls ~/data/S/annoloc2S* | sort > ~/data/pcap_Slist.txt

cat ~/data/pcap_Slist.txt

# List of PCAP files to process
/home/wurst/data/S/annoloc2S.pcap
/home/wurst/data/S/annoloc2S.pcap1
/home/wurst/data/S/annoloc2S.pcap2
/home/wurst/data/S/annoloc2S.pcap3
/home/wurst/data/S/annoloc2S.pcap4
/home/wurst/data/S/annoloc2S.pcap5
/home/wurst/data/S/annoloc2S.pcap6
/home/wurst/data/S/annoloc2S.pcap7
/home/wurst/data/S/annoloc2S.pcap8

Lines starting with # are considered as comments and thus ignored by T2. An easier way is to use the t2caplist script to generate such a list.

t2caplist -h

Usage:
    t2caplist [OPTION...] <FILE|DIR>

Optional arguments:
    -d depth          List pcaps up to the given depth
    -L                Follow symbolic links
    -r                List pcaps recursively
    -R                Sort the list in reverse order
    -z                Sort the list by file size
    -s                Do not sort the list
    -v                Report invalid files to stderr

Help and documentation arguments:
    -h, -?, --help    Show this help, then exit

It can even follow symbolic links and sort the files, but here we just generate a list and see what happens. So start t2 using the -R option on the generated ~/data/pcap_Slist.txt file:

t2 -R ~/data/pcap_Slist.txt -w ~/results/S/

First, T2 checks whether all files exist and are sound. Then he processes the pcaps listed in ~/data/pcap_Slist.txt one after the other and terminates with a standard end report.
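One thing to watch when building the list with ls | sort: once there are more than ten fragments, a lexicographic sort places annoloc2S.pcap10 before annoloc2S.pcap2. The following is a minimal sketch, not part of Tranalyzer2, that orders the files by their trailing index instead. The helper name list_pcaps_numeric is made up for this example, and it assumes the tcpdump naming scheme (PREFIX, PREFIX1, PREFIX2, ...).

```shell
#!/usr/bin/env bash
# Sketch: list capture fragments in numeric index order.
# Assumes tcpdump-style names: PREFIX, PREFIX1, PREFIX2, ...
# list_pcaps_numeric is a hypothetical helper, not a Tranalyzer2 tool.
list_pcaps_numeric() {
    local prefix=$1 f idx
    for f in "$prefix"*; do
        [ -f "$f" ] || continue               # skip unmatched globs and directories
        idx=${f##*[!0-9]}                     # trailing digits, empty if there are none
        printf '%s\t%s\n' "${idx:--1}" "$f"   # the un-numbered file gets key -1 (first)
    done | sort -t $'\t' -n -k1,1 | cut -f2-
}

# Possible use:
#   list_pcaps_numeric "$HOME/data/S/annoloc2S.pcap" > "$HOME/data/pcap_Slist.txt"
```

Because the paths are built from the prefix you pass in, handing over an absolute prefix yields the absolute paths that -R expects.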

Read from a sequence of pcaps

Imagine you have a humongous amount of pcaps to process, and lucky you, they are produced with an index in the file name. Then the -D option is the way to go.

The -D option demands a FILEPREFIX, which may contain the wildcard *. If there is an extension, you have to specify it. The general syntax is shown below:

t2 -D FILEPREFIX[#Start][*][.ext][#Start][:SCHR][,#Stop]

Here, #Start denotes the start index, located either inside the file name or after it, and #Stop denotes the stop index. If the former is omitted, T2 starts at 0 or assumes there is no number. If you omit the latter, T2 will wait for the next pcap when he runs out of food.

SCHR denotes the search characters that tell T2 where to find the #Start number in an arbitrary file name. It can contain up to three characters. By default SCHR is set to p, as defined in tranalyzer.h. Open the latter and search for // -D option parameters:

tranalyzer2

vi src/tranalyzer.h

...
// -D option parameters
#define RROP      0    // round robin operation
#define POLLTM    5    // poll timing in sec for files
#define MFPTMOUT  0    // > 0: timeout in sec for poll timing > POLLTM, 0: no poll timeout
#define SCHR     'p'   // separating char for number (refer to the doc for examples)
...

POLLTM denotes the interval at which T2 checks whether the next missing file is available in his data directory. If a file index is missing, aka no more food for the anteater, he will wait and poll every POLLTM seconds. This and the other constants are discussed under polling timeout.

We chose 'p' as the default because tcpdump adds the index at the end of the file name, behind the pcap extension, i.e., out.pcapNUM. Nevertheless, t2 also covers the more complicated editcap filename format.

The following table summarizes the supported naming patterns and the configuration required. Note the quotes ("), which are necessary to prevent the shell from interpreting special characters such as *.

Filenames                                                   Command
out, out1, out2, …                                          t2 -D out:t
out.pcap, out.pcap1, out.pcap2, …                           t2 -D out.pcap
out.pcap, out.pcap01, out.pcap02, …                         t2 -D out.pcap00
out.pcap, out1.pcap, out2.pcap, …                           t2 -D "out*.pcap:t"
out0.pcap, out1.pcap, out2.pcap, …                          t2 -D out0.pcap:t
out00.pcap, out01.pcap, out02.pcap, …                       t2 -D out00.pcap:t
out_00_Wurst.pcap, out_01_Nudel.pcap, out_02_Knoedel.pcap   t2 -D "out_00_*.pcap:t_,2"
out_24.4.20h00.pcap, out_24.4.2016.20h00.pcap1, …           t2 -D "out*.pcap"
out_24.4.20h00.pcap00, out_24.04.20h00.pcap01, …            t2 -D "out*.pcap00"
out0.pcap, out1.pcap, out2.pcap, …                          t2 -D out0.pcap:t
out.pcap00, out.pcap01, out.pcap02, …                       t2 -D out.pcap00

So if you want to process all files in the tcpdump split format from index 2 to 4:

t2 -D "$HOME/data/S/annoloc2S.pcap2,4" -w ~/results/S/

The same for the editcap format. Note again the compulsory quotes, which keep the shell from expanding the *. Since bash does not expand ~ inside double quotes, $HOME is used for the quoted path:

t2 -D "$HOME/data/T/annoloc2T_00002_*.pcap:T_,4" -w ~/results/T/

The end reports differ because the fragments of tcpdump and editcap are different.

Polling timeout

If T2 runs out of files, the default behavior of the -D option is to wait for the next file. So you could leave him running somewhere, lurking for more food until you copy the next pcap into his bowl. Try this:

t2 -D ~/data/S/annoloc2S.pcap -w ~/results/S/

================================================================================
Tranalyzer 0.8.14 (Anteater), Tarantula. PID: 48769
================================================================================
[INF] Creating flows for L2, IPv4, IPv6
Active plugins:
    01: basicFlow, 0.8.14
    02: basicStats, 0.8.14
    03: tcpStates, 0.8.14
    04: txtSink, 0.8.14
[INF] IPv4 Ver: 5, Rev: 16122020, Range Mode: 0, subnet ranges loaded: 406105 (406.11 K)
[INF] IPv6 Ver: 5, Rev: 17122020, Range Mode: 0, subnet ranges loaded: 51345 (51.34 K)
Processing file: /home/wurst/data/S/annoloc2S.pcap
Link layer type: Ethernet [EN10MB/1]
Dump start: 1022171701.691172 sec (Thu 23 May 2002 16:35:01 GMT)
[WRN] snapL2Length: 54 - snapL3Length: 40 - IP length in header: 1500
Processing file: /home/wurst/data/S/annoloc2S.pcap1
Processing file: /home/wurst/data/S/annoloc2S.pcap2
Processing file: /home/wurst/data/S/annoloc2S.pcap3
Processing file: /home/wurst/data/S/annoloc2S.pcap4
Processing file: /home/wurst/data/S/annoloc2S.pcap5
Processing file: /home/wurst/data/S/annoloc2S.pcap6
Processing file: /home/wurst/data/S/annoloc2S.pcap7
Processing file: /home/wurst/data/S/annoloc2S.pcap8
Processing file: /home/wurst/data/S/annoloc2S.pcap9
...........Processing file: /home/wurst/data/S/annoloc2S.pcap9
...........

Now open another bash window and copy annoloc2S.pcap to annoloc2S.pcap9. It does not make sense, but it helps to demonstrate t2’s reaction.

cd ~/data/S

cp annoloc2S.pcap annoloc2S.pcap9

In the T2 window you will suddenly see that he grabs the new file, processes it and waits for the next victim. Now imagine that No 9 is missing: T2 then waits forever, even if additional pcaps with a higher index are copied into his data folder. Sometimes No 9 never comes and brings everything to a sudden halt. For certain overall statistical analyses or for monitoring, it is preferable to skip the missing file and move on. For that purpose T2 implements a poll timeout constant, MFPTMOUT. It defines the number of seconds after which T2 moves on to the next file index.
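To make the interplay of the poll interval and the poll timeout more tangible, here is a small bash illustration of the idea. It is only a sketch of the behavior described above, not T2's actual implementation. The function name wait_for_file and its arguments are invented for this example; the second and third arguments play the roles of POLLTM and MFPTMOUT.

```shell
#!/usr/bin/env bash
# Sketch of the polling idea, NOT T2's real code.
# wait_for_file PATH [POLL] [TIMEOUT]
#   POLL    seconds between checks (plays the role of POLLTM, default 5)
#   TIMEOUT give up after this many seconds (role of MFPTMOUT); 0 = wait forever
# Returns 0 once PATH exists, 1 if the timeout expired first.
wait_for_file() {
    local path=$1 poll=${2:-5} timeout=${3:-0} waited=0
    while [ ! -f "$path" ]; do
        if [ "$timeout" -gt 0 ] && [ "$waited" -ge "$timeout" ]; then
            return 1                      # caller may skip to the next index
        fi
        sleep "$poll"
        waited=$((waited + poll))
    done
    return 0
}

# Possible use: wait_for_file "$HOME/data/S/annoloc2S.pcap9" 5 10 || echo "skipping No 9"
```

With a timeout of 0 the loop never gives up, which mirrors the default of waiting for the missing file indefinitely.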

Terminate t2 now with ^C^C: you get the end report, and all flows that have not terminated so far are unloaded into the flow file.

tranalyzer2

vi src/tranalyzer.h

// -D option parameters
#define RROP      0    // round robin operation
#define POLLTM    5    // poll timing in sec for files
#define MFPTMOUT  0    // > 0: timeout in sec for poll timing > POLLTM, 0: no poll timeout
#define SCHR     'p'   // separating char for number (refer to the doc for examples)

So rename annoloc2S.pcap9 to annoloc2S.pcap10, so that we have a gap.

cd ~/data/S

mv annoloc2S.pcap9 annoloc2S.pcap10

Then set the poll timeout to 10 seconds, so that T2 waits that long for No 9 to arrive and otherwise moves on to No 10. Recompile and rerun T2 on the same pcap.

t2conf tranalyzer2 -D MFPTMOUT=10 && t2build tranalyzer2

t2 -D ~/data/S/annoloc2S.pcap -w ~/results/S/

================================================================================
Tranalyzer 0.8.14 (Anteater), Tarantula. PID: 49073
================================================================================
[INF] Creating flows for L2, IPv4, IPv6
Active plugins:
    01: basicFlow, 0.8.14
    02: basicStats, 0.8.14
    03: tcpStates, 0.8.14
    04: txtSink, 0.8.14
[INF] IPv4 Ver: 5, Rev: 16122020, Range Mode: 0, subnet ranges loaded: 406105 (406.11 K)
[INF] IPv6 Ver: 5, Rev: 17122020, Range Mode: 0, subnet ranges loaded: 51345 (51.34 K)
Processing file: /home/wurst/data/S/annoloc2S.pcap
Link layer type: Ethernet [EN10MB/1]
Dump start: 1022171701.691172 sec (Thu 23 May 2002 16:35:01 GMT)
[WRN] snapL2Length: 54 - snapL3Length: 40 - IP length in header: 1500
Processing file: /home/wurst/data/S/annoloc2S.pcap1
Processing file: /home/wurst/data/S/annoloc2S.pcap2
Processing file: /home/wurst/data/S/annoloc2S.pcap3
Processing file: /home/wurst/data/S/annoloc2S.pcap4
Processing file: /home/wurst/data/S/annoloc2S.pcap5
Processing file: /home/wurst/data/S/annoloc2S.pcap6
Processing file: /home/wurst/data/S/annoloc2S.pcap7
Processing file: /home/wurst/data/S/annoloc2S.pcap8
.....Processing file: /home/wurst/data/S/annoloc2S.pcap10
...........
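Before leaving a long -D run unattended, it can be handy to know where the gaps in the sequence are. The sketch below prints the missing indexes of a tcpdump-style sequence. The helper find_gaps is hypothetical and assumes that index 0 is the file without a number, as in the listing above.

```shell
#!/usr/bin/env bash
# Sketch: report missing indexes in a tcpdump-style sequence
# (PREFIX, PREFIX1, ..., PREFIX<LAST>). find_gaps is a hypothetical helper.
find_gaps() {
    local prefix=$1 last=$2 i
    [ -f "$prefix" ] || echo 0            # index 0 carries no number
    for ((i = 1; i <= last; i++)); do
        [ -f "$prefix$i" ] || echo "$i"   # print every index without a file
    done
}

# Possible use: find_gaps "$HOME/data/S/annoloc2S.pcap" 10
```

For the directory state used in this section (fragments up to 8, then 10), the call in the comment would report index 9.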

Round robin operation

In order to automate the flow file post-processing and to conserve disk space, a round robin approach is very helpful. The round robin rollover index should be adapted to the post-processing speed and the size of the fragments. As a test, switch on RROP, set the rollover index to 8 on the command line and reset the polling timeout, as we do not need it for the following demonstration:

t2conf tranalyzer2 -D RROP=1 -D MFPTMOUT=0 && t2build tranalyzer2

t2 -D ~/data/S/annoloc2S.pcap,8 -w ~/results/S/

================================================================================
Tranalyzer 0.8.14 (Anteater), Tarantula. PID: 49401
================================================================================
[INF] Creating flows for L2, IPv4, IPv6
Active plugins:
    01: basicFlow, 0.8.14
    02: basicStats, 0.8.14
    03: tcpStates, 0.8.14
    04: txtSink, 0.8.14
[INF] IPv4 Ver: 5, Rev: 16122020, Range Mode: 0, subnet ranges loaded: 406105 (406.11 K)
[INF] IPv6 Ver: 5, Rev: 17122020, Range Mode: 0, subnet ranges loaded: 51345 (51.34 K)
Processing file: /home/wurst/data/S/annoloc2S.pcap
Link layer type: Ethernet [EN10MB/1]
Dump start: 1022171701.691172 sec (Thu 23 May 2002 16:35:01 GMT)
[WRN] snapL2Length: 54 - snapL3Length: 40 - IP length in header: 1500
Processing file: /home/wurst/data/S/annoloc2S.pcap1
Processing file: /home/wurst/data/S/annoloc2S.pcap2
Processing file: /home/wurst/data/S/annoloc2S.pcap3
Processing file: /home/wurst/data/S/annoloc2S.pcap4
Processing file: /home/wurst/data/S/annoloc2S.pcap5
Processing file: /home/wurst/data/S/annoloc2S.pcap6
Processing file: /home/wurst/data/S/annoloc2S.pcap7
Processing file: /home/wurst/data/S/annoloc2S.pcap8
^C[INF] SIGINT: Stop flow creation: 0x0002
Processing file: /home/wurst/data/S/annoloc2S.pcap
Processing file: /home/wurst/data/S/annoloc2S.pcap1
Processing file: /home/wurst/data/S/annoloc2S.pcap2
Processing file: /home/wurst/data/S/annoloc2S.pcap3
Processing file: /home/wurst/data/S/annoloc2S.pcap4
Processing file: /home/wurst/data/S/annoloc2S.pcap5
Processing file: /home/wurst/data/S/annoloc2S.pcap6
Processing file: /home/wurst/data/S/annoloc2S.pcap7
Processing file: /home/wurst/data/S/annoloc2S.pcap8
Processing file: /home/wurst/data/S/annoloc2S.pcap
Processing file: /home/wurst/data/S/annoloc2S.pcap1
Processing file: /home/wurst/data/S/annoloc2S.pcap2
Processing file: /home/wurst/data/S/annoloc2S.pcap3
^C[INF] SIGINT: Stop flow creation: 0x0001
Dump stop : 1022171713.457599 sec (Thu 23 May 2002 16:35:13 GMT)
Total dump duration: 11.766427 sec
Finished processing. Elapsed time: 1.411131 sec
Finished unloading flow memory. Time: 1.609616 sec
...

Interrupt it with ^C^C or send a t2stat -TERM command from another bash window.
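The order in which the files are visited in the output above can be reproduced with a few lines of bash. This is merely a sketch of the wrap-around, assuming the index returns to 0 after the stop index and that index 0 is the file without a number, as the log suggests. rr_name is a made-up helper.

```shell
#!/usr/bin/env bash
# Sketch of the round robin wrap-around seen in the log above.
# rr_name BASE STOP N prints the N-th file visited (N starts at 0),
# assuming the index wraps to 0 after STOP and index 0 has no number.
rr_name() {
    local base=$1 stop=$2 n=$3
    local idx=$(( n % (stop + 1) ))
    if [ "$idx" -eq 0 ]; then
        echo "$base"
    else
        echo "$base$idx"
    fi
}

# Possible use: for n in 7 8 9 10; do rr_name annoloc2S.pcap 8 "$n"; done
```

A producer that writes the fragments would need to follow the same wrap-around, so that it overwrites a file only after T2 has consumed it.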

Split output files

As with pcaps, you can split flow files into smaller chunks, measured either in bytes or in number of flows. The general command line option is defined as follows:

t2 -W PREFIX[:SIZE][,START]

The expression before the : defines the output file name prefix; the expression following it denotes the maximum file size of each fragment. If omitted, it defaults to OFRWFILELN, defined in tranalyzer.h:

tranalyzer2

vi src/tranalyzer.h

// -W option parameters
#define OFRWFILELN 5E8 // default fragmented output file length (500MB)

START defines the index of the first file generated. If omitted it defaults to 0.

The SIZE of the files can be specified in bytes (default), KB (K), MB (M) or GB (G). Scientific notation, e.g., 1e5 or 1E5 (=100000), can be used as well. If no size is specified, then the : can be omitted.

If an f is appended, the unit is the flow count; hence, file chunks containing the same number of flows are produced. Some typical examples are shown below.

Command                                                Fragment size  Start index  Output files
t2 -r ~/data/annoloc2.pcap -W ~/results/out:1.5E9,10   1.5 GB         10           out10, out11, …
t2 -r ~/data/annoloc2.pcap -W ~/results/out:1.5e9,5    1.5 GB         5            out5, out6, …
t2 -r ~/data/annoloc2.pcap -W ~/results/out:1.5G,1     1.5 GB         1            out1, out2, …
t2 -r ~/data/annoloc2.pcap -W ~/results/out:5000K      5 MB           0            out0, out1, …
t2 -r ~/data/annoloc2.pcap -W ~/results/out:5Kf        5000 flows     0            out0, out1, …
t2 -r ~/data/annoloc2.pcap -W ~/results/out:2.5G       2.5 GB         0            out0, out1, …
t2 -r ~/data/annoloc2.pcap -W ~/results/out,6          OFRWFILELN     6            out6, out7, …
t2 -r ~/data/annoloc2.pcap -W ~/results/out            OFRWFILELN     0            out0, out1, …

Try them out and see what happens. Although round robin mode is useful in production, it is advisable to reset it after the last chapter; otherwise you end up in a loop with files constantly being overwritten.

t2conf tranalyzer2 -D RROP=0 && t2build tranalyzer2
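If you want to double-check what a SIZE expression of the -W option amounts to before starting a long run, a tiny converter can help. The sketch below is not T2's parser. It assumes decimal multipliers (K = 1000, M = 10^6, G = 10^9), which is what the 5Kf = 5000 flows example suggests, and size_to_count is a made-up name.

```shell
#!/usr/bin/env bash
# Sketch: translate a -W SIZE expression into a plain count.
# NOT T2's parser; decimal multipliers (K = 1000) are an assumption.
# size_to_count SPEC prints "<number> bytes" or "<number> flows".
size_to_count() {
    local spec=$1 unit=bytes mult=1
    if [[ $spec == *f ]]; then            # trailing f switches the unit to flows
        unit=flows
        spec=${spec%f}
    fi
    case $spec in
        *K) mult=1000;       spec=${spec%K} ;;
        *M) mult=1000000;    spec=${spec%M} ;;
        *G) mult=1000000000; spec=${spec%G} ;;
    esac
    # awk does the floating point work, so 1.5 and 1.5E9 are both accepted
    LC_ALL=C awk -v v="$spec" -v m="$mult" -v u="$unit" \
        'BEGIN { printf "%.0f %s\n", v * m, u }'
}

# Possible use: size_to_count 1.5G; size_to_count 5Kf
```

The function only covers the suffixes listed above; an empty SIZE, which falls back to OFRWFILELN, is not handled.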

A prominent application in production environments is a combination of the -D and -W options as shown below, with at most 1000 flows per file and with the devil's start index 666:

t2 -D ~/data/S/annoloc2S.pcap,8 -W ~/results/F/:1000f,666

ls ~/results/F

annoloc2S_flows.txt666  annoloc2S_flows.txt669  annoloc2S_flows.txt672  annoloc2S_flows.txt675  annoloc2S_flows.txt678  annoloc2S_flows.txt681  annoloc2S_headers.txt
annoloc2S_flows.txt667  annoloc2S_flows.txt670  annoloc2S_flows.txt673  annoloc2S_flows.txt676  annoloc2S_flows.txt679  annoloc2S_flows.txt682
annoloc2S_flows.txt668  annoloc2S_flows.txt671  annoloc2S_flows.txt674  annoloc2S_flows.txt677  annoloc2S_flows.txt680  annoloc2S_flows.txt683
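For the post-processing of such chunks while T2 is still running, you probably want to touch only the files that are complete. A minimal sketch follows, under the assumption that the chunk with the highest index is the one T2 may still be writing to. list_complete_chunks is a hypothetical helper, and the numeric sort matters once the index grows by a digit (e.g. 999 to 1000).

```shell
#!/usr/bin/env bash
# Sketch: list finished flow file chunks in index order.
# Assumption: the chunk with the highest index may still be written by T2,
# so it is left out. list_complete_chunks is a hypothetical helper.
list_complete_chunks() {
    local base=$1 f idx
    for f in "$base"[0-9]*; do
        [ -f "$f" ] || continue           # skip an unmatched glob
        idx=${f##*[!0-9]}                 # trailing digits = chunk index
        [ -n "$idx" ] || continue         # ignore names not ending in a number
        printf '%s\t%s\n' "$idx" "$f"
    done | sort -t $'\t' -n -k1,1 | sed '$d' | cut -f2-   # sed drops the newest chunk
}

# Possible use:
#   list_complete_chunks "$HOME/results/F/annoloc2S_flows.txt" |
#       while IFS= read -r f; do wc -l "$f"; done
```

Replacing wc -l with your own post-processing, followed by removing the chunk, keeps the disk usage bounded.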

How to process several different files

Often a multitude of different pcaps, uncorrelated in time and source, have to be processed in the background. For that, you had better write a script yourself. Here is an example:

#!/usr/bin/env bash
# Run T2 separately on $HOME/data/<filename><index>.<extension> for each index.

if [ $# -ne 4 ]; then
    echo "Usage: $0 filename extension startIndex endIndex" >&2
    exit 1
fi

NAME=$1
EXT=$2
START=$3
END=$4

for ((i = START; i <= END; i++)); do
    rfile="$HOME/data/$NAME$i.$EXT"
    wfile="$HOME/results/$NAME$i"
    if [ -f "$rfile" ]; then
        echo "Processing '$rfile', writing to '$wfile'"
        t2 -r "$rfile" -w "$wfile"
    else
        echo "Skipping '$rfile': file not found" >&2
    fi
done
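The script above works through the files one after the other. If your machine has cores to spare, the independent pcaps can also be processed by several T2 instances at once. Here is one possible sketch using xargs -P. The names process_all and T2_CMD are invented for this example; T2_CMD exists only so that you can set it to echo for a dry run.

```shell
#!/usr/bin/env bash
# Sketch: run one T2 instance per pcap with a bounded number of parallel jobs.
# process_all and T2_CMD are hypothetical names; use T2_CMD=echo for a dry run.
T2_CMD=${T2_CMD:-t2}

# process_all JOBS OUTDIR PCAP...
process_all() {
    local jobs=$1 outdir=$2
    shift 2
    [ $# -gt 0 ] || return 0              # nothing to do
    export T2_CMD outdir                  # make both visible to the child shells
    printf '%s\0' "$@" |
        xargs -0 -n 1 -P "$jobs" sh -c '
            name=$(basename "$1")
            "$T2_CMD" -r "$1" -w "$outdir/${name%.*}"
        ' sh
}

# Possible use: process_all 4 "$HOME/results" "$HOME"/data/*.pcap
```

Each instance writes to its own output prefix, derived from the pcap name without its extension, so the results do not overwrite each other.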

Conclusion

Make sure that the polling timeout and round robin mode are reset for the following tutorials, if not already done earlier.

t2conf --reset tranalyzer2

Have fun and may the anteater be with you!