Windows Phone: performance of parsing different file types
Mood: chipper
Posted on 2014-01-09 22:26:00
Tags: windowsphone projects wpdev
Words: 1033
When I started to work on Baseball Odds I knew I was going to have to worry about performance - the data set I have for the win probability has right around 15000 records. So I thought it would be neat to compare different file formats and see how long each one took to read in. Each record has the inning number (with top or bottom), how many outs there are, what runners are on base, the score difference, the number of such situations, and the number of times the current team won. Here's a brief description of each format and some sample code:
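The snippets below add entries to a _fullData dictionary (and, for the binary version, a _compressedData one) whose declarations aren't shown; based on the Add calls, they presumably look something like this:
// Assumed declarations, inferred from the Add() calls below (the post doesn't show them).
// Key: (isHome, inning, outs, baserunners, runDiff) -> Value: (numSituations, numWins)
Dictionary<Tuple<bool, byte, byte, byte, sbyte>, Tuple<UInt32, UInt32>> _fullData =
    new Dictionary<Tuple<bool, byte, byte, byte, sbyte>, Tuple<UInt32, UInt32>>();

// Used by the binary version: the home flag/inning and outs/baserunners stay packed
// into two bytes, so the key is smaller.
Dictionary<Tuple<byte, byte, sbyte>, Tuple<uint, ushort>> _compressedData =
    new Dictionary<Tuple<byte, byte, sbyte>, Tuple<uint, ushort>>();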
Text:
This was actually the format I already had the data in, as it matched Phil Birnbaum's data file format. A sample line looks like this:
"H",1,0,1,0,81186,47975
and there are 15000 lines in the file. The code to parse this looks something like this:
const bool USE_AWAIT = false;
const bool CONFIGURE_AWAIT = false;
var resource = System.Windows.Application.GetResourceStream(
    new Uri(@"Data\winProbs.txt", UriKind.Relative));
using (StreamReader sr = new StreamReader(resource.Stream))
{
    string line;
    if (USE_AWAIT)
    {
        if (CONFIGURE_AWAIT)
        {
            line = await sr.ReadLineAsync().ConfigureAwait(false);
        }
        else
        {
            line = await sr.ReadLineAsync();
        }
    }
    else
    {
        line = sr.ReadLine();
    }
    while (line != null)
    {
        // Each line looks like: "H",1,0,1,0,81186,47975
        var parts = line.Split(',');
        bool isHome = (parts[0] == "\"H\"");
        _fullData.Add(new Tuple<bool, byte, byte, byte, sbyte>(
            isHome, byte.Parse(parts[1]), byte.Parse(parts[2]), byte.Parse(parts[3]),
            sbyte.Parse(parts[4])),
            new Tuple<UInt32, UInt32>(UInt32.Parse(parts[5]), UInt32.Parse(parts[6])));
        if (USE_AWAIT)
        {
            if (CONFIGURE_AWAIT)
            {
                line = await sr.ReadLineAsync().ConfigureAwait(false);
            }
            else
            {
                line = await sr.ReadLineAsync();
            }
        }
        else
        {
            line = sr.ReadLine();
        }
    }
}
(what are USE_AWAIT and CONFIGURE_AWAIT all about? See the results below...)
JSON:
To avoid having to write my own parsing code, I decided to write the data in a JSON format and use Json.NET to parse it. One line of the data file looks like this:
{isHome:1,inning:1,outs:0,baserunners:1,runDiff:0,numSituations:81186,numWins:47975}
This is admittedly a bit verbose, and it makes the file over a megabyte. The parsing code is simple, though:
var resource = System.Windows.Application.GetResourceStream(
    new Uri(@"Data\winProbs.json", UriKind.Relative));
using (StreamReader sr = new StreamReader(resource.Stream))
{
    string allDataString = await sr.ReadToEndAsync();
    JArray allDataArray = JArray.Parse(allDataString);
    for (int i = 0; i < allDataArray.Count; ++i)
    {
        JObject dataObj = (JObject)(allDataArray[i]);
        _fullData.Add(new Tuple<bool, byte, byte, byte, sbyte>(
            (int)dataObj["isHome"] == 1, (byte)dataObj["inning"],
            (byte)dataObj["outs"], (byte)dataObj["baserunners"], (sbyte)dataObj["runDiff"]),
            new Tuple<UInt32, UInt32>((UInt32)dataObj["numSituations"],
            (UInt32)dataObj["numWins"]));
    }
}
After I posted this, Martin Suchan pointed out that using JsonConvert might be faster, and even wrote some code to try it out.
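I don't have Martin's exact code, but a sketch of the JsonConvert approach might look like this - the Situation class and its property names are my assumptions, chosen to match the JSON above:
// Hypothetical DTO whose property names match the JSON above (not from the original post).
class Situation
{
    public int isHome { get; set; }
    public byte inning { get; set; }
    public byte outs { get; set; }
    public byte baserunners { get; set; }
    public sbyte runDiff { get; set; }
    public UInt32 numSituations { get; set; }
    public UInt32 numWins { get; set; }
}

var resource = System.Windows.Application.GetResourceStream(
    new Uri(@"Data\winProbs.json", UriKind.Relative));
using (StreamReader sr = new StreamReader(resource.Stream))
{
    string allDataString = await sr.ReadToEndAsync();
    // Deserialize the whole array into typed objects in one call
    // instead of walking JObjects one at a time.
    List<Situation> allData = JsonConvert.DeserializeObject<List<Situation>>(allDataString);
    foreach (Situation s in allData)
    {
        _fullData.Add(new Tuple<bool, byte, byte, byte, sbyte>(
            s.isHome == 1, s.inning, s.outs, s.baserunners, s.runDiff),
            new Tuple<UInt32, UInt32>(s.numSituations, s.numWins));
    }
}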
Binary:
To try to get the file to be as small as possible (which I suspected correlated with parsing time), I converted the file to a custom binary format. Here's my textual description of the format:
UInt32 = total num records
UInt32 = num of records that have UInt32 for num situations (these come first)
each record is:
    UInt8 = high bit = visitor=0, home=1; rest is inning (1-26)
    UInt8 = high 2 bits = num outs (0-2); rest is baserunners (1-8)
    Int8 = score diff (-26 to 27)
    UInt32/UInt16 = num situations
    UInt16 = num of wins
To produce the file in this format, I had to write a Windows 8 app that read in the text file and wrote out the binary version using a BinaryWriter with the Write(Byte), etc. methods (a rough sketch of that conversion step appears after the parsing code below). Here's the parsing code:
var resource = System.Windows.Application.GetResourceStream(
    new Uri(@"Data\winProbs.bin", UriKind.Relative));
using (var br = new System.IO.BinaryReader(resource.Stream))
{
    UInt32 totalRecords = br.ReadUInt32();
    UInt32 recordsWithUInt32 = br.ReadUInt32();
    for (UInt32 i = 0; i < totalRecords; ++i)
    {
        // These two bytes stay packed (home/visitor bit + inning, outs + baserunners)
        // and are used directly as part of the dictionary key.
        byte inning = br.ReadByte();
        byte outsRunners = br.ReadByte();
        sbyte scoreDiff = br.ReadSByte();
        // The first recordsWithUInt32 records store num situations as a UInt32; the rest fit in a UInt16.
        UInt32 numSituations = (i < recordsWithUInt32) ? br.ReadUInt32() : br.ReadUInt16();
        UInt16 numWins = br.ReadUInt16();
        _compressedData.Add(new Tuple<byte, byte, sbyte>(inning, outsRunners, scoreDiff),
            new Tuple<uint, ushort>(numSituations, numWins));
    }
}
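The conversion app itself isn't shown in the post; a rough sketch of the writing side, following the format description above, might look something like this. The Record class and method name are placeholders, and it assumes the records are already sorted so that the ones needing a UInt32 for num situations come first:
// Hypothetical types/names for the conversion step - not from the original post.
class Record
{
    public bool IsHome;
    public byte Inning;        // 1-26
    public byte Outs;          // 0-2
    public byte Baserunners;   // 1-8
    public sbyte RunDiff;      // -26 to 27
    public UInt32 NumSituations;
    public UInt16 NumWins;
}

void WriteBinaryFile(System.IO.Stream output, IList<Record> records)
{
    using (var bw = new System.IO.BinaryWriter(output))
    {
        // Count how many records need a full UInt32 for num situations;
        // those must already be at the front of the list.
        UInt32 numLarge = 0;
        foreach (Record r in records)
        {
            if (r.NumSituations > UInt16.MaxValue)
                ++numLarge;
        }
        bw.Write((UInt32)records.Count);   // total num records
        bw.Write(numLarge);                // num of records that use UInt32 for num situations
        for (int i = 0; i < records.Count; ++i)
        {
            Record r = records[i];
            // high bit = home (1) / visitor (0), rest is the inning
            bw.Write((byte)((r.IsHome ? 0x80 : 0x00) | r.Inning));
            // high 2 bits = num outs, rest is the baserunner state
            bw.Write((byte)((r.Outs << 6) | r.Baserunners));
            bw.Write(r.RunDiff);
            if (i < numLarge)
                bw.Write(r.NumSituations);
            else
                bw.Write((UInt16)r.NumSituations);
            bw.Write(r.NumWins);
        }
    }
}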
Results:
Without further ado, here are the file sizes and how long the files took to read and parse (running on my Lumia 1020):
Type | File size | Time to parse |
---|---|---|
Text (USE_AWAIT=true, CONFIGURE_AWAIT=false) | 278 KB | 4.8 secs |
Text (USE_AWAIT=true, CONFIGURE_AWAIT=true) | 278 KB | 0.4 secs |
Text (USE_AWAIT=false) | 278 KB | 0.4 secs |
JSON (parsing one at a time) | 1200 KB | 3.2 secs |
JSON (using JsonConvert) | 1200 KB | 1.3 secs |
Binary | 103 KB | 0.15 secs |
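The post doesn't show how the timings were taken; presumably something along these lines, with ParseTextFileAsync standing in for each of the parsing methods above:
// ParseTextFileAsync is a hypothetical wrapper around one of the parsing snippets above.
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
await ParseTextFileAsync();
stopwatch.Stop();
System.Diagnostics.Debug.WriteLine("Parsed in {0:0.00} secs", stopwatch.Elapsed.TotalSeconds);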
A few takeaways:
- Be careful about awaiting 15000 calls, as this increased the time to parse the text file from 0.4 secs to 4.8 secs! Not hugely surprising, but something to keep in mind - if a call is going to be very short and you're going to be doing it many times, try not awaiting it (one way to avoid that is sketched at the end of this post). If you do want to use await, you can call ConfigureAwait(false) to not force the continuation back into its original context - this seems to almost entirely eliminate the overhead. For more information, I'd recommend the article Best Practices in Asynchronous Programming.
- Using JsonConvert cut down on the time significantly. I had always avoided doing that because of the pain of declaring a class, but I'll definitely do it in the future!
- The binary format was both the smallest file and by far the fastest to parse, although the plain text version was still pretty quick (keeping the await caveat in mind).
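For what it's worth, another way to keep a single await while avoiding the per-line overhead (my sketch, not something from the original post) is to read the whole file with one ReadToEndAsync and split it synchronously:
// Do one awaited read of the whole file, then split and parse the lines synchronously.
var resource = System.Windows.Application.GetResourceStream(
    new Uri(@"Data\winProbs.txt", UriKind.Relative));
using (StreamReader sr = new StreamReader(resource.Stream))
{
    string allText = await sr.ReadToEndAsync();
    foreach (string line in allText.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
    {
        var parts = line.Split(',');
        bool isHome = (parts[0] == "\"H\"");
        _fullData.Add(new Tuple<bool, byte, byte, byte, sbyte>(
            isHome, byte.Parse(parts[1]), byte.Parse(parts[2]), byte.Parse(parts[3]),
            sbyte.Parse(parts[4])),
            new Tuple<UInt32, UInt32>(UInt32.Parse(parts[5]), UInt32.Parse(parts[6])));
    }
}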