My hobbyist coding updates and releases as the mysterious "Mr. Tines"

Monday 27 April 2009

ZLIB and .Net

A recurring problem in several of the projects I have in the pipeline is the matter of handling ZLIB. Java, through java.util.zip offers ZLIB compression with a 32k byte window (but no means of tuning the window) with the DeflaterOutputStream. The .Net framework doesn't offer direct ZLIB at all, but provides naked Deflate via System.IO.Compression.DeflateStream.

That gives us enough to be able to reflate the output of a ZLIB deflation, since a ZLIB is a 2 byte header, a deflate section and finally a 4 byte checksum:

import clr    
clr.AddReference( 'vjslib, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a' )
from java.io import *
from java.util.zip import *
from System.IO import *
from System.IO.Compression import *
from System import Array, Byte, SByte, Int64
from System.Text import Encoding

def readFileToSByteArray(name):
  # open a file via Java I/O
  raw = RandomAccessFile(name, 'r')
  # print "File size in bytes = %d "  % (raw.length())

  # read into signed byte array
  dim = Array[Int64]([raw.length()])
  helper = Array[SByte]([0])
  buffer = Array.CreateInstance(helper[0].GetType(), dim)
  raw.readFully(buffer)
  raw.close()
  return buffer
  
def sbyteArrayToUbyteArray(buffer, offset=0, length=-1):
  if length < 0:
    length = buffer.Length
 
  dim = Array[Int64]([length])
  helper = Array[Byte]([0])
  ubuffer = Array.CreateInstance(helper[0].GetType(), dim)
  
  # copy into unsigned byte array
  for i in range(0,length):
    ubuffer[i] = buffer[i+offset] & 0xff
  return ubuffer
  
def jdeflate(buffer):
  sink0 = ByteArrayOutputStream()
  sink = DeflaterOutputStream(sink0)
  sink.write(buffer, 0, buffer.Length)
  sink.close()
  sink0.close()
  return sink0.toByteArray()
  
def update_adler(adler, buffer):
  s1 = adler & 0xffff
  s2 = (adler >> 16) & 0xffff
  BASE = 65521
  for n in range(0,buffer.Length):
          s1 = (s1 + buffer[n]) % BASE
          s2 = (s2 + s1)     % BASE
  return (s2 << 16) + s1
  
def ninflate(buffer):
  mem = MemoryStream(buffer)
  inflate = DeflateStream(mem, CompressionMode.Decompress)

  mem = MemoryStream()
  while True:
    x = inflate.ReadByte()
    if x < 0:
      break
    mem.WriteByte(x)
 
  inflate.Close()
  mem.Close()
  return mem.ToArray()
  
 
##==========================================

# select a file
name = ...

# make signed, unsigned buffers and string
buffer = readFileToSByteArray(name)
print "Constructed buffer size %d" % (buffer.Length)
ubuffer = sbyteArrayToUbyteArray(buffer)
instring = Encoding.Default.GetString(ubuffer)

# compute the Adler32 checksum
adler = update_adler(1, ubuffer)
print "Adler32 = %d" % (adler)
 
# deflate in Java
deflated = jdeflate(buffer)

# check the ZLIB header -- expect 120, 156 actually (120,-100)
first = (deflated[0] & 0xff)
second = (deflated[1] & 0xff)
print "header value EXPECTED 120:156 ACTUAL %d:%d" % (first, second)

i = deflated.Length-4
x = ((deflated[i]&0xff) << 24) | ((deflated[i+1]&0xff) << 16) | ((deflated[i+2]&0xff) << 8) | (deflated[i+3]&0xff)
print "Java Adler32 = %d" % (x)

# discard header and Adler32 tail
newbuffer = sbyteArrayToUbyteArray(deflated, 2, deflated.Length-6)
print "ZLIB length %d" % (deflated.Length)
print "deflate section length %d" % (newbuffer.Length)

# reflate
out = ninflate(newbuffer)

# compare with input
print "reflated buffer length : %d" % (out.Length)
outstring = Encoding.Default.GetString(out)
same = outstring.Equals(instring)
print "Output == Input ?? %s" %(str(same))

That works fine; but the converse, taking a .Net deflate and adding the appropriate top and tail took a bit of getting to work.

First pitfall -- the InflaterInputStream has to be read in chunks as large as feasible, rather than byte at a time, so as not to throw a premature EOFException. That overcome, I got a result of the right length, but differing in the final few characters for the file under test, which I resolved by doing a belt-and-braces closing of the deflate operation. The un-refactored code I currently have continues from the above as:

# now the other way around
sink0 = MemoryStream()
sink = DeflateStream(sink0, CompressionMode.Compress)
sink.Write(ubuffer, 0, ubuffer.Length)
## overkill the flush and close here
sink.Flush()
sink.Close()
sink0.Flush()
sink0.Close()

deflated = sink0.ToArray()
print "deflate section length %d" % (deflated.Length)

#now inflate
dim = Array[Int64]([deflated.Length+6])
jbuffer = Array.CreateInstance(buffer[0].GetType(), dim)

jbuffer[0] = 120 # 0111 0100 = 120 : 32 kbit window, Deflate
jbuffer[1] = -100 # 10 0 ????? = 128 + x : default compress, no dict, checksum

def sbyte(x):
  y = x & 0xff
  if y > 127:
    return y - 256
  return y
 
for i  in range(0,deflated.Length):
    jbuffer[i+2] = sbyte(deflated[i])
 
i = deflated.Length+2
jbuffer[i] = sbyte(adler>>24) 
jbuffer[i+1] = sbyte(adler>>16) 
jbuffer[i+2] = sbyte(adler>>8)
jbuffer[i+3] = sbyte( adler )
 
 
source0 = ByteArrayInputStream(jbuffer)
source = InflaterInputStream(source0)

dim = Array[Int64]([ubuffer.Length])
helper = Array[SByte]([0])
xbuffer = Array.CreateInstance(helper[0].GetType(), dim)

## take big bites
offset = 0
try:
 while True:
  x = source.read(xbuffer, offset, xbuffer.Length-offset)
  if x < 0:
    break
  offset += x
except EOFException:
  pass

reflated = sbyteArrayToUbyteArray(xbuffer)

print "reflated length is %d" % (reflated.Length)
outstring = Encoding.Default.GetString(reflated)

same = outstring.Equals(instring)
print same

This should allow me to simplify the baggage accumulated for the C#/Erlang bridge, which currently uses a second-generation port of the original 'C' ZLIB for this sort of interoperation.

Much Later — I have ported this to F# on .net 4.5.



There's a lot of buffer to stream to buffer conversion involved -- the {n,j}{in,de}flate functions are just that. It's ndeflateZLIB, ninflateZLIBFull that do the complete transformation from a byte array to a compressed byte array and back again. The sbyte bit is purely a J# compatibility thing. Full solution here.