Category Archives: Python

Dune 2, .PAK files unPAKer in python.

about 15 years ago i wrote an unpacker for dune 2 .pak files (in pascal). i wanted to have the .voc (creative voice files).i decided to take a couple of hours to figure out (again) the format for these pak files and write a python script to unpack them. its probably useless these days but maybe someone can use it ūüôā

import struct
import os

def unPAK(filename):
    print "unPAKING:", filename
    fnlist = []
    filesize = os.path.getsize(filename)
    f = open(filename,"rb")
    count = 0;
    size = f.read(4)
    beginData =  struct.unpack('i', size)[0]
    print beginData
    data = f.read(beginData-4)
    data = filter(None, data.split('\x00'))
    print data
    fnlist.append((data[0],beginData))
    for i in range(1,len(data)):
        if ((i % 2) == 1):
            if len(data[i]) == 1:
                print i, data[i]
                pos = struct.unpack('i',data[i]+"\x00\x00\x00")
            elif len(data[i]) == 2:
                print i, data[i]
                pos = struct.unpack('i',data[i]+"\x00\x00")
            elif len(data[i]) == 3:
                print i, data[i]
                pos = struct.unpack('i',data[i]+"\x00")
            fnlist.append((data[i+1],pos[0]))
    print fnlist
    for i in range(1,len(fnlist)):
          total = int(fnlist[i][1]) - int(fnlist[i-1][1])
          print "saving file: ",fnlist[i-1][0], "total bytes", total
          fdata = f.read(total)
          f2 = open(fnlist[i-1][0],"w")
          f2.write(fdata)
          f2.close() 

    #last file.
    print filesize
    total = filesize - int(fnlist[i][1])
    print "saving file: ",fnlist[i][0], "total bytes", total
    f2 = open(fnlist[i][0],"w")
    f2.write(f.read(total))
    f2.close()
    print fnlist

    f.close()

unPAK("DUNE.PAK")
unPAK("ENGLISH.PAK")
unPAK("FINALE.PAK")
unPAK("HARK.PAK")
unPAK("HERC.PAK")
unPAK("INTRO.PAK")
unPAK("INTROVOC.PAK")
unPAK("MENTAT.PAK")
unPAK("MERC.PAK")
unPAK("ORDOS.PAK")
unPAK("SCENARIO.PAK")
unPAK("SOUND.PAK")
unPAK("VOC.PAK")

#unPAK("ATRE.PAK")
#unPAK("XTRE.PAK")

Facebook puzzles – solutions (battleship + simon says w/ thrift)

A year ago i was looking for a job, Facebook was a valid choice. but before they would contact you, you would have to solve their puzzles. These puzzles have various difficulties, and i managed to solve quite a few, i am publishing some of my solutions here so that people can understand a little bit about thrift.

 

the first one is Simon says (the easiest) and it is only there to get you going with the networking protocol of thrift.

#!/usr/bin/env python
import sys
from random import randrange
sys.path.append('../gen-py')

import SimonSays 
from ttypes import *
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

# Make socket
transport = TSocket.TSocket('thriftpuzzle.facebook.com', 9030)  
#transport = TSocket.TSocket('localhost', 9090)

# Buffering is critical. Raw sockets are very slow
transport = TTransport.TBufferedTransport(transport)

# Wrap in a protocol
protocol = TBinaryProtocol.TBinaryProtocol(transport)

# Create a client to use the protocol encoder
client = SimonSays.Client(protocol)

# Connect!
transport.open()

#Player one or two (two is run via command line) 
if len(sys.argv) <= 1:
    emailAddress = 'blah@gmail.com'
    gameID = client.registerClient(emailAddress)
    print 'gameID =', gameID
else:
    joinstatus = False
    emailAddress = 'blah2@gmail.com'    
    try:
        gameID = int(sys.argv[1])
        print gameID,'from command argument'
        joinstatus = client.join(gameID, emailAddress)
    except DuplicateEmailException:
        print "bad email"

    if joinstatus == False:
        print "Failed to join game: " + str(gameID)
        sys.exit(1)

endT = False
while endT==False:
    listC = client.startTurn()
    print 'list',listC
    for color in listC:
        print 'c',color
        res = client.chooseColor(color)
        print res
    endT = client.endTurn()
print client.winGame() 
# Close!
transport.close()

The second one is a little harder, it was the classic battleship game. if you look around the web hard enough you will find all the best tactics to win ūüôā

import sys
from random import randrange
sys.path.append('../gen-py')

import Battleship2
#from battleship2 import Battleship2
from ttypes import *
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

class Player:
    attackPattern = []
    attackNextCellIndex = 0
    attackPatternLength = 0
    emailAddress = ''
    hitLocations = {}
    state = 0 #0 is search,  1 is try to sink via stack
    hitStack = []
    roundRobin = [(0,-1),(0,1),(-1,0),(1,0)]
    currentRobin = roundRobin[:]
    
    def __init__(self,client,emailAddress):
        self.client = client
        self.placePieces()
        self.emailAddress = emailAddress
        print 'finished placing', emailAddress   
        self.createPattern()
        
    def placePieces(self):
        for piece in xrange(1, 6):
            while not self.client.placePiece(piece, 
                         Coordinate(randrange(10), randrange(10)), bool(randrange(2))):
                pass  
    
    def createPattern(self): 
        for i in range(0,10):
            for j in range(0,10):
                if (j % 2 ==0):
                    if (i % 2 ==0):
                        self.attackPattern.append((i,j))
                else:
                    if (i % 2 ==1):
                        self.attackPattern.append((i,j))
        print self.attackPattern
        self.attackPatternLength = len(self.attackPattern)
    
    def changeStateHit(self):        
        self.state = 1 
        
    def changeStateSearch(self):       
        self.state = 0    
    
    def resetRobin(self):
        self.currentRobin = self.roundRobin[:] 
                    
    def addHitStack(self,result,x,y):
        if result==6:
            self.hitStack.append((x,y))
            self.changeStateHit()   
            
    def attackCell(self,x,y):
        result =  self.client.attack(Coordinate(x,y))
        if result in [1,2,3,4]:
            print 'sunk ship=>',result
        return result
        
    def attackNext(self):
        print self.attackNextCellIndex,'my turn, i am gonna kill this guy'
        x,y = self.attackPattern[self.attackNextCellIndex]
        if self.attackNextCellIndex0:
                    "get next most probable cell (for now its the first)"
                    (a,b) = self.currentRobin.pop()
                    "if cell between bounds of grid"
                    if a+x>=0 and a+x<=9 and b+y>=0 and b+y<=9:
                        result = self.attackCell(x+a,y+b)
#                        if result==6:
#                            self.resetRobin()
                        self.addHitStack(result, x+a, y+b)
                        foundNextCell = True
        "may need to put code that knows orientation of ship "
        "if there are two shots next to each other"            
                 
    def hitNext(self):
        "keep searchin"
        if self.state==0: 
            self.attackNext()                    
        else:
            "attack known areas"
            self.attackFirstInStack()
        

# Make socket
transport = TSocket.TSocket('thriftpuzzle.facebook.com', 9031)  
#transport = TSocket.TSocket('localhost', 9090)

# Buffering is critical. Raw sockets are very slow
transport = TTransport.TBufferedTransport(transport)

# Wrap in a protocol
protocol = TBinaryProtocol.TBinaryProtocol(transport)

# Create a client to use the protocol encoder
client = Battleship2.Client(protocol)

# Connect!
transport.open()

    
#Player one or two (two is run via command line) 
if len(sys.argv) <= 1:
    emailAddress = 'blah@gmail.com'
    gameID = client.registerClient(emailAddress)
    print 'gameID =', gameID
else:
    joinstatus = False
    emailAddress = 'blah2@gmail.com'    
    try:
        gameID = int(sys.argv[1])
        print gameID,'from command argument'
        joinstatus = client.join(gameID, emailAddress)
    except DuplicateEmailException:
        print "bad email"

    if joinstatus == False:
        print "Failed to join game: " + str(gameID)
        sys.exit(1)
      
#create a player
player = Player(client,emailAddress)            

totalMoves = 0
while totalMoves<1000:
    totalMoves+=1
    try:
        isTurn = client.isMyTurn()
    except GameOverException:
        break
    if isTurn:    
        player.hitNext()
        print client.winGame() 
# Close!
transport.close()

svmLight, a Python Script that Compute the weight vector of linear SVM based on the model file

While working on my Thesis i had to get the features’ weights from the SVM Model. Thorsten Joachims published a perl script but i was using Python, i rewrote his script in python and he had graciously put a Download Link on his website.

You can find the original Perl script here: http://www.cs.cornell.edu/people/tj/svm_light/svm_light_faq.html
And the Python Script Here:  http://www.cs.cornell.edu/people/tj/svm_light/svm2weight.py.txt

Using this script will get you all the features’ weights. this is incredibly useful later on,

you can systematically eliminate features, as follows:
  • After training on all current features, select K% with highest SVM weight and K% with lowest (most negative) SVM weights
  • Iterate

you will notice that you can get higher prediction result with only a subset of your features.

* if you use this script in your publication or commercial product please credit me

 

# Compute the weight vector of linear SVM based on the model file
# Original Perl Author: Thorsten Joachims (thorsten@joachims.org)
# Python Version: Ori Cohen (orioric@gmail.com)
# Call: python svm2weights.py svm_model

import sys
from operator import itemgetter

try:
    import psyco
    psyco.full()
except ImportError:
    print 'Psyco not installed, the program will just run slower'

def sortbyvalue(d,reverse=True):
    ''' proposed in PEP 265, using  the itemgetter this function sorts a dictionary'''
    return sorted(d.iteritems(), key=itemgetter(1), reverse=True)

def sortbykey(d,reverse=True):
    ''' proposed in PEP 265, using  the itemgetter this function sorts a dictionary'''
    return sorted(d.iteritems(), key=itemgetter(0), reverse=False)

def get_file():
    """
    Tries to extract a filename from the command line.  If none is present, it
    assumes file to be svm_model (default svmLight output).  If the file
    exists, it returns it, otherwise it prints an error message and ends
    execution.
    """
    # Get the name of the data file and load it into
    if len(sys.argv) < 2:
        # assume file to be svm_model (default svmLight output)
        print "Assuming file as svm_model"
        filename = 'svm_model'
        #filename = sys.stdin.readline().strip()
    else:
        filename = sys.argv[1]

    try:
        f = open(filename, "r")
    except IOError:
        print "Error: The file '%s' was not found on this system." % filename
        sys.exit(0)

    return f

if __name__ == "__main__":
    f = get_file()
    i=0
    lines = f.readlines()
    printOutput = True
    w = {}
    for line in lines:
        if i>10:
            features = line[:line.find('#')-1]
            comments = line[line.find('#'):]
            alpha = features[:features.find(' ')]
            feat = features[features.find(' ')+1:]
            for p in feat.split(' '): # Changed the code here.
                a,v = p.split(':')
                if not (int(a) in w):
                    w[int(a)] = 0
            for p in feat.split(' '):
                a,v = p.split(':')
                w[int(a)] +=float(alpha)*float(v)
        elif i==1:
            if line.find('0')==-1:
                print 'Not linear Kernel!\n'
                printOutput = False
                break
        elif i==10:
            if line.find('threshold b')==-1:
                print "Parsing error!\n"
                printOutput = False
                break

        i+=1
    f.close()

    #if you need to sort the features by value and not by feature ID then use this line intead:
    #ws = sortbyvalue(w) 

    ws = sortbykey(w)
    if printOutput == True:
        for (i,j) in ws:
            print i,':',j
            i+=1

 

python import shelve when dictionary doesnt have enough memory.

I like using dictionaries in python, sometime i abuse then and load an obscene amount of data into them. what is good about dictionaries in python (hashmaps in other languages) is that you are able to access (write/read) any cell in O(1), and since its memory based it is super fast.

One of my Machine Learning algorithms relied solely on dictionary manipulations. i keep adding data into dictionary records, so you can imagine what happens after i loaded roughly 2gb of data when i only have 3gb of ram. dictionaries do not use virtual space so basically what you have is the amount of free RAM space available to you.

However you can always manipulate your data in other forms and probably rewrite your entire algorithm to use some kind of other more efficient data structure, but time is money and i didnt want to reinvent the wheel all over again.

I knew python has every solution available to men kind, but i just had to look for it. i was  aware of cPickle, but i wanted to find something that behaves like a dictionary but uses the hard drive space. wasnt long before i found out about shelve. so which one is the one i need?

A google search lead me to this page, the guy who posted there gives the gist of it all.

  • Use¬†Shelve¬†when you have a large amount of data that you only need small parts of at a time.
  • Use cPickle when you have data that you want to access all at once.
  • Basically between¬†Shelve¬†and cPickle, you are trading disk-access speed for in-memory-access speed.

It felt like shelve was what i needed. so what shelve is ?

A ‚Äúshelf‚ÄĚ is a persistent, dictionary-like object. The difference with ‚Äúdbm‚ÄĚ databases is
that the values (not the keys!) in a shelf can be essentially arbitrary Python objects
‚ÄĒ anything that the pickle module can handle. This includes most class instances,
recursive data types, and objects containing lots of shared sub-objects.
The keys are ordinary strings.

To make things short, i managed to solve my problem, with good speed and memory consumption. the trick is to load a shelve like so:

shelve.open(name,'n',writeback=True)

Now you need to remember that if you do writeback=True. you are still reading stuff to your RAM in the form of a shelve cache. unless you do dict.sync() or dict.close() and then it writes everything to the HD and cleans up the cache. (if you start putting more data into the dict, it will accumulate all the dict[key]=value that you have used from that point and until you sync() again)

A shelve is not 100% compatible with your old dictionary code, there is one change that you need to take care of, the following is taken from the python document and explains it all:

# as d was opened WITHOUT writeback=True, beware:

d['xx'] = range(4)  # this works as expected, but...
d['xx'].append(5)   # *this doesn't!* -- d['xx'] is STILL range(4)!

# having opened d without writeback=True, you need to code carefully:
temp = d['xx']      # extracts the copy
temp.append(5)      # mutates the copy
d['xx'] = temp      # stores the copy right back, to persist it

d=shelve.open(filename,writeback=True) would let you just code,
d['xx'].append(5) and have it work as expected, BUT it would also,
consume more memory and make the d.close() operation slower. also

Ultimately this trick is combination of using most of your RAM and only writing to the HD when you realy MUST and this speeds up things a LOT.

So how do you know when to write ?

You need a python library to monitor how much RAM is in total (self.total) and how much is available (self.available)..
lets say your available memory is 0.15 of the total ram you can sync() (save). this lets you fill up 85% of your memory each time before you do a dict.sync() (the writing of the cache to the db)

if float(self.available) / float(self.total) <= 0.15:
  startTime = time.time()
  print 'before total',self.total,'free:',self.available
  dict.sync()
  delay = 60
  time.sleep(delay)
  print 'after total:',self.total,'free:',self.available,
        'elapsed time:',(time.time() - startTime)-float(delay)

Do you see that delay up there, its another one of those dirty tricks that needs a better code to replace it, i needed it because when you start putting more data into the dict without waiting around 60 seconds, it will sync every second until the First dict.sync() command finished and python frees up all that used memory that the process is eating up, giving it back to the OS. this takes usually takes 10-60 seconds on my system. honestly this was a workaround to make sure everything works, but i wouldnt mind some suggestions on fixing it.

Bottom line this is almost was fast as just using RAM in some cases (unless you have enough RAM obviously).
also note that if you had a dict of 2GB of RAM, the amount of space you would use as a shelve DB is just 750MB.

When you are finished writing to the shelve. you might want to make sure the cache is clean by using sync(). if you are going to read dict items you need to remember that every item you read is indeed being placed in the cache, and again you might fill your memory to the limit, so sync must be used again.

There is one thing that i have always wanted to implement here. a systematic caching, that manages the K dictionary records that you are accessing more than every other key. and these K records will be kept in memory even when you run out of memory and sync everything to the DB. i suspect that this will boost speed a little, as you would have the ones you always need in RAM and always accessible and the ones you dont being dumped into the DB.