I like using dictionaries in python, sometime i abuse then and load an obscene amount of data into them. what is good about dictionaries in python (hashmaps in other languages) is that you are able to access (write/read) any cell in O(1), and since its memory based it is super fast.
One of my Machine Learning algorithms relied solely on dictionary manipulations. i keep adding data into dictionary records, so you can imagine what happens after i loaded roughly 2gb of data when i only have 3gb of ram. dictionaries do not use virtual space so basically what you have is the amount of free RAM space available to you.
However you can always manipulate your data in other forms and probably rewrite your entire algorithm to use some kind of other more efficient data structure, but time is money and i didnt want to reinvent the wheel all over again.
I knew python has every solution available to men kind, but i just had to look for it. i was aware of cPickle, but i wanted to find something that behaves like a dictionary but uses the hard drive space. wasnt long before i found out about shelve. so which one is the one i need?
A google search lead me to this page, the guy who posted there gives the gist of it all.
- Use Shelve when you have a large amount of data that you only need small parts of at a time.
- Use cPickle when you have data that you want to access all at once.
- Basically between Shelve and cPickle, you are trading disk-access speed for in-memory-access speed.
It felt like shelve was what i needed. so what shelve is ?
A “shelf” is a persistent, dictionary-like object. The difference with “dbm” databases is
that the values (not the keys!) in a shelf can be essentially arbitrary Python objects
— anything that the pickle module can handle. This includes most class instances,
recursive data types, and objects containing lots of shared sub-objects.
The keys are ordinary strings.
To make things short, i managed to solve my problem, with good speed and memory consumption. the trick is to load a shelve like so:
Now you need to remember that if you do writeback=True. you are still reading stuff to your RAM in the form of a shelve cache. unless you do dict.sync() or dict.close() and then it writes everything to the HD and cleans up the cache. (if you start putting more data into the dict, it will accumulate all the dict[key]=value that you have used from that point and until you sync() again)
A shelve is not 100% compatible with your old dictionary code, there is one change that you need to take care of, the following is taken from the python document and explains it all:
# as d was opened WITHOUT writeback=True, beware:
d['xx'] = range(4) # this works as expected, but...
d['xx'].append(5) # *this doesn't!* -- d['xx'] is STILL range(4)!
# having opened d without writeback=True, you need to code carefully:
temp = d['xx'] # extracts the copy
temp.append(5) # mutates the copy
d['xx'] = temp # stores the copy right back, to persist it
d=shelve.open(filename,writeback=True) would let you just code,
d['xx'].append(5) and have it work as expected, BUT it would also,
consume more memory and make the d.close() operation slower. also
Ultimately this trick is combination of using most of your RAM and only writing to the HD when you realy MUST and this speeds up things a LOT.
So how do you know when to write ?
You need a python library to monitor how much RAM is in total (self.total) and how much is available (self.available)..
lets say your available memory is 0.15 of the total ram you can sync() (save). this lets you fill up 85% of your memory each time before you do a dict.sync() (the writing of the cache to the db)
if float(self.available) / float(self.total) <= 0.15:
startTime = time.time()
print 'before total',self.total,'free:',self.available
delay = 60
print 'after total:',self.total,'free:',self.available,
'elapsed time:',(time.time() - startTime)-float(delay)
Do you see that delay up there, its another one of those dirty tricks that needs a better code to replace it, i needed it because when you start putting more data into the dict without waiting around 60 seconds, it will sync every second until the First dict.sync() command finished and python frees up all that used memory that the process is eating up, giving it back to the OS. this takes usually takes 10-60 seconds on my system. honestly this was a workaround to make sure everything works, but i wouldnt mind some suggestions on fixing it.
Bottom line this is almost was fast as just using RAM in some cases (unless you have enough RAM obviously).
also note that if you had a dict of 2GB of RAM, the amount of space you would use as a shelve DB is just 750MB.
When you are finished writing to the shelve. you might want to make sure the cache is clean by using sync(). if you are going to read dict items you need to remember that every item you read is indeed being placed in the cache, and again you might fill your memory to the limit, so sync must be used again.
There is one thing that i have always wanted to implement here. a systematic caching, that manages the K dictionary records that you are accessing more than every other key. and these K records will be kept in memory even when you run out of memory and sync everything to the DB. i suspect that this will boost speed a little, as you would have the ones you always need in RAM and always accessible and the ones you dont being dumped into the DB.