Working with big vDatabases

Valentina is a perfect choice to work with a big data. It employs vertical storage approach. It means – each column has own virtual file to be stored in. Most of db vendors use a horizontal storage – it stores a table data in record-by-record manner.

There are some vertical storage advantages:

  • There is no limit for a tuple length. In other words you may have up to 16 mln. fields in a table and you should not think of a record length at all. Horizontal storages practically always have some restrictions.
  • It’s more speedy in modifying the data. When you choose a record and want to update only couple of fields you should not rewrite all the record – only that fields data will be changed.
  • Also there is a huge speed up in getting data. Say you have a table with 100 fields but you want to get only couple of them – so you do something like this: “select f1, f2 from t1”. In this case vertical storage will read only two virtual files. In horizontal storage approach you have to read all the fields anyway.

Last days I tried to work with a pretty big db. Firstly I get the data from

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17203
Then import it in just created database using vStudio. It taked some time (~1 hour) but there was a debug vStudio build and ~6Gb of data. Finally we got a db – single table, 11 fields, ~120mln. records.
Then we played with it a bit:
  • Open/Close this db takes almost zero time.
  • “select * from t1” takes ~15sec. (Most of this time is for locking records. I think it would be times less using “no-lock” cursor)
  • Make some field nullable – takes almost zero time.
  • Indexing of string field (2Gb field data) – 35 minutes.
  • Searching with index – “select * from t1 where Allele1_Forward = ‘T'” – ~27mln records in result –  takes 18sec (again – most of the time is a locking records).
  • Similar searching without index – 586sec.
  • select count(GC_Score) from t1 where Allele1_Forward = ‘T’ – ~2sec (You see – there is no locking records and time reduces from 18sec to only 2sec).
  • select sum(GC_Score) from t1 where Allele1_Forward = ‘T’ – ~47sec
  • Converting string field to the double type – 20minutes
  • “select Allele1_Forward from t1 order by Allele1_Forward” – getting an exception – unable to allocate the memory for sorting data. Must be ok on 64-bit version.

Seems to be a pretty good results.