
Tried to open a very big CSV file in Excel

Have you tried opening a very big CSV file with Excel?

I tried to open a CSV file with 5 million rows of data using Microsoft Excel. Here is what I ran into:

  • The 1,048,576-row limit. I was aware of the roughly 1 million row cap, but not all of the data was loaded, and Excel even showed a notification modal stating this.
  • The 32,767-characters-per-cell limit, which I was not aware of. After opening a file that exceeded it, the overflow spilled onto new rows and the sheet looked like a mess, even though the file was properly formatted when opened with Notepad. (A quick script for checking both limits up front follows this list.)
  • This one is obvious, but formulas and filters were very slow (expected, given the size of the data).
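
As an aside, both limits are easy to check before opening a file. Here is a minimal sketch of how one might do that (this wasn't part of my original workflow, and "data.csv" is just a placeholder name):

    import csv

    # Excel's documented worksheet limits.
    EXCEL_MAX_ROWS = 1_048_576
    EXCEL_MAX_CELL_CHARS = 32_767

    rows = 0
    longest = 0
    with open("data.csv", newline="", encoding="utf-8") as f:
        for record in csv.reader(f):
            rows += 1
            longest = max(longest, max((len(cell) for cell in record), default=0))

    print(f"{rows:,} rows (Excel limit: {EXCEL_MAX_ROWS:,})")
    print(f"longest cell: {longest:,} chars (Excel limit: {EXCEL_MAX_CELL_CHARS:,})")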


Why did I need to open a file with over 5 million rows in the first place?

I was actively learning machine learning and trying to build a dataset for supervised learning. I wanted to open the file to mark classes and values for training classification and regression models.

The workarounds I used include:

  • Filtered out and removed the rows I was not currently using. This cut the size almost in half.
  • Split the file into smaller CSV chunks by grouping the data.
  • Decoupled the feature/parameter columns into separate CSV files.
  • Pickled the smaller CSV files with Python, because they stored list data as strings and parsing them with ast.literal_eval() took a lot of time (a lot!). Loading the pickled files was significantly faster. (A sketch of the split-and-pickle steps follows this list.)
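
Roughly how the split-and-pickle steps might look with pandas (the file names, the grouping column "category", and the list-valued "features" column are placeholders I've assumed, not my actual column names):

    import ast

    import pandas as pd

    # Load the big CSV once.
    df = pd.read_csv("big.csv")

    # Split into smaller CSVs by grouping, so each chunk stays under Excel's row limit.
    for label, group in df.groupby("category"):
        group.to_csv(f"chunk_{label}.csv", index=False)

    # Parse the stringified lists once with ast.literal_eval (the slow part),
    # then pickle the parsed DataFrame so later loads skip the parsing entirely.
    chunk = pd.read_csv("chunk_A.csv")
    chunk["features"] = chunk["features"].apply(ast.literal_eval)
    chunk.to_pickle("chunk_A.pkl")

    # Reloading the pickle is much faster than re-parsing strings from the CSV.
    chunk = pd.read_pickle("chunk_A.pkl")
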
After doing the above, I was able to open the CSV files in MS Excel and mark and add values in a much more organised way.
These files will be merged back into one big file just before training, using the pandas DataFrame concat method, roughly as sketched below.
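
A minimal sketch of that merge step, assuming the pickled chunk files from earlier (the glob pattern is a placeholder):

    import glob

    import pandas as pd

    # Collect every pickled chunk and stack them into one DataFrame for training.
    parts = [pd.read_pickle(path) for path in sorted(glob.glob("chunk_*.pkl"))]
    full = pd.concat(parts, ignore_index=True)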

