Have you ever tried opening a very big CSV file with Excel?
I tried to open a CSV file with 5 million rows of data using Microsoft Excel:
- 1,048,576-row limit. I was aware of the ~1 million row limit; not all the data was loaded, and Excel even showed a notification modal stating this.
- 32,767-character-per-cell limit. This one I was not aware of. After opening a file that exceeded it, the overflow spilled into new rows and looked like a mess, even though the file was properly formatted when opened with Notepad.
- This one is obvious, but formulas and filters were very slow (expected, given the size of the data).
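If you want to check whether a file will hit these limits before opening it in Excel, here is a minimal sketch (the file name big.csv is just a placeholder):

```python
import csv

# Excel's hard limits
MAX_ROWS = 1_048_576
MAX_CELL_CHARS = 32_767

rows = 0
oversized_cells = 0
with open("big.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        rows += 1
        oversized_cells += sum(1 for cell in row if len(cell) > MAX_CELL_CHARS)

print(f"{rows:,} rows ({'exceeds' if rows > MAX_ROWS else 'within'} Excel's row limit)")
print(f"{oversized_cells} cells exceed the 32,767-character limit")
```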
Why did I need to open a file with over 5 million rows in the first place?
I was actively learning machine learning and trying to build a dataset for supervised learning. I wanted to open the file to mark classes and values, for training classification and regression models.
Workarounds I used (sketched in code after this list):
- Filtered out and removed rows I was not currently using. This cut the size by almost half.
- Split the file into smaller CSV chunks by grouping the data.
- Decoupled the features/parameters column into separate CSV files.
- Pickled the smaller CSV files with Python, because they stored list data as strings and parsing them with ast.literal_eval() took a lot of time (a lot!). Loading the pickled files was significantly faster.
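Roughly what the filtering, decoupling, and splitting looked like with pandas. This is a sketch, not my exact script; column names like label, group, id, and features are placeholders for the actual columns:

```python
import pandas as pd

df = pd.read_csv("big.csv")  # ~5 million rows

# Drop rows I am not currently labelling (the condition is a placeholder)
df = df[df["label"].notna()]

# Move the heavy feature column into its own file,
# keeping an id column so it can be joined back later
df[["id", "features"]].to_csv("features.csv", index=False)
df = df.drop(columns=["features"])

# Split the remaining rows into one CSV per group,
# each small enough for Excel's 1,048,576-row limit
for group, chunk in df.groupby("group"):
    chunk.to_csv(f"chunk_{group}.csv", index=False)
```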
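And the pickling step: pay the ast.literal_eval() parsing cost once, then save the parsed DataFrame so later loads skip it entirely. Again a sketch, with features as a placeholder column name:

```python
import ast
import pandas as pd

# Slow path: every load re-parses the list-as-string column
df = pd.read_csv("chunk_a.csv")
df["features"] = df["features"].apply(ast.literal_eval)

# Parse once, then pickle the parsed frame
df.to_pickle("chunk_a.pkl")

# Later loads are much faster: the lists come back as real Python objects
df = pd.read_pickle("chunk_a.pkl")
```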
After doing all of the above, I was able to open the CSV files in MS Excel and mark and add values in a much more organised way.
These files will be merged into one big file just before training, using the pandas DataFrame concat method.
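The merge itself is short; something along these lines, assuming the chunk file names from the sketches above:

```python
import glob
import pandas as pd

# Load every labelled chunk and stack them back into one training frame
frames = [pd.read_pickle(path) for path in sorted(glob.glob("chunk_*.pkl"))]
training_df = pd.concat(frames, ignore_index=True)
```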