Custom ImageDataGenerator() for half a million images where labels and pixels are in 2 separate DataFrames using Keras (or any other library)
I have 2 separate DataFrames which contain information for around half a million images, summing up to 6+ GB. There are 4 .parquet files, which I had to pd.concat() one by one to make a new DataFrame imgs containing the image's id column and the pixels of each 137*236 image, in columns named 0 to 32331.
imgs
>>
image_id 0 1 ... 32330 32331
0 Train_50210 246 253 ... 251 250
1 Train_50211 250 245 ... 241 244
... ...
... ...
... ...
453651 Train_50210 0 253 ... 251 250
453652 Train_50211 250 245 ... 241 244
The second source, a csv, contains each image's labels: the values of three different classes that each image belongs to. I imported the csv into train.
train
>>
image_id class_1 class_2 class_3
0 Train_5 15 9 5
1 Train_1 159 0 0
...
...
...
453651 Train_342524 0 15 34
453652 Train_9534 18 0 7
The number of rows in train is equal to the number of rows in imgs. This means that the Y-labels of the images are stored in train and the corresponding pixel values are in imgs.
I tried merging both DataFrames using pd.merge(imgs,train,on='image_id').drop('image_id'), but it took a long time and my kernel died every time while processing the above 2 steps. Please suggest an alternative approach if there is one.
Could somebody please tell me how to make a custom data generator for
1. producing batches of images
2. augmenting images for robustness
using keras or any other library for fast processing?
Alternatively, could someone please tell me how to use ImageDataGenerator().flow() in my case?
This is what I would suggest: load the DataFrame piece by piece rather than all at once. Loading the entirety of it at the same time may well exceed your RAM, hence the dying kernel.
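A minimal sketch of the piece-by-piece idea, assuming the four .parquet files can be read independently and that the labels fit in memory on their own. The function name iter_chunks, its reader parameter, and the file paths are hypothetical; the column names image_id, class_1, class_2 and class_3 come from the question. The cast to uint8 assumes the pixel values are 8-bit, as the printed sample suggests. Labels are looked up by image_id for each file, so the full pd.merge() is never needed.

```python
import numpy as np
import pandas as pd

LABEL_COLUMNS = ["class_1", "class_2", "class_3"]


def iter_chunks(paths, labels, reader=pd.read_parquet):
    """Yield (pixels, targets) for one file at a time.

    Only one file's pixels are held in memory at once, instead of
    concatenating all files and merging them with the labels.
    """
    # Index the small label table once so lookups by image_id are cheap.
    label_index = labels.set_index("image_id")
    for path in paths:
        chunk = reader(path)
        ids = chunk["image_id"].to_numpy()
        # Assumption: pixel values fit in 8 bits (0-255).
        pixels = chunk.drop(columns="image_id").to_numpy().astype(np.uint8)
        # Pull the label rows in the same order as the pixel rows.
        targets = label_index.loc[ids, LABEL_COLUMNS].to_numpy()
        yield pixels, targets
```

Each yielded pair can be trained on, or written to disk, before the next file is read. With real data the call would look like iter_chunks(list_of_parquet_paths, train), where list_of_parquet_paths stands for wherever the four files live.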
Then iterate through the DataFrame row by row, take the 32332 pixel columns, reshape them into a 137x236 image, and save it to disk under an appropriate name in the folder train_data/class_number/. You can then use keras ImageDataGenerator().flow_from_directory().
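The reshape-and-save step could look roughly like the sketch below. It assumes each row is a single-channel image flattened in row-major order (height 137, width 236) with 8-bit pixel values, and that Pillow is installed for writing PNG files. The helper names row_to_image and save_image are hypothetical, and which of the three class columns supplies class_number is left to the caller.

```python
from pathlib import Path

import numpy as np

HEIGHT, WIDTH = 137, 236  # image size stated in the question


def row_to_image(pixel_row):
    """Turn one flat row of 32332 pixel values into a 137x236 array."""
    # Assumption: row-major layout and 8-bit grayscale values.
    return np.asarray(pixel_row, dtype=np.uint8).reshape(HEIGHT, WIDTH)


def save_image(pixel_row, image_id, class_number, root="train_data"):
    """Write one image to <root>/<class_number>/<image_id>.png."""
    # Imported here so row_to_image stays usable without Pillow.
    from PIL import Image

    target_dir = Path(root) / str(class_number)
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{image_id}.png"
    Image.fromarray(row_to_image(pixel_row)).save(path)
    return path
```

Calling save_image once per row produces the one-folder-per-class layout that flow_from_directory() reads.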
On the format: if each image is a single-channel 137x236 image, then the number of pixel columns should be 137*236 = 32332, which matches the columns 0 to 32331, so each row holds one flattened grayscale image. It is still worth confirming that the format of the data is what you expect before reshaping.
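For the custom generator the question asks about, a plain Python generator that yields NumPy batches is one option if writing every image to disk is not desirable. The sketch below is only an illustration under assumptions: pixels is a 2-D uint8 array with one flattened 137x236 image per row (for example one file's worth of rows), targets holds the matching label rows, and the augmentation is a crude random shift with wrap-around via np.roll, standing in for whatever augmentation suits the data. The name batch_generator and its parameters are hypothetical.

```python
import numpy as np

HEIGHT, WIDTH = 137, 236  # image size stated in the question


def batch_generator(pixels, targets, batch_size=32, augment=True, seed=0):
    """Yield (images, labels) batches indefinitely, reshuffling each pass."""
    rng = np.random.default_rng(seed)
    n = len(pixels)
    while True:
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Reshape flat rows to (batch, height, width, channels) and
            # scale to [0, 1]; assumes 8-bit pixel values.
            images = pixels[idx].reshape(-1, HEIGHT, WIDTH, 1)
            images = images.astype(np.float32) / 255.0
            if augment:
                # Placeholder augmentation: shift the whole batch by up
                # to 4 pixels in each direction (edges wrap around).
                dy, dx = rng.integers(-4, 5, size=2)
                images = np.roll(images, shift=(int(dy), int(dx)), axis=(1, 2))
            yield images, targets[idx]
```

Because the generator never stops, Keras' model.fit() would need steps_per_epoch set, for example to len(pixels) // batch_size.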