Custom ImageDataGenerator() for half a million images where labels and pixels are in 2 separate DataFrames using Keras (or any other library) [closed]

I have 2 separate DataFrames containing information for around half a million images, totalling 6+ GB. There are 4 .parquet files which I had to pd.concat() one by one into a new DataFrame imgs, which holds the 137*236 pixel values per image (one column per pixel, named 0 through 32331) plus an image_id column.

imgs
>>
          image_id     0      1  ...  32330  32331

0       Train_50210  246    253  ...    251    250   
1       Train_50211  250    245  ...    241    244
...                              ...
...                              ...
...                              ...
453651  Train_50210    0    253  ...    251    250   
453652  Train_50211  250    245  ...    241    244  

The second CSV contains the images' labels: the values of the three classes that each image belongs to. I imported the CSV into train.

train
>>

            image_id      class_1   class_2  class_3    

0            Train_5           15         9        5    
1            Train_1          159         0        0
...
...
...
453651  Train_342524             0       15       34
453652    Train_9534            18        0        7

The number of rows in train equals the number of rows in imgs: the Y-labels of the images are stored in train and the corresponding pixel values are in imgs.

I tried merging the two DataFrames using pd.merge(imgs, train, on='image_id').drop(columns='image_id'), but it took a long time and my kernel died every time while processing the above 2 steps. Please suggest an alternative approach if there is one.

Could somebody please tell me how to make a custom data generator for

1. producing batches of images
2. augmenting images for robustness

using Keras or any other library for fast processing.

Alternatively, could someone please tell me how to use ImageDataGenerator().flow() in my case?


This is what I would suggest: load the DataFrame piece by piece rather than all at once. Loading it in full can exceed your RAM, which is likely why your kernel keeps dying.
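A minimal sketch of that idea: read one parquet file at a time and attach the labels to each chunk via an indexed lookup, instead of concatenating everything and running one giant pd.merge. The file names here are assumptions, adjust them to your actual paths.

```python
import pandas as pd

# Hypothetical parquet paths -- replace with your real file names.
PARQUET_FILES = [f"train_image_data_{i}.parquet" for i in range(4)]

def process_chunk(chunk, labels):
    """Attach labels to one chunk of pixel rows.

    `labels` is the train DataFrame indexed by image_id, so this is a
    cheap index lookup rather than a full merge over all 453k rows.
    """
    return chunk.join(labels, on="image_id")

def iterate_parquets(paths, labels):
    """Yield one labelled chunk at a time; only one file is in RAM at once."""
    for path in paths:
        chunk = pd.read_parquet(path)
        yield process_chunk(chunk, labels)
        del chunk  # free memory before loading the next file

# Usage (sketch):
# labels = train.set_index("image_id")
# for labelled in iterate_parquets(PARQUET_FILES, labels):
#     ...  # save images / build arrays for this chunk, then move on
```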

Then iterate through the DataFrame row by row, take the 32332 pixel columns, reshape them into a 137x236 image, and save it to disk with an appropriate name into the folder train_data/class_number/. You can then use Keras' ImageDataGenerator().flow_from_directory().
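A sketch of the reshape-and-save step. It assumes Pillow is installed for writing PNGs, uses class_1 as the label for the folder layout, and names each file after its image_id; all of those are choices you can swap out.

```python
import numpy as np
from pathlib import Path

HEIGHT, WIDTH = 137, 236  # 137 * 236 = 32332 pixel columns

def row_to_image(pixels):
    """Turn one row of 32332 pixel values into a 137x236 uint8 array."""
    return np.asarray(pixels, dtype=np.uint8).reshape(HEIGHT, WIDTH)

def save_images(imgs_df, train_df, out_dir="train_data", label_col="class_1"):
    """Write each row as <out_dir>/<class>/<image_id>.png so that
    ImageDataGenerator().flow_from_directory() can pick the files up."""
    from PIL import Image  # lazy import: row_to_image works without Pillow
    labels = train_df.set_index("image_id")[label_col]
    for _, row in imgs_df.iterrows():
        image_id = row["image_id"]
        arr = row_to_image(row.drop("image_id"))
        class_dir = Path(out_dir) / str(labels[image_id])
        class_dir.mkdir(parents=True, exist_ok=True)
        Image.fromarray(arr).save(class_dir / f"{image_id}.png")
```

Once the folders exist, something like `ImageDataGenerator(rescale=1/255.).flow_from_directory("train_data", target_size=(137, 236), color_mode="grayscale")` will stream augmented batches from disk. Note this layout only handles one label column per image; with three independent classes per image you would need one directory tree per class column, or a custom generator instead.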

Also, the 32332 pixel columns check out: for a single-channel 137x236 image, 137*236 = 32332, which exactly matches columns 0 through 32331. So every column is accounted for and each row really is one flattened grayscale image.
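Since the columns reshape cleanly, and to address the ImageDataGenerator().flow() part of the question: if a chunk fits in memory, you can reshape it to (N, 137, 236, 1) and feed it to .flow() directly, skipping the disk round-trip. A sketch, with the augmentation parameters as illustrative assumptions:

```python
import numpy as np
import pandas as pd

def to_batches_array(imgs_df):
    """Reshape the pixel frame into (N, 137, 236, 1) floats in [0, 1]."""
    x = imgs_df.drop(columns="image_id").to_numpy(dtype=np.float32)
    return x.reshape(-1, 137, 236, 1) / 255.0

def make_generator(x, y, batch_size=128):
    # Lazy import so to_batches_array works without TensorFlow installed;
    # depending on your install this may be keras.preprocessing.image instead.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    datagen = ImageDataGenerator(rotation_range=10, zoom_range=0.1,
                                 width_shift_range=0.1, height_shift_range=0.1)
    return datagen.flow(x, y, batch_size=batch_size)

# Usage (sketch):
# x = to_batches_array(chunk)
# y = chunk_labels[["class_1", "class_2", "class_3"]].to_numpy()
# model.fit(make_generator(x, y), ...)
```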