Classifying Fruits
Looking at images of fruits and fine-tuning machine vision models (ResNets and more) to recognize the type of fruit.
Classifying Fruits¶
This notebook is a quick introduction to fine-tuning a machine vision model for classifying fruits.
First, let us download a dataset from Kaggle.
# # This is being run on Paperspace, and setting up environments is a bit painful, hence:
# !pip install --upgrade huggingface_hub
# !pip install kagglehub
# !pip install timm
# !pip install pynvml
import kagglehub
import timm
path = kagglehub.dataset_download("icebearogo/fruit-classification-dataset")
print("Path to dataset files:", path)
Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.8), please consider upgrading to the latest version (0.3.12). Path to dataset files: /Users/ignacybartnik/.cache/kagglehub/datasets/icebearogo/fruit-classification-dataset/versions/1
And now, let us move the data into this repo
from tqdm import tqdm
import shutil
import os

destination_dir = os.getcwd()
all_files = []
for root, _, files in os.walk(path):
    for file in files:
        src = os.path.join(root, file)
        rel = os.path.relpath(root, path)
        dst = os.path.join(destination_dir, rel, file)
        all_files.append((src, dst))

for src_path, dest_path in tqdm(all_files, desc="Copying dataset files"):
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    if not os.path.exists(dest_path):
        shutil.copy2(src_path, dest_path)
Copying dataset files: 100%|██████████| 50006/50006 [00:12<00:00, 3997.34it/s]
Creating a Labeled Test Set from the Training Data¶
The original test set does not have ground truth labels. To evaluate the model, we will:
- Randomly select a subset of the training set.
- Remove these samples from the training set.
- Move the corresponding image files to a new folder called good_test.
- Create a new CSV file good_test.csv with the image paths and labels.
import pandas as pd
import numpy as np
from pathlib import Path
import shutil
import os
# Set random seed for reproducibility
np.random.seed(42)
# Paths
data_path = Path(os.getcwd())/'Fruit_dataset'
train_csv = data_path/'train.csv'
good_test_dir = data_path/'good_test'
good_test_csv = data_path/'good_test.csv'
# Parameters
n_good_test = 200 # Number of samples to move to good_test
# Load train.csv
train_df = pd.read_csv(train_csv)
# Randomly select samples for good_test
good_test_df = train_df.sample(n=n_good_test, random_state=42)
# Remove selected samples from train_df
remaining_train_df = train_df.drop(good_test_df.index)
# Save updated train.csv
remaining_train_df.to_csv(train_csv, index=False)
# Create good_test directory if it doesn't exist
os.makedirs(good_test_dir, exist_ok=True)
# Move files and update paths in good_test_df
def move_and_update(row):
    src = data_path/row['image:FILE']
    dst = good_test_dir/src.name
    shutil.move(str(src), str(dst))
    return dst.relative_to(data_path).as_posix()
good_test_df['image:FILE'] = good_test_df.apply(move_and_update, axis=1)
# Save good_test.csv
good_test_df.to_csv(good_test_csv, index=False)
print(f"Moved {n_good_test} images to {good_test_dir} and created {good_test_csv}")
import pandas as pd
import numpy as np
from pathlib import Path
import shutil
import os
import fastai.vision.all as fva
from pathlib import Path
import pandas as pd
import os
fva.set_seed(42)
path = Path(os.getcwd())
path = path/'Fruit_dataset'
good_test_df = pd.read_csv(path/'good_test.csv')
# Load class names from classname.txt, strip whitespace, and build mapping
with open(path/'classname.txt', 'r') as f:
    classnames = [line.strip() for line in f.readlines()]
category_to_class = dict(enumerate(classnames))
good_test_df['label'] = good_test_df['category'].map(category_to_class)
Now, let us import some functions for handling this data, and also manage some paths.
import fastai.vision.all as fva
from pathlib import Path
import pandas as pd
import os
fva.set_seed(42)
path = Path(os.getcwd())
path = path/'Fruit_dataset'
Looking at the data¶
The data has been very helpfully separated into train, validation and test sets.
Let's take a look at some of the training images.
train_df = pd.read_csv(path/'train.csv')
train_df.head()
| | image:FILE | category |
|---|---|---|
| 0 | train/oil_palm/57.jpg | 0 |
| 1 | train/oil_palm/881.jpg | 0 |
| 2 | train/oil_palm/450.jpg | 0 |
| 3 | train/oil_palm/28.jpg | 0 |
| 4 | train/oil_palm/62.jpg | 0 |
img = fva.PILImage.create(path/train_df['image:FILE'].iloc[3243])
print(img.size)
img.to_thumb(128)
(390, 260)
Let's check the sizes of these images.
import fastcore.parallel as fp
def f(o):
    try:
        return fva.PILImage.create(o).size
    except Exception:
        return None
files = [path/f for f in train_df["image:FILE"].tolist()]
sizes = fp.parallel(f, files, n_workers=8, progress=True)
pd.Series(sizes).value_counts()
(390, 260) 3999
(225, 225) 1271
(275, 183) 1039
(300, 300) 969
(259, 194) 892
...
(370, 136) 1
(268, 184) 1
(191, 161) 1
(314, 179) 1
(143, 201) 1
Length: 5917, dtype: int64
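The spread of sizes matters mostly through aspect ratio: squishing distorts non-square images, while cropping discards their borders. As a quick sketch of how one might summarize the ratios, using a hypothetical handful of the (width, height) pairs from the tally above:

```python
import pandas as pd

# a hypothetical sample of (width, height) pairs from the tally above
sizes = [(390, 260), (225, 225), (275, 183), (300, 300), (259, 194)]
ratios = pd.Series([w / h for (w, h) in sizes], name="aspect_ratio")

# the spread of ratios informs the squish-vs-crop choice later on
print(ratios.describe())
```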
dls = fva.ImageDataLoaders.from_folder(path, train='train', valid='val',
item_tfms=fva.Resize(390, method='squish'),
batch_tfms=fva.aug_transforms(size=128, min_scale=0.75))
dls.show_batch(max_n=6)
import huggingface_hub
print(huggingface_hub.__version__)
0.33.2
learn = fva.vision_learner(dls, 'resnet26d', metrics=fva.error_rate, path='.').to_fp16()
learn.lr_find(suggest_funcs=(fva.valley, fva.slide))
SuggestedLRs(valley=0.002511886414140463, slide=0.0020892962347716093)
The learning rates suggested by the valley and slide methods tend to be conservative; let's try something slightly larger as an initial attempt.
learn.fine_tune(1, 0.03)
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 2.192565 | 1.812227 | 0.481600 | 01:31 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.128546 | 0.881743 | 0.262200 | 01:35 |
Evaluate Model on Test Set¶
We will now evaluate the model on the test set. This will generate a CSV file containing all test images (in alphabetical order) and the predicted label for each one.
# Get test image files in alphabetical order
test_dir = path/'good_test'
tst_files = fva.get_image_files(test_dir).sorted()
# Create test dataloader
tst_dl = dls.test_dl(tst_files)
# Get predictions
probs, _, idxs = learn.get_preds(dl=tst_dl, with_decoded=True)
# Map predicted indices to class names
mapping = dict(enumerate(dls.vocab))
pred_labels = pd.Series(idxs.numpy(), name="idxs").map(mapping)
# Create DataFrame with filenames and predicted labels
pred_df = pd.DataFrame({
'image:FILE': [f.parent.name + '/' + f.name for f in tst_files],
'predicted_label': pred_labels
})
# Save to CSV
pred_df.to_csv('test_predictions.csv', index=False)
print('Saved test_predictions.csv with predicted labels.')
Saved test_predictions.csv with predicted labels.
# Merge on image name
merged = pd.merge(good_test_df, pred_df, on='image:FILE', how='inner')
# Compare predicted and true labels
correct = merged['label'] == merged['predicted_label']
accuracy = correct.mean()
print(f'Accuracy on good_test set: {accuracy:.4f} ({correct.sum()}/{len(correct)})')
Accuracy on good_test set: 0.0050 (1/200)
Fast iterating¶
Okay, we got an accuracy of 73.5% after roughly 7.5 minutes of training, with very little code for the actual ML part of the exercise. That's pretty good for the time dedicated, but let's see if we can do slightly better.
We could always train for longer, but consider that resnet26d, although good, is from 2019, and there has been plenty of development since then.
First let us make some functions to reduce repeating code:
def train(arch, item, batch, lr=0.03, epochs=5):
    dls = fva.ImageDataLoaders.from_folder(path, train='train', valid='val',
                                           item_tfms=item, batch_tfms=batch)
    learn = fva.vision_learner(dls, arch, metrics=fva.error_rate, path='.').to_fp16()
    learn.fine_tune(epochs, lr)
    return learn
def evaluate(learn, tta=False):
    # note: uses the tst_dl, tst_files and good_test_df defined earlier
    if tta:
        probs, _ = learn.tta(dl=tst_dl)
        idxs = probs.argmax(dim=1)
    else:
        probs, _, idxs = learn.get_preds(dl=tst_dl, with_decoded=True)
    mapping = dict(enumerate(learn.dls.vocab))
    pred_labels = pd.Series(idxs.numpy(), name="idxs").map(mapping)
    pred_df = pd.DataFrame({
        'image:FILE': [f.parent.name + '/' + f.name for f in tst_files],
        'predicted_label': pred_labels
    })
    merged = pd.merge(good_test_df, pred_df, on='image:FILE', how='inner')
    correct = merged['label'] == merged['predicted_label']
    accuracy = correct.mean()
    print(f'Accuracy on good_test set: {accuracy:.4f} ({correct.sum()}/{len(correct)})')
Let's see if initially resizing the images to a smaller size makes training faster, and how much accuracy is lost.
learn = train('resnet26d', item=fva.Resize(192, method='squish'),
batch=fva.aug_transforms(size=128, min_scale=0.75))
evaluate(learn)
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 2.303541 | 1.851372 | 0.503400 | 01:06 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.401922 | 1.160912 | 0.333000 | 01:08 |
| 1 | 1.206511 | 1.046775 | 0.293800 | 01:09 |
| 2 | 0.923521 | 0.846539 | 0.250800 | 01:05 |
| 3 | 0.602255 | 0.694226 | 0.208200 | 01:07 |
| 4 | 0.453858 | 0.678835 | 0.194800 | 01:10 |
Accuracy on good_test set: 0.7250 (145/200)
learn = train('resnet26d', item=fva.Resize(192, method='squish'),
batch=fva.aug_transforms(size=128, min_scale=0.75))
evaluate(learn)
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 2.222486 | 1.872126 | 0.494600 | 01:23 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.369784 | 1.179270 | 0.337400 | 01:21 |
| 1 | 1.218235 | 1.054768 | 0.299800 | 01:16 |
| 2 | 0.917864 | 0.869241 | 0.251800 | 01:23 |
| 3 | 0.607684 | 0.724099 | 0.209800 | 01:21 |
| 4 | 0.477654 | 0.689519 | 0.200400 | 01:16 |
Accuracy on good_test set: 0.7150 (143/200)
evaluate(learn)
Accuracy on good_test set: 0.7250 (145/200)
Okay, so that is faster, without impacting the error rate too much. During this training, CPU utilization displayed 500%, while GPU usage peaked at 70% and generally sat around 50%. This suggests that we can move to a larger model, so that we are less CPU bound. There is a helpful, although probably not quite up to date, summary of models; let's try a convnext model.
learn = train('convnext_tiny.fb_in22k', item=fva.Resize(192, method='squish'),
batch=fva.aug_transforms(size=128, min_scale=0.75), epochs=3)
evaluate(learn)
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.368587 | 1.028263 | 0.271000 | 01:07 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.985354 | 0.777692 | 0.234200 | 01:13 |
| 1 | 0.645591 | 0.553403 | 0.174000 | 01:15 |
| 2 | 0.347420 | 0.463451 | 0.146400 | 01:14 |
Accuracy on good_test set: 0.7800 (156/200)
learn = train('convnext_tiny.fb_in22k', item=fva.Resize(192, method='squish'),
batch=fva.aug_transforms(size=128, min_scale=0.75), epochs=1)
evaluate(learn)
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.297702 | 1.045727 | 0.276600 | 01:23 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.662976 | 0.538108 | 0.166200 | 01:26 |
Accuracy on good_test set: 0.7300 (146/200)
Cropping instead of squishing¶
Cool, we've retained our accuracy while reducing the number of epochs. Now let's explore some of the preprocessing options. Right now we are squishing our images down; what happens if we crop them instead?
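To make the difference concrete, here is a minimal numpy sketch (not the fastai implementation) of the two strategies: squishing resamples each axis independently and so distorts the aspect ratio, while center-cropping first cuts a square and so preserves it:

```python
import numpy as np

def squish(img, size):
    # nearest-neighbour resize of each axis independently -> distorts aspect ratio
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def center_crop(img, size):
    # cut the largest centred square first, then resize -> preserves aspect ratio
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    return squish(img[top:top + s, left:left + s], size)

img = np.arange(390 * 260).reshape(260, 390)  # a 390x260 "image" like the common size above
print(squish(img, 128).shape, center_crop(img, 128).shape)  # both (128, 128)
```

Squish keeps every pixel but warps shapes; crop keeps shapes but throws away the borders, which for centred fruit photos is often harmless.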
learn = train('convnext_tiny.fb_in22k', item=fva.Resize(192),
batch=fva.aug_transforms(size=128, min_scale=0.75), epochs=3)
evaluate(learn)
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.308255 | 1.108780 | 0.291600 | 01:05 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.934579 | 0.736993 | 0.215600 | 01:13 |
| 1 | 0.637754 | 0.530580 | 0.162200 | 01:11 |
| 2 | 0.376314 | 0.455371 | 0.141600 | 01:18 |
Accuracy on good_test set: 0.8000 (160/200)
That is slightly better. Cool.
What about padding¶
Let's also try padding our images into rectangles instead of the other two methods.
learn = train('convnext_tiny.fb_in22k', item=fva.Resize((256,192), method=fva.ResizeMethod.Pad, pad_mode=fva.PadMode.Zeros),
batch=fva.aug_transforms(size=(171,128), min_scale=0.75), epochs=3)
evaluate(learn)
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.339136 | 0.957432 | 0.259800 | 01:23 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.019285 | 0.740192 | 0.217600 | 01:35 |
| 1 | 0.648015 | 0.558406 | 0.171200 | 01:36 |
| 2 | 0.398889 | 0.459359 | 0.141800 | 01:32 |
Accuracy on good_test set: 0.7750 (155/200)
That is slightly worse than cropping. Let us go back to cropping in that case.
Back to cropping¶
learn = train('convnext_tiny.fb_in22k', item=fva.Resize(192, method='crop'),
batch=fva.aug_transforms(size=128, min_scale=0.75), epochs=3)
evaluate(learn)
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.251283 | 0.946261 | 0.255600 | 01:21 |
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.918570 | 0.782189 | 0.226200 | 01:30 |
| 1 | 0.655618 | 0.528143 | 0.164000 | 01:30 |
| 2 | 0.373048 | 0.468195 | 0.147200 | 01:29 |
Accuracy on good_test set: 0.7700 (154/200)
TTA¶
We can also use test-time augmentation (TTA) for a small boost in performance.
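TTA runs the model over several augmented copies of each test image and averages the predicted probabilities before taking the argmax. A toy numpy illustration with made-up probabilities (3 augmented passes, 2 images, 4 classes):

```python
import numpy as np

# made-up class probabilities: (n_augmentations, n_images, n_classes)
probs = np.array([
    [[0.6, 0.2, 0.1, 0.1], [0.3, 0.4, 0.2, 0.1]],
    [[0.5, 0.3, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1]],
    [[0.4, 0.4, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1]],
])
mean_probs = probs.mean(axis=0)    # average over the augmented passes
preds = mean_probs.argmax(axis=1)  # one prediction per image
print(preds)  # [0 1]
```

Averaging smooths out augmentation-sensitive mistakes made on any single view, which is where the small accuracy boost comes from.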
evaluate(learn,True)
Accuracy on good_test set: 0.7850 (157/200)
Larger models¶
Now let's try some more models. These models are larger, so we will implement gradient accumulation, which lets us reduce GPU memory usage. We also need to drop the learning rate, otherwise training becomes unstable and can start producing NaNs.
Let's look at memory usage for different models. First, let us define a function that shows GPU memory usage and also clears the cache.
import gc
import torch
def report_gpu():
    print(torch.cuda.list_gpu_processes())
    gc.collect()
    torch.cuda.empty_cache()
report_gpu()
GPU:0 process 2080424 uses 2618.000 MB GPU memory
Training with accumulation¶
We can use gradient accumulation to reduce the amount of GPU memory needed to train a model. Let's define a new function that allows us to do this. This way we can load some larger versions of the convnext model we were using earlier, which will hopefully give us better performance.
I'm also adding a learners list, which lets us access the trained model after the cell has executed.
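The idea behind gradient accumulation: suitably scaled micro-batch gradients sum to the full-batch gradient, so we can process small batches (less memory) while stepping as if we had a large one. A minimal numpy check on a linear least-squares model (a sketch of the principle, not fastai's implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(64, 5))   # a full batch of 64 samples
y = rng.normal(size=64)
w = rng.normal(size=5)

def grad(Xb, yb, w):
    # gradient of mean squared error for a linear model
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)  # one big 64-sample batch

accum = np.zeros_like(w)
micro_bs = 16
for i in range(0, 64, micro_bs):
    # each micro-batch gradient is weighted by its share of the full batch
    accum += grad(X[i:i + micro_bs], y[i:i + micro_bs], w) * (micro_bs / 64)

print(np.allclose(full, accum))  # True
```

This is why, below, the batch size is set to 64//accum while fastai's GradientAccumulation(64) callback only steps once 64 samples' worth of gradients have been accumulated: the effective batch size stays 64.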
accum = 4
finetune = False
epochs = 1
lr = 0.001
learners = []

def train_w_accumulation(arch, path, size, learners, epochs=5, accum=1, finetune=True, tta=False, tst_files=None, lr=0.001):
    dls = fva.ImageDataLoaders.from_folder(
        path, train='train', valid='val',
        item_tfms=fva.Resize(390, method='crop'),
        batch_tfms=fva.aug_transforms(size=size, min_scale=0.75),
        bs=64//accum
    )
    cbs = [fva.GradientAccumulation(64)] if accum > 1 else []
    learn = fva.vision_learner(dls, arch, metrics=fva.error_rate, cbs=cbs, path=path).to_fp16()
    if finetune:
        learn.fine_tune(epochs, lr)
    else:
        learn.unfreeze()
        learn.fit_one_cycle(epochs, lr)
    learners.append(learn)
    if tta and tst_files:
        return learn, learn.tta(dl=dls.test_dl(tst_files))
    else:
        return learn
report_gpu()
GPU:0 process 2080424 uses 2618.000 MB GPU memory
Now let's vary the accum parameter until we no longer get an out-of-memory error from the GPU. For this model, on this GPU, 4 seems to work.
train_w_accumulation('convnext_large_in22k', path, size=224, learners=learners, epochs=1, accum=4, finetune=False, lr=0.03)
report_gpu()
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.283809 | 0.720400 | 0.209200 | 08:29 |
GPU:0 process 2312746 uses 13052.000 MB GPU memory
Notice that we got a pretty low error rate given we only trained for one epoch. This is good, but it could not be reproduced: running such a large model at this high a learning rate causes instability, and we got NaN outputs on every subsequent run with the same settings.
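This kind of instability is easy to reproduce on a toy problem: for gradient descent on f(x) = x², any learning rate above 1 makes the iterates grow without bound, until floating point overflows and the updates turn into inf/NaN. A minimal sketch:

```python
import numpy as np

def run_gd(lr, steps=2000):
    # plain gradient descent on f(x) = x**2, whose gradient is 2x
    x = np.float64(1.0)
    with np.errstate(over='ignore', invalid='ignore'):
        for _ in range(steps):
            x = x - lr * 2 * x
    return x

small = run_gd(0.1)  # |x| shrinks every step, converges towards 0
big = run_gd(1.5)    # |x| doubles every step, eventually overflows
print(small, big)
```

Real networks are not quadratics, but the same mechanism (updates that amplify rather than shrink the parameters) is what produces the NaN losses seen here.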
Best model so far¶
Let us now train a large version of the convnext model, with cropping, and use TTA for inference. This should give us the best result so far. As mentioned above, we have to lower the learning rate for training to be stable, but hopefully with more epochs we can still get a good result.
learners=[]
train_w_accumulation('convnext_large.fb_in22k', path, (320,240), learners, epochs=5, accum=4, finetune=False, lr=0.001)
report_gpu()
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.378413 | 0.962394 | 0.270000 | 11:29 |
| 1 | 1.021597 | 0.864069 | 0.249600 | 11:29 |
| 2 | 0.605416 | 0.599909 | 0.181200 | 11:27 |
| 3 | 0.362635 | 0.462564 | 0.143800 | 11:28 |
| 4 | 0.241691 | 0.422189 | 0.126800 | 11:29 |
GPU:0 process 2080424 uses 10274.000 MB GPU memory
learner_conv_large = learners[0]
evaluate(learner_conv_large,True)
Accuracy on good_test set: 0.6950 (139/200)
Well, that is the worst result we've had so far... It looks like the model overfit the data we gave it.
We also had to utilize 100% of the available compute, and while making use of available resources is good, needing more compute overall is not a positive.
One idea for expansion would be to use a model ensemble, train many different models, with different architectures, and then use the mean output.
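A sketch of that ensembling idea: given each model's predicted probabilities (all sharing the same class ordering), average them and take the argmax. The arrays here are made up for illustration:

```python
import numpy as np

# made-up probabilities from two models, for 2 images over 3 classes;
# assumes both models share the same class ordering (vocab)
model_a = np.array([[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]])
model_b = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])

ensemble = np.mean([model_a, model_b], axis=0)  # mean output across models
preds = ensemble.argmax(axis=1)
print(preds)  # [0 2]
```

On the second image the more confident model wins the average, which is the behaviour that makes mean-of-probabilities ensembles work; diverse architectures tend to make uncorrelated mistakes.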
We could add dropout to prevent this overfitting, though it is not necessarily overfitting; more tests would have to be run.
Another idea would be to separate the fruits into different categories, like citrus etc., and then use expert models to classify the fruits further down the line. This way, each model could be more specialized and, hopefully, perform better.
But since this was just a short exploration of working with machine vision classification models, it is good enough for now.
train_w_accumulation('vit_large_patch16_224', path, 224, learners, epochs=1, accum=4, finetune=False)
report_gpu()
| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 2.432510 | 1.784853 | 0.488200 | 11:28 |
GPU:0 process 2080424 uses 14830.000 MB GPU memory
That doesn't look very promising either, even though it is a different architecture. I would have to investigate further to find better improvements.