Analyzing Hugging Face Datasets
This notebook shows how you can use fastdup to analyze any dataset from Hugging Face Datasets.
We will analyze an image classification dataset for:
- Duplicates / near-duplicates
- Outliers
- Wrong labels
Installation
import sys
if "google.colab" in sys.modules:
    # Running in Google Colab
    !pip install --force-reinstall --no-cache-dir numpy==1.26.4 scipy fastdup datasets
else:
    # Running outside Colab
    !pip install -Uq fastdup datasets
!pip install -Uq pillow
Now, test the installation. If there's no error message, we are ready to go.
import fastdup
fastdup.__version__
'2.0.21'
Load Dataset
In this example, we load the Tiny ImageNet dataset from Hugging Face Datasets.
The Tiny ImageNet training split contains 100,000 images across 200 classes (500 per class), downsized to 64×64 color images. In total, each class has 500 training images, 50 validation images, and 50 test images.
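As a quick sanity check, the per-class counts quoted above can be multiplied out to get the size of each split. This is plain arithmetic on those figures, not a call into fastdup or the dataset itself.

```python
# Split sizes implied by the per-class counts quoted above.
NUM_CLASSES = 200
PER_CLASS = {"train": 500, "validation": 50, "test": 50}

split_sizes = {split: NUM_CLASSES * n for split, n in PER_CLASS.items()}
total_images = sum(split_sizes.values())

print(split_sizes)   # {'train': 100000, 'validation': 10000, 'test': 10000}
print(total_images)  # 120000
```

The 100,000 figure therefore corresponds to the training split, which is the split loaded in this notebook.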
Let's load the dataset into our local directory.
from fastdup.datasets import FastdupHFDataset
dataset = FastdupHFDataset("zh-plus/tiny-imagenet", split="train")
We can inspect the dataset object.
dataset
Dataset({
    features: ['image', 'label'],
    num_rows: 100000
})
dataset[0]
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64>,
'label': 0}
dataset[0]['image']
dataset[0]['label']
0
dataset.annotations
| | filename | label |
|---|---|---|
| 0 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/142/71384.jpg | 142 |
| 1 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/142/71204.jpg | 142 |
| 2 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/142/71036.jpg | 142 |
| 3 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/142/71014.jpg | 142 |
| 4 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/142/71334.jpg | 142 |
| ... | ... | ... |
| 99995 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/127/63864.jpg | 127 |
| 99996 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/127/63822.jpg | 127 |
| 99997 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/127/63874.jpg | 127 |
| 99998 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/127/63824.jpg | 127 |
| 99999 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/127/63752.jpg | 127 |
100000 rows × 2 columns
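In the rows shown above, each label matches the name of the image's parent folder. A minimal sketch of how a `filename`/`label` pair could be rebuilt from a path alone, assuming that folder layout holds for every file; `label_from_path` is a hypothetical helper for illustration and not part of fastdup.

```python
from pathlib import PurePosixPath


def label_from_path(filename: str) -> int:
    """Return the class label encoded in the parent folder name (hypothetical helper)."""
    return int(PurePosixPath(filename).parent.name)


# Two paths copied from the annotations table above.
filenames = [
    "/home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/142/71384.jpg",
    "/home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/127/63752.jpg",
]

# Same two columns as the annotations table: filename and label.
annotations = [{"filename": f, "label": label_from_path(f)} for f in filenames]
for row in annotations:
    print(row["label"], row["filename"])
```

A list of dictionaries like this can be handed to `pandas.DataFrame` if you need a table with the same two columns for your own image folder.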
Run fastdup
fd = fastdup.create(input_dir=dataset.img_dir)
fd.run(annotations=dataset.annotations)
Inspect Issues
There are several methods we can use to inspect the issues found:
fd.vis.duplicates_gallery() # create a visual gallery of duplicates
fd.vis.outliers_gallery() # create a visual gallery of anomalies
fd.vis.component_gallery() # create a visualization of connected components
fd.vis.stats_gallery() # create a visualization of images statistics (e.g. blur)
fd.vis.similarity_gallery() # create a gallery of similar images
fd.vis.duplicates_gallery()
/home/dnth/anaconda3/envs/fastdup2021/lib/python3.10/site-packages/fastdup/galleries.py:102: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[out_col] = df[in_col].apply(lambda x: get_label_func.get(x, MISSING_LABEL))
Generating gallery: 0%| | 0/20 [00:00<?, ?it/s]
Stored similarity visual view in work_dir/galleries/duplicates.html ######################################################################################## Would you like to see awesome visualizations for some of the most popular academic datasets? Click here to see and learn more: https://app.visual-layer.com/vl-datasets?utm_source=fastdup ########################################################################################
Duplicates Report
| Info | |
|---|---|
| Distance | 1.0 |
| From | /67/33675.jpg |
| To | /125/62847.jpg |
| From_Label | 67 |
| To_Label | 125 |
| Info | |
|---|---|
| Distance | 1.0 |
| From | /199/99746.jpg |
| To | /177/88551.jpg |
| From_Label | 199 |
| To_Label | 177 |
| Info | |
|---|---|
| Distance | 1.0 |
| From | /190/95277.jpg |
| To | /13/6631.jpg |
| From_Label | 190 |
| To_Label | 13 |
| Info | |
|---|---|
| Distance | 1.0 |
| From | /141/70895.jpg |
| To | /8/4204.jpg |
| From_Label | 141 |
| To_Label | 8 |
| Info | |
|---|---|
| Distance | 1.0 |
| From | /14/7463.jpg |
| To | /198/99073.jpg |
| From_Label | 14 |
| To_Label | 198 |
| Info | |
|---|---|
| Distance | 1.0 |
| From | /35/17643.jpg |
| To | /37/18797.jpg |
| From_Label | 35 |
| To_Label | 37 |
| Info | |
|---|---|
| Distance | 1.0 |
| From | /180/90258.jpg |
| To | /174/87495.jpg |
| From_Label | 180 |
| To_Label | 174 |
| Info | |
|---|---|
| Distance | 1.0 |
| From | /67/33640.jpg |
| To | /125/62558.jpg |
| From_Label | 67 |
| To_Label | 125 |
| Info | |
|---|---|
| Distance | 1.0 |
| From | /138/69225.jpg |
| To | /102/51355.jpg |
| From_Label | 138 |
| To_Label | 102 |
| Info | |
|---|---|
| Distance | 1.0 |
| From | /67/33973.jpg |
| To | /125/62815.jpg |
| From_Label | 67 |
| To_Label | 125 |
| Info | |
|---|---|
| Distance | 1.0 |
| From | /17/8525.jpg |
| To | /16/8111.jpg |
| From_Label | 17 |
| To_Label | 16 |
0
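Every pair in the report above has a distance of 1.0 yet carries two different labels, so at least one label in each pair deserves a second look. The sketch below tallies those conflicts by class pair, using rows transcribed from the report. The tuple layout and the `THRESHOLD` value are assumptions for illustration rather than fastdup's API, and the code assumes that a higher distance value means more similar images, as the report suggests.

```python
from collections import Counter

# (from_path, to_path, from_label, to_label, distance), transcribed from the report above.
pairs = [
    ("/67/33675.jpg", "/125/62847.jpg", 67, 125, 1.0),
    ("/199/99746.jpg", "/177/88551.jpg", 199, 177, 1.0),
    ("/190/95277.jpg", "/13/6631.jpg", 190, 13, 1.0),
    ("/141/70895.jpg", "/8/4204.jpg", 141, 8, 1.0),
    ("/14/7463.jpg", "/198/99073.jpg", 14, 198, 1.0),
    ("/35/17643.jpg", "/37/18797.jpg", 35, 37, 1.0),
    ("/180/90258.jpg", "/174/87495.jpg", 180, 174, 1.0),
    ("/67/33640.jpg", "/125/62558.jpg", 67, 125, 1.0),
    ("/138/69225.jpg", "/102/51355.jpg", 138, 102, 1.0),
    ("/67/33973.jpg", "/125/62815.jpg", 67, 125, 1.0),
    ("/17/8525.jpg", "/16/8111.jpg", 17, 16, 1.0),
]

THRESHOLD = 0.99  # assumed cut-off for "near-identical"; tune for your data

# Count duplicate pairs whose two labels disagree, keyed by the sorted label pair.
conflicts = Counter(
    tuple(sorted((from_label, to_label)))
    for _, _, from_label, to_label, distance in pairs
    if distance >= THRESHOLD and from_label != to_label
)

for labels, count in conflicts.most_common(3):
    print(labels, count)
```

Class pairs that show up repeatedly, such as 67 and 125 here, are natural candidates for a manual review of both classes.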
fd.vis.outliers_gallery()
Generating gallery: 0%| | 0/20 [00:00<?, ?it/s]
Stored outliers visual view in work_dir/galleries/outliers.html
Outliers Report
Showing image outliers, one per row
| Info | |
|---|---|
| Distance | 0.600712 |
| Path | /198/99254.jpg |
| label | 198 |
| Info | |
|---|---|
| Distance | 0.639867 |
| Path | /12/6152.jpg |
| label | 12 |
| Info | |
|---|---|
| Distance | 0.642672 |
| Path | /94/47232.jpg |
| label | 94 |
| Info | |
|---|---|
| Distance | 0.654982 |
| Path | /35/17626.jpg |
| label | 35 |
| Info | |
|---|---|
| Distance | 0.663625 |
| Path | /10/5240.jpg |
| label | 10 |
| Info | |
|---|---|
| Distance | 0.665014 |
| Path | /173/86745.jpg |
| label | 173 |
| Info | |
|---|---|
| Distance | 0.666785 |
| Path | /197/98818.jpg |
| label | 197 |
| Info | |
|---|---|
| Distance | 0.668334 |
| Path | /54/27267.jpg |
| label | 54 |
| Info | |
|---|---|
| Distance | 0.668349 |
| Path | /78/39235.jpg |
| label | 78 |
| Info | |
|---|---|
| Distance | 0.668735 |
| Path | /196/98461.jpg |
| label | 196 |
| Info | |
|---|---|
| Distance | 0.66936 |
| Path | /54/27129.jpg |
| label | 54 |
| Info | |
|---|---|
| Distance | 0.671666 |
| Path | /84/42148.jpg |
| label | 84 |
| Info | |
|---|---|
| Distance | 0.672583 |
| Path | /145/72520.jpg |
| label | 145 |
| Info | |
|---|---|
| Distance | 0.673422 |
| Path | /94/47006.jpg |
| label | 94 |
| Info | |
|---|---|
| Distance | 0.67446 |
| Path | /196/98207.jpg |
| label | 196 |
| Info | |
|---|---|
| Distance | 0.674789 |
| Path | /196/98021.jpg |
| label | 196 |
| Info | |
|---|---|
| Distance | 0.676092 |
| Path | /197/98911.jpg |
| label | 197 |
| Info | |
|---|---|
| Distance | 0.677318 |
| Path | /87/43785.jpg |
| label | 87 |
| Info | |
|---|---|
| Distance | 0.678071 |
| Path | /160/80147.jpg |
| label | 160 |
| Info | |
|---|---|
| Distance | 0.678366 |
| Path | /140/70208.jpg |
| label | 140 |
0
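The 20 outliers above are not spread evenly across classes. The sketch below counts how often each label appears among them, using the distance and label values transcribed from the report; the tuple layout is an assumption for illustration and not a fastdup data structure.

```python
from collections import Counter

# (distance, label) for each of the 20 rows in the report above, in report order.
outliers = [
    (0.600712, 198), (0.639867, 12), (0.642672, 94), (0.654982, 35),
    (0.663625, 10), (0.665014, 173), (0.666785, 197), (0.668334, 54),
    (0.668349, 78), (0.668735, 196), (0.66936, 54), (0.671666, 84),
    (0.672583, 145), (0.673422, 94), (0.67446, 196), (0.674789, 196),
    (0.676092, 197), (0.677318, 87), (0.678071, 160), (0.678366, 140),
]

per_label = Counter(label for _, label in outliers)

# Labels that contribute more than one outlier to the report.
repeated = {label: n for label, n in per_label.items() if n > 1}
print(repeated)
```

Labels that appear several times may point to classes with unusual or inconsistent images and are a reasonable place to start a manual review.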
fd.vis.similarity_gallery(slice='diff')
Generating gallery: 0%| | 0/7287 [00:00<?, ?it/s]
Generating gallery: 0%| | 0/20 [00:00<?, ?it/s]
Stored similar images visual view in work_dir/galleries/similarity.html
Similarity Report, label_score
| Info From | |
|---|---|
| label | 0 |
| from | /0/35.jpg |
| Info To | ||
|---|---|---|
| 0.906011 | /85/42517.jpg | 85 |
| 0.905423 | /190/95331.jpg | 190 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/513.jpg |
| Info To | ||
|---|---|---|
| 0.911764 | /85/42716.jpg | 85 |
| 0.907565 | /166/83446.jpg | 166 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/515.jpg |
| Info To | ||
|---|---|---|
| 0.933797 | /9/4557.jpg | 9 |
| 0.931858 | /93/46608.jpg | 93 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/521.jpg |
| Info To | ||
|---|---|---|
| 0.916001 | /5/2756.jpg | 5 |
| 0.915641 | /5/2731.jpg | 5 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/650.jpg |
| Info To | ||
|---|---|---|
| 0.923444 | /7/3800.jpg | 7 |
| 0.909015 | /17/8647.jpg | 17 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/657.jpg |
| Info To | ||
|---|---|---|
| 0.930722 | /166/83497.jpg | 166 |
| 0.930567 | /17/8749.jpg | 17 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/671.jpg |
| Info To | ||
|---|---|---|
| 0.925565 | /198/99447.jpg | 198 |
| 0.917802 | /17/8681.jpg | 17 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/692.jpg |
| Info To | ||
|---|---|---|
| 0.915914 | /15/7715.jpg | 15 |
| 0.907856 | /2/1496.jpg | 2 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/712.jpg |
| Info To | ||
|---|---|---|
| 0.906601 | /197/98725.jpg | 197 |
| 0.903525 | /196/98381.jpg | 196 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/732.jpg |
| Info To | ||
|---|---|---|
| 0.906571 | /5/2741.jpg | 5 |
| 0.900979 | /9/4555.jpg | 9 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/737.jpg |
| Info To | ||
|---|---|---|
| 0.930051 | /17/8949.jpg | 17 |
| 0.926385 | /35/17505.jpg | 35 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/763.jpg |
| Info To | ||
|---|---|---|
| 0.942992 | /195/97948.jpg | 195 |
| 0.940392 | /17/8642.jpg | 17 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/769.jpg |
| Info To | ||
|---|---|---|
| 0.923526 | /46/23481.jpg | 46 |
| 0.914404 | /5/2583.jpg | 5 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/852.jpg |
| Info To | ||
|---|---|---|
| 0.923934 | /148/74041.jpg | 148 |
| 0.920057 | /7/3839.jpg | 7 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/857.jpg |
| Info To | ||
|---|---|---|
| 0.906768 | /24/12085.jpg | 24 |
| 0.904972 | /191/95642.jpg | 191 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/868.jpg |
| Info To | ||
|---|---|---|
| 0.909222 | /3/1599.jpg | 3 |
| 0.905293 | /111/55995.jpg | 111 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/871.jpg |
| Info To | ||
|---|---|---|
| 0.914131 | /145/72763.jpg | 145 |
| 0.913387 | /9/4771.jpg | 9 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/899.jpg |
| Info To | ||
|---|---|---|
| 0.911091 | /7/3800.jpg | 7 |
| 0.905312 | /5/2839.jpg | 5 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/948.jpg |
| Info To | ||
|---|---|---|
| 0.924178 | /40/20152.jpg | 40 |
| 0.921691 | /198/99487.jpg | 198 |
| Query Image |
| Similar |
| Info From | |
|---|---|
| label | 1 |
| from | /1/964.jpg |
| Info To | ||
|---|---|---|
| 0.939224 | /35/17505.jpg | 35 |
| 0.939199 | /17/8749.jpg | 17 |
| Query Image |
| Similar |
| | from | to | label | label2 | distance | score | length |
|---|---|---|---|---|---|---|---|
| 4 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/0/35.jpg | [/home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/190/95331.jpg, /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/85/42517.jpg] | [0, 0] | [190, 85] | [0.905423, 0.906011] | 0.0 | 2 |
| 14 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/1/513.jpg | [/home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/166/83446.jpg, /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/85/42716.jpg] | [1, 1] | [166, 85] | [0.907565, 0.911764] | 0.0 | 2 |
| 16 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/1/515.jpg | [/home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/93/46608.jpg, /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/9/4557.jpg] | [1, 1] | [93, 9] | [0.931858, 0.933797] | 0.0 | 2 |
| 19 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/1/521.jpg | [/home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/5/2731.jpg, /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/5/2756.jpg] | [1, 1] | [5, 5] | [0.915641, 0.916001] | 0.0 | 2 |
| 68 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/1/650.jpg | [/home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/17/8647.jpg, /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/7/3800.jpg] | [1, 1] | [17, 7] | [0.909015, 0.923444] | 0.0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 7273 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49882.jpg | [/home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49977.jpg, /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49544.jpg] | [99, 99] | [99, 99] | [0.904351, 0.913638] | 100.0 | 2 |
| 7275 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49895.jpg | [/home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49799.jpg, /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49806.jpg] | [99, 99] | [99, 99] | [0.92261, 0.92414] | 100.0 | 2 |
| 7279 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49919.jpg | [/home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49734.jpg, /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49875.jpg] | [99, 99] | [99, 99] | [0.904262, 0.913118] | 100.0 | 2 |
| 7283 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49940.jpg | [/home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49877.jpg, /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49858.jpg] | [99, 99] | [99, 99] | [0.91175, 0.914616] | 100.0 | 2 |
| 7285 | /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49977.jpg | [/home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49644.jpg, /home/dnth/.cache/huggingface/datasets/tiny-imagenet/jpg_images/99/49705.jpg] | [99, 99] | [99, 99] | [0.913667, 0.917606] | 100.0 | 2 |
3796 rows × 7 columns
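In the table above, rows whose neighbours all carry a different label have a score of 0.0, and rows whose neighbours all share the query's label have a score of 100.0. One definition consistent with those rows is the percentage of neighbours whose label matches the query's label. That definition is an assumption made for this sketch, not fastdup's documented formula, and both `label_score` and `suggest_label` are hypothetical helpers.

```python
from collections import Counter


def label_score(query_label, neighbor_labels):
    """Percentage of neighbours sharing the query's label (assumed definition)."""
    if not neighbor_labels:
        return 0.0
    matches = sum(1 for label in neighbor_labels if label == query_label)
    return 100.0 * matches / len(neighbor_labels)


def suggest_label(neighbor_labels):
    """Return the label all neighbours agree on, or None if they disagree."""
    if not neighbor_labels:
        return None
    label, count = Counter(neighbor_labels).most_common(1)[0]
    return label if count == len(neighbor_labels) else None


# (query path, query label, neighbour labels), taken from rows 4, 19 and 7273 above.
rows = [
    ("/0/35.jpg", 0, [190, 85]),
    ("/1/521.jpg", 1, [5, 5]),
    ("/99/49882.jpg", 99, [99, 99]),
]

for path, label, neighbors in rows:
    print(path, label_score(label, neighbors), suggest_label(neighbors))
```

A row such as `/1/521.jpg`, where the score is 0.0 and both neighbours agree on label 5, is a stronger mislabel candidate than a row whose neighbours disagree with each other.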
Interactive Exploration
In addition to the static visualizations presented above, fastdup also offers interactive exploration of the dataset.
To explore the dataset and issues interactively in a browser, run:
fd.explore()
🗒 Note - This currently requires you to sign up (for free) to view the interactive exploration. Alternatively, you can visualize fastdup results in a non-interactive way using fastdup's built-in galleries shown in the cells above.
You'll be presented with a web interface that lets you conveniently view, filter, and curate your dataset.

Wrap Up
That's a wrap! In this notebook, we showed how to load a labeled dataset from Hugging Face Datasets and inspect it for duplicates, outliers, and potentially wrong labels.
Next, feel free to check out other tutorials:
- ⚡ Quickstart: Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
- 🧹 Clean Image Folder: Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
- 🖼 Analyze Image Classification Dataset: Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled ImageNet-style folder structure, have a go!
- 🎁 Analyze Object Detection Dataset: Learn how to load bounding box annotations for object detection and analyze them for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.
As usual, feedback is welcome! Questions? Drop by our Slack channel or open an issue on GitHub.






