Finding and deleting duplicate files

This entry was posted on Wednesday, 20 October, 2010

Okay, so you have a huge pile of mp3s, somehow managed to copy them repeatedly all over the place, and now only want one copy of each? (Hey! I do this all the time, copying them from machine to machine!)
The best way to check that they are “identical” is with md5sum: two files with the same checksum have, for all practical purposes, the same contents, whatever they are called. This is how I deal with my problem.

find . -type f ! -name md5list -exec md5sum {} + > md5list # this gives me a file called md5list with one "checksum  filename" line per file
awk '{print $1}' md5list | sort | uniq -d > duplist # this lists every checksum that occurs more than once
for i in $(cat duplist) ; do grep "^$i" md5list | sed '1d' | sed "s/^$i  //" ; done > rmlist # this outputs a list of files minus the first/top one so we are still left with one copy
mkdir -p bin ; while IFS= read -r line ; do echo "removing $line" ; mv "$line" bin/ ; done < rmlist # this moves them all to a dir called bin/ which you can remove later
echo "check bin/ for any files you accidentally deleted" # letting you know the above!

You probably want to remove the files md5list, duplist and rmlist after you are done 🙂
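
If you would rather not have three temporary files lying around, the same idea can be squeezed into one pipeline. This is only a sketch: the function name list_dups is made up for this example, it assumes GNU md5sum's output layout (a 32-character checksum, a two-character separator, then the filename), and it will not cope with filenames containing newlines or backslashes. It only prints the files it would remove; nothing gets moved or deleted.

```shell
# list_dups DIR (hypothetical helper, not part of any tool):
# print every file under DIR whose MD5 checksum was already seen on an
# earlier line; the first file of each identical group is not printed.
list_dups() {
    find "$1" -type f -exec md5sum {} + |
        sort |                                   # identical checksums end up next to each other
        awk 'seen[$1]++ { print substr($0, 35) }'  # assumes the filename starts at column 35
}
```

Something like list_dups . > rmlist should give a list you can check by eye and then feed to the mv loop above.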

2 Responses to “Finding and deleting duplicate files”

  1. You might want to look at the hardlink tool by Jakub Jelinek (https://fedorahosted.org/hardlink/browser/hardlink.c). This way you don’t need to delete anything.

    If, however, you do want to delete the duplicate entries, you could do something like:

    find . -type f -links +1 -printf '%i %p\n' | sort -n

  2. cheers 🙂
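
The find command in the first response prints the inode number and path of every file that has more than one hard link, sorted so that names sharing an inode sit together. A rough sketch of taking that one step further, printing every name except the first for each inode, might look like the following. The function name extra_links is made up for this example, -printf is a GNU find feature, and the sketch assumes everything lives on one filesystem (inode numbers are only unique per filesystem) and that no filename contains a newline.

```shell
# extra_links DIR (hypothetical helper, not part of any tool):
# for each inode under DIR with more than one hard link, print every
# path except the first one listed.
extra_links() {
    find "$1" -type f -links +1 -printf '%i %p\n' |
        sort -n |                                        # group paths by inode number
        awk 'seen[$1]++ { sub(/^[0-9]+ /, ""); print }'  # skip the first path per inode, strip the inode
}
```

Bear in mind that a file can also have hard links outside the directory you are searching, so it is worth reading the output before removing anything.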

