insane fast is not very fast at all #53
…and put btrfs.static into the RELEASE tab.
My guess is that instead of using a sorting algorithm to find rows of duplicated files, your code might have constructed an n²−n sized array (where n is the number of files under the recursive directory) and checked every file pair in that array, right?
Yes, right. It checks two files at a time, since it doesn't store previously computed results on disk. "Insane" mode is fast for a real dedupe between two files (say, comparing two 100 GB files), but it's slow with a large number of files. I'll update the tool to store previously computed results in a DB (SQLite); that should improve overall performance.
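To make the contrast concrete, here is a minimal sketch of the grouping approach described above. It is not dduper's code: the fingerprint() helper simply hashes file contents as a stand-in for a csum-based fingerprint, but it shows how bucketing files by one fingerprint each avoids checking every pair.

```python
import hashlib
import os
from collections import defaultdict

def fingerprint(path, block_size=1 << 20):
    """Stand-in fingerprint: a hash of the file contents. A csum-based
    fingerprint taken from btrfs would avoid reading the data itself."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicate_groups(root):
    """Bucket files by fingerprint in a single pass: one fingerprint per
    file plus dict lookups, instead of n²−n pairwise checks."""
    groups = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            groups[fingerprint(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```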
Hi, how about this real quick way to solve this problem? First, bear with me while I explain how people deduplicated files before dedup tools were popular.

Deduplicating with existing Linux commands

Back in the 90s, when Bojack Horseman was horsing around, people used to do this:
Which means: compute the md5 checksum of every file under the directory, then sort the output (so duplicated files are clustered together), then run a duplicate filter such as uniq.

Assumption

Let's assume that there is a quick way to get a file's fingerprint of some sort by using btrfs.

New method

Now, given that sorting and uniqueness (duplicate detection) are already part of the Linux command-line set, you only need to add a parameter such as --fingerprint, which outputs every file's fingerprint followed by the file's name. The fingerprint can be a 32-character value computed as a checksum of all csums of all the file's blocks. A user can then use a pipe to get the duplicated items.
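As an illustration only (the --fingerprint option is the commenter's proposal, not an existing dduper flag), a sketch of such an output mode could look like the following. The MD5 of file contents here is a stand-in for hashing the file's btrfs block csums:

```python
import hashlib
import os
import sys

def file_fingerprint(path):
    """Illustrative 32-character fingerprint: an MD5 of the file contents.
    The proposal above would hash the file's btrfs block csums instead,
    avoiding a read of the data itself."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()  # 32 hex characters

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            print(f"{file_fingerprint(path)}\t{path}")
```

Sorting that output clusters identical fingerprints together, and a uniqueness filter then reveals the duplicate groups, just as in the md5sum workflow described above.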
And you can put that info in the EXAMPLE section of your man page.

The insanely fast but inaccurate method already working today

The following method doesn't take advantage of btrfs, meaning btrfs should be able to outperform it. There is a shortcut to get potential duplicates insanely fast by simply comparing file sizes (in the case of video mp4 files larger than 1 GB, files of equal size are usually duplicates).
The additional … But there are video files of equal size with different content, so a hash should be computed to be sure.
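A minimal sketch of this two-stage shortcut (again not dduper's code, just the idea under discussion): group files by size first, which costs only a stat per file, then hash only the files that share a size.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Two-stage duplicate detection: group by size first (cheap stat),
    then confirm with a content hash only for files that share a size."""
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot be part of a duplicate pair
        for path in paths:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[(size, h.hexdigest())].append(path)

    return [paths for paths in by_hash.values() if len(paths) > 1]
```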
Off-topic: one of my other projects' servers, with more than 225,000 users, burned 😭 http://community.webminal.org/t/webminal-org-down-status-update-thread/1481 I will get back to this issue soon.
Would it be possible to perform block scanning and hashing in parallel using a process pool, or even using Numba, please?
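A minimal sketch of the process-pool idea, assuming nothing about dduper's internals: hashing is mostly I/O plus a C-implemented digest, so a standard-library ProcessPoolExecutor is a more natural fit here than Numba, which targets numeric kernels.

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

def hash_file(path, block_size=1 << 20):
    """Hash one file; executed in a worker process so several files are
    read and hashed concurrently."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            h.update(chunk)
    return path, h.hexdigest()

def hash_tree_parallel(root, workers=None):
    """Hash every file under root using a pool of worker processes."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
    ]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(hash_file, paths))
```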
A missing point is the os.walk()... If you have many more than 767 files, you will get some problems with the current implementation here: Lines 427 to 443 in 8745078
Added an SQLite DB to keep track of file csums fetched from the btrfs csum-tree. It should increase performance. Could you please check the latest code from this repo? Thanks!
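For readers following along, here is a rough sketch of what such a cache can look like. The database filename, table name, columns, and the compute_csum callback are illustrative assumptions, not dduper's actual schema.

```python
import sqlite3

def open_cache(db_path="dduper.db"):
    """Open (or create) a per-file checksum cache. The schema here is
    illustrative only, not dduper's actual layout."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS csum_cache (
               path  TEXT PRIMARY KEY,
               mtime REAL NOT NULL,
               csum  TEXT NOT NULL
           )"""
    )
    return conn

def cached_csum(conn, path, mtime, compute_csum):
    """Return the cached csum for path if it is still fresh, otherwise
    recompute it via compute_csum(path) and store the result."""
    row = conn.execute(
        "SELECT mtime, csum FROM csum_cache WHERE path = ?", (path,)
    ).fetchone()
    if row and row[0] == mtime:
        return row[1]
    csum = compute_csum(path)
    conn.execute(
        "INSERT OR REPLACE INTO csum_cache (path, mtime, csum) VALUES (?, ?, ?)",
        (path, mtime, csum),
    )
    conn.commit()
    return csum
```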
I believe this still did not solve the problem. After 18h24m8s of waiting, I quit the process.
bedup scans this filesystem in a few minutes (after purging its cache).
@adamryczkowski could you please share the dduper.log file? It should be available in the directory where you ran the command.
It is very short. Here you go:
Ah, okay. Looks like it never completed the DB-population phase. Can you share the output of the command below?
It should display the number of records (file entries) in the DB.
When recursing into a directory of 6.2 GB with 767 files in it, I thought the insanely fast one would:
Since the csum is already computed, this shouldn't take more than a minute on a modern computer. Instead, the process has been running for 30 minutes now and the output is already showing 2020 non-matching results.
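For a sense of scale, the arithmetic behind that expectation: checking every unordered pair of those 767 files means 767 × 766 / 2 = 293,761 comparisons, while a fingerprint-grouping pass needs only one lookup per file.

```python
n = 767                      # number of files reported in the issue
pairwise = n * (n - 1) // 2  # comparisons if every unordered pair is checked
print(pairwise)              # 293761 pairwise comparisons vs. 767 fingerprint lookups
```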