simple near-duplicate detection
This is a simple near-duplicate detection web app based on an image-hash algorithm and a BK-Tree data structure.
When webapp.py is run, initialize_bktree() is called and the BK-Tree is initialized from files/hashlist.txt.
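The start-up step can be pictured roughly as follows. This is a hand-rolled sketch, not the project's code: the line format of the hash list ("id hexhash"), the field order of Img, and the BKTree class and build_tree helper are all assumptions made for illustration.

```python
from collections import namedtuple

# Assumed shape of a stored item: the hash (as an int) plus the image ID.
Img = namedtuple("Img", ["hash", "id"])


def hamming(a, b):
    """Number of differing bits between the hashes of two Img items."""
    return bin(a.hash ^ b.hash).count("1")


class BKTree:
    """Minimal BK-tree: each node is (item, {distance: child_node})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            parent, children = node
            d = self.distance(item, parent)
            if d in children:
                node = children[d]  # descend along the edge labelled d
            else:
                children[d] = (item, {})
                return

    def find(self, query, max_dist):
        """Return (distance, item) pairs with distance <= max_dist."""
        if self.root is None:
            return []
        found, stack = [], [self.root]
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= max_dist:
                found.append((d, item))
            # Triangle inequality: only these edges can lead to matches.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        found.sort(key=lambda pair: pair[0])
        return found


def build_tree(lines):
    """Build a tree and an id -> hash dict from 'id hexhash' lines (assumed format)."""
    tree, id_hash = BKTree(hamming), {}
    for line in lines:
        if not line.strip():
            continue
        image_id, hex_hash = line.split()
        id_hash[image_id] = int(hex_hash, 16)
        tree.add(Img(id_hash[image_id], image_id))
    return tree, id_hash


# Stand-in for the lines of the hash list file.
sample = ["a ff00", "b ff01", "c 00ff"]
tree, id_hash = build_tree(sample)
matches = tree.find(Img(0xFF00, None), 1)
```

The point of the BK-Tree is the pruning step in find: subtrees whose edge label falls outside the range d - max_dist to d + max_dist are never visited, so a query does not have to compare against every stored hash.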
Run webapp.py to start the server. Open a browser and navigate to "0.0.0.0:8000/html" for a simple client demonstration.
You can also send POST requests using Postman or Advanced REST Client. The content type should be multipart/form-data and the parameters should be:
- type="file", name="theimage" (the jpg or png file to add/search)
- name="request_query", value: "image" or "hash" or "id"
- name="request_id", value: ID of the added/searched image
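For scripted use, a request with the parameters listed above could be assembled with the Python standard library along these lines. This is only a sketch: the helper name build_multipart_request, the file name cat.jpg, the placeholder image bytes, and the ID value are made up for the example, and the request is built but not sent.

```python
import uuid
import urllib.request


def build_multipart_request(url, image_bytes, filename, request_query, request_id):
    """Build (but do not send) a multipart/form-data POST request."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields: request_query and request_id.
    for name, value in (("request_query", request_query), ("request_id", request_id)):
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # The file field, named "theimage" as the server expects.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="theimage"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode()
        + image_bytes
        + b"\r\n"
    )
    # Closing boundary marks the end of the body.
    parts.append(f"--{boundary}--\r\n".encode())
    body = b"".join(parts)
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_multipart_request(
    "http://0.0.0.0:8000/image/add/",  # add route on the demo address
    b"<image bytes>",                  # placeholder, normally the file's contents
    "cat.jpg",
    "image",
    "42",
)
```

With the server running, passing req to urllib.request.urlopen would send it; the response should be the JSON status returned by the add route.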
- webapp.py: Main module that sets up a Sanic web server. Functions include:
  - post_file_add(): Handles the route /image/add/. Send a POST request (as described above) to add a new image. An image is added together with its ID. Returns a JSON response with status "file received" or "existing ID".
  - post_file_search(): Handles the route /image/search/. Searches for an image by ID or by image. Returns a JSON response with a list of duplicate IDs.
  - notify_server_stopping(app, loop): Called before the server stops to persist the added hashes by calling image_helper.persist_hash_tree().
- image_helper.py: This module contains all the background adding and searching functions. Functions include:
  - initialize_bktree(): Reads previously saved hashes from /files/hashlist.txt and builds a BK-Tree of Img(hash, id) objects. Img is also a collection.
  - process_image(file): Returns a Pillow Image object from a file.
  - find_hash(image): Takes an Image and returns its hash using the function in the hash_helper module.
  - find_hash_by_id(id): Takes an ID and looks it up in the id_hash_dict dictionary, which contains all ID-hash pairs.
  - add_image(image_hash, id): Creates an Img, checks whether the ID already exists, then adds it to the hash_tree and id_hash_dict variables.
  - find_duplicates(image_hash, distance): Searches the hash_tree for images whose Hamming distance from the query hash is at most the distance argument.
  - mydistance(img1, img2): Takes two Img objects and returns the Hamming distance between their hashes.
  - persist_hash_tree(): Saves id_hash_dict to the /files/hashlist.txt file.
- hash_helper.py: Computes the hash of an image using the imagehash library.
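To give a feel for what an image hash is and why Hamming distance works as a similarity measure, here is a pure-Python average hash over a tiny grid of grayscale values. It is an illustration only: it does not use the imagehash API, the README does not say which hash function the project calls, and the function names and pixel grids below are invented for the example.

```python
def average_hash(pixels):
    """Hash a 2D grid of grayscale values: one bit per pixel,
    set when the pixel is brighter than the grid's mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bit positions where the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


original = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
# Same picture, slightly brightened: every pixel shifts by the same amount.
brighter = [[value + 20 for value in row] for row in original]
# A different picture: the bright and dark blocks are swapped.
inverted = [[210 - value for value in row] for row in original]

h_original = average_hash(original)
h_brighter = average_hash(brighter)
h_inverted = average_hash(inverted)

print(hamming_distance(h_original, h_brighter))  # near-duplicate: small distance
print(hamming_distance(h_original, h_inverted))  # different image: large distance
```

A real implementation would first shrink the image to a small grayscale grid before thresholding; the grids here stand in for that already-reduced image.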