fix bug caused by deleting keys as we iterated through index
closes #1029
fgregg authored Jun 1, 2022
1 parent 0ee14ec commit 1d01808
Showing 1 changed file with 7 additions and 3 deletions.
10 changes: 7 additions & 3 deletions dedupe/canopy_index.py
@@ -21,14 +21,13 @@ def initSearch(self):
         N = len(self.index._docweight)
         threshold = int(max(1000, N * 0.05))

+        stop_words = []
         self._wids_dict = {}

         bucket = self.index.family.IF.Bucket
         for wid, docs in self.index._wordinfo.items():
             if len(docs) > threshold:
-                word = self.lexicon._words[wid]
-                logger.info("Removing stop word {}".format(word))
-                del self.index._wordinfo[wid]
+                stop_words.append(wid)
                 continue
             if isinstance(docs, dict):
                 docs = bucket(docs)
@@ -37,6 +36,11 @@ def initSearch(self):
             term = self.lexicon._words[wid]
             self._wids_dict[term] = (wid, idf)

+        for wid in stop_words:
+            word = self.lexicon._words.pop(wid)

oreccb Jun 8, 2022

#1029 @fgregg - I think this line needs to be word = self.lexicon._words.get(wid). I ran a training today and got a KeyError on the above line term = self.lexicon._words[wid] since wids were popped and removed from self.lexicon._words.

fgregg Jun 30, 2022

Author Contributor

Can you open a new issue with a traceback?

oreccb Jul 1, 2022

for sure: #1069
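
The distinction raised in this thread, `pop` versus `get`, can be illustrated with a minimal sketch. It uses a plain `dict` as a stand-in for `self.lexicon._words` (an assumption; the real container may behave differently), and it does not establish what actually caused the reported `KeyError`.

```python
# Hypothetical stand-in for a wid -> word mapping; not dedupe's real lexicon.
words = {7: "street"}

# get() reads the value and leaves the key in place.
peeked = words.get(7)
still_there = 7 in words

# pop() reads the value and removes the key.
popped = words.pop(7)

# Any later subscript lookup for the same wid now fails.
try:
    words[7]
    lookup_error = None
except KeyError as exc:
    lookup_error = exc

# get() on a missing key returns None instead of raising.
after_pop = words.get(7)
```

With `get`, the word stays available to any later `self.lexicon._words[wid]` lookup; with `pop`, it does not, which is the trade-off the comment is pointing at.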

+            logger.info("Removing stop word {}".format(word))
+            del self.index._wordinfo[wid]
+
     def apply(self, query_list, threshold, start=0, count=None):
         _wids_dict = self._wids_dict
         _wordinfo = self.index._wordinfo
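
The pattern this commit adopts, collecting keys during iteration and deleting them afterwards, can be sketched in isolation. The example below is a minimal illustration, not the library's code: plain `dict` objects stand in for `self.index._wordinfo` and `self.lexicon._words`, and the function names are hypothetical. A plain `dict` raises `RuntimeError` when its size changes mid-iteration; other mapping types may fail differently or silently.

```python
def remove_stop_words_unsafe(wordinfo, threshold):
    """Delete while iterating; a plain dict raises RuntimeError."""
    for wid, docs in wordinfo.items():
        if len(docs) > threshold:
            del wordinfo[wid]


def remove_stop_words_safe(wordinfo, words, threshold):
    """Two passes: record the keys first, mutate after iteration ends."""
    stop_words = [wid for wid, docs in wordinfo.items() if len(docs) > threshold]
    removed = []
    for wid in stop_words:
        removed.append(words.pop(wid))
        del wordinfo[wid]
    return removed


# Toy data (hypothetical): wid -> document ids, and wid -> word.
wordinfo = {1: [10, 11, 12], 2: [10], 3: [11, 12, 13, 14]}
words = {1: "the", 2: "acme", 3: "inc"}

# Run the unsafe version on a copy so the original data stays intact.
try:
    remove_stop_words_unsafe(dict(wordinfo), threshold=2)
    unsafe_error = None
except RuntimeError as exc:
    unsafe_error = exc

removed = remove_stop_words_safe(wordinfo, words, threshold=2)
```

The two-pass version costs one extra list of keys, which is small relative to the index itself, and keeps the iteration free of side effects on the mapping being walked.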