
Efficient Text Search for Large Term List: Python vs PostgreSQL vs Elasticsearch


I have a list of terms that vary in length from 1 to 10 words, with approximately 500,000 entries. My goal is to search for these terms in a long text (converted from a PDF, typically 1.5 to 2 pages). I need to perform the searches not only as exact matches but also with fuzzy matching (e.g., the term ‘Lionel Messi’ should match ‘Lionel Mesi’ in the text) and near/proximity matching (e.g., the term ‘Lionel Messi’ should match ‘Lionel J. Messi’ in the text).
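To make the fuzzy and near requirements concrete, this is roughly the scoring behaviour I am after (a minimal sketch with rapidfuzz; the library choice and the 90-point threshold are only illustrative, not part of my current code):

    from rapidfuzz import fuzz

    # Fuzzy: a small spelling deviation should still count as a match.
    print(fuzz.ratio("Lionel Messi", "Lionel Mesi"))                 # ~96
    # Near: extra tokens inside the phrase should still count as a match.
    print(fuzz.token_set_ratio("Lionel Messi", "Lionel J. Messi"))   # 100

    def is_match(term, candidate, threshold=90):
        """Illustrative acceptance test combining both behaviours."""
        return max(fuzz.ratio(term, candidate),
                   fuzz.token_set_ratio(term, candidate)) >= threshold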

I aim to solve this problem in near real time (1-2 seconds). I’ve tried trie data structures and parallelization, but once fuzzy matching comes into play, the size of the list and the length of the text push processing time to about 30 seconds.
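For reference, a single-pass exact scan over the text is already fast; it is the fuzzy and near variants that blow the time budget. A minimal sketch of that exact baseline with pyahocorasick (the library and the tiny term list here are only illustrative):

    import ahocorasick  # pip install pyahocorasick

    terms = ["Lionel Messi", "Buenos Aires", "Ballon d'Or"]  # stands in for the ~500k-entry list

    # Build the automaton once; it can be reused for every incoming document.
    automaton = ahocorasick.Automaton()
    for idx, term in enumerate(terms):
        automaton.add_word(term.lower(), (idx, term))
    automaton.make_automaton()

    def exact_matches(text):
        """Yield (term, start_offset) for every exact occurrence in one pass."""
        lowered = text.lower()
        for end_pos, (_, term) in automaton.iter(lowered):
            yield term, end_pos - len(term) + 1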

How should I approach this problem?

  1. Can I handle it on the fly with Python libraries (parallelization, trie structures, etc.)?
  2. Are there features in PostgreSQL that support such searches? (See the pg_trgm sketch after this list.)
  3. Should I use a framework like Elasticsearch? (See the query sketch after this list.)
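For question 2, the closest built-in feature I am aware of is the pg_trgm extension (trigram similarity backed by a GIN index). A rough sketch of how I imagine looking up one candidate phrase from the PDF text (the connection string, table and column names are hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=terms_db")  # hypothetical database
    cur = conn.cursor()

    # One-time setup, run separately:
    #   CREATE EXTENSION IF NOT EXISTS pg_trgm;
    #   CREATE INDEX terms_trgm_idx ON terms USING gin (term gin_trgm_ops);

    phrase = "Lionel Mesi"
    cur.execute(
        "SELECT term, similarity(term, %s) AS sim "
        "FROM terms WHERE term %% %s "   # %% escapes the trigram % operator for psycopg2
        "ORDER BY sim DESC LIMIT 5",
        (phrase, phrase),
    )
    print(cur.fetchall())

For question 3, I imagine indexing the term list as documents and querying each candidate phrase from the text with fuzziness enabled. A sketch of such a query with the elasticsearch-py 8.x client (the index and field names are hypothetical):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

    resp = es.search(
        index="terms",
        query={"match": {"term": {"query": "Lionel Mesi", "fuzziness": "AUTO"}}},
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_source"]["term"], hit["_score"])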


