- cross-posted to:
- data_structures
- cross-posted to:
- data_structures
Traditionally, algorithms for counting distinct items in a stream of data would store all the items. A new algorithm, called CVM, uses randomization to estimate the number of distinct items with minimal memory usage. The trick is to keep track of items by recording them and then randomly deleting some. The probability of an item staying on the list is related to the number of rounds it survives. With this method, the researchers were able to accurately estimate the number of distinct words in Hamlet.
Optimize Memory and Performance with the CVM Algorithm in JavaScript: A Comprehensive Guide for Text Analysis, Big Data, and More
const cvmAlgorithm = t => { const f = n => !(Math.random() * (1 << n) | 0); const w = t.match(/\b\w+\b/g) || []; let r = 1, u = new Set(); for (let i of w.map(x => x.toLowerCase())) u.size < 100 ? u.add(i) : f(r) && u.delete(i), u.size >= 100 && r++, u.size > 50 && (u = new Set([...u].filter(() => f(r)))); return [...u]; }; // Example usage: const text = "To be, or not to be, that is the question..."; const finalWords = cvmAlgorithm(text); console.log(finalWords); // [ 'to', 'be', 'or', 'not', 'that', 'is', 'the', 'question' ]