This extends Assignment 1 using persistent data structures
and additional similarity metrics. It requires two programs.
- For each of at least 100 URLs, create a persistent
file-based B-Tree. Give each file a name based on its
URL in order to locate it easily.
- Load each B-tree with frequencies (possibly along
with other data) extending or changing those in
assignment 1 if applicable.
- Implement and use a fixed size buffer cache to reduce IO.
- Pre-categorize pages into 5 to 10 clusters using
k-means, k-mediods, or a similar metric.
Extend Assignment 1 to display a category (cluster) and most
similar key from the above data structures.