git clone <REPO_URL>
cd search-assignment
code .Run the data generator to create four datasets with different characteristics:
python data_generator.py
# You may need to run python3 data_generator.pyThis will create:
datasets/customer_ids.json- 100,000 unsorted customer IDsdatasets/product_catalog.json- 50,000 pre-sorted product IDsdatasets/config_settings.json- 500 small config keys (unsorted)datasets/dictionary_words.json- 10,000 sorted dictionary wordsdatasets/test_cases.json- Test cases for validation
Open starter_code.py and complete the three search functions:
linear_search()- Sequential search through unsorted databinary_search_iterative()- Iterative binary search on sorted data. You that function is provided sorted data only.binary_search_recursive()- Recursive binary search on sorted data. You that function is provided sorted data only.
Uncomment the test functions in the __main__ block:
if __name__ == "__main__":
test_search_correctness() # Verify implementations work
benchmark_all_datasets() # Compare performance on all datasets
analyze_preprocessing_costs() # Analyze sorting trade-offsThen run:
python starter_code.py
# or python3 starter_code.pyScenario: Customer service database where IDs arrive in order received
Question: Is it worth sorting 100K entries to enable binary search?
Scenario: E-commerce inventory system maintained in sorted order
Question: How much faster is binary search on pre-sorted data?
Scenario: Application configuration frequently accessed during runtime
Question: Does algorithm choice matter for small datasets?
Scenario: Autocomplete feature for search-as-you-type
Question: Why keep dictionary sorted for prefix matching?
Submit two files:
- Your completed Python file (
starter_code.pyor rename tosearch_analysis.py) - Written analysis document (Google Doc or Word) containing:
- Performance Results section (150-200 words)
- Preprocessing Analysis section (150-200 words)
- Algorithm Selection section (200-300 words)
- Total: 500-700 words
Good luck! Remember: the best algorithm depends on your data characteristics and usage patterns.