evaluate_external_ratings function

Evaluates how well new typicality ratings predict human ratings and compares their performance to LLM baselines.
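
The sketch below illustrates the general idea of this kind of evaluation. It is not the function's actual implementation or signature. It assumes that "predicting" human ratings is measured with a Pearson correlation; the helper names `pearson` and `compare_to_baselines`, the `baseline_llm` label, and all data values are hypothetical and made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def compare_to_baselines(human, new, baselines):
    """Hypothetical helper: correlate each rating source with human ratings.

    `baselines` maps a baseline name to its list of ratings, aligned
    item by item with `human` and `new`.
    """
    scores = {"new": pearson(new, human)}
    for name, ratings in baselines.items():
        scores[name] = pearson(ratings, human)
    return scores


# Made-up illustrative ratings; not taken from any real dataset.
human = [10.0, 35.0, 60.0, 80.0, 95.0]
new = [12.0, 30.0, 65.0, 78.0, 90.0]
baselines = {"baseline_llm": [50.0, 20.0, 40.0, 90.0, 60.0]}

scores = compare_to_baselines(human, new, baselines)
for source, r in scores.items():
    print(f"{source}: r = {r:.3f}")
```

A higher correlation for `new` than for a baseline would suggest that the new ratings track human judgments more closely on these items. The real function may use a different statistic or report additional diagnostics.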