Don't take my word for it :)
Using ChatGPT 4o, we evaluated responses across these conditions and applied paired sample t-tests to assess statistical significance. Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.
Read the paper on arXiv
And the boss:
Prof. Akhil Kumar of Smeal College @ PSU
"To assess whether differences in model accuracy across varying politeness levels were statistically significant, we used the paired sample t-test. This test was best suited for our experimental design, wherein the same set of questions was presented to the language model under different tone conditions."
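For anyone curious what a paired sample t-test looks like in practice, here is a minimal sketch using only the Python standard library. The accuracy numbers are made up for illustration and are not the paper's data, and the helper name `paired_t_statistic` is hypothetical. It computes the t statistic from the per-pair differences, which is the core of the test the quote describes.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")

    # The paired test works on the differences within each pair.
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")

    # t = mean difference divided by its standard error.
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical accuracies (%) for the same question set under two tones.
# These values are invented for the example, not taken from the paper.
very_polite = [80.0, 82.0, 80.0, 82.0, 80.0]
very_rude = [84.0, 86.0, 86.0, 84.0, 84.0]

t_stat, dof = paired_t_statistic(very_rude, very_polite)
print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```

To turn the t statistic into a p-value you need the t distribution; if SciPy is installed, `scipy.stats.ttest_rel(very_rude, very_polite)` should report both the statistic and the p-value in one call.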