Optimizing against the reward model initially improves summaries according to human evaluators, but eventually overfits, giving worse summaries.