Could also plot the value vectors for each attention head, though that would definitely add to the computational load.
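A rough sketch of what that might involve, assuming the value projections for one layer are available as a `seq_len x d_model` list of lists and that each head is a contiguous slice of the last dimension (a common layout, but an assumption here). `split_value_heads` and `value_norms` are hypothetical helper names, not part of any library, and the input is toy data. Reducing each vector to its L2 norm is just one way to get something small enough to plot.

```python
import math

def split_value_heads(values, num_heads):
    """Split per-token value vectors (seq_len x d_model) into per-head slices.

    Returns a nested list indexed as [head][token] -> list of head_dim floats.
    Assumes heads occupy contiguous slices of the model dimension.
    """
    d_model = len(values[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    head_dim = d_model // num_heads
    return [
        [row[h * head_dim:(h + 1) * head_dim] for row in values]
        for h in range(num_heads)
    ]

def value_norms(per_head):
    """L2 norm of each value vector: one scalar per (head, token) pair."""
    return [
        [math.sqrt(sum(x * x for x in vec)) for vec in head]
        for head in per_head
    ]

# Hypothetical toy input: 2 tokens, d_model = 4, split into 2 heads of size 2.
values = [
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 6.0, 8.0],
]
per_head = split_value_heads(values, num_heads=2)
norms = value_norms(per_head)  # shape: num_heads x seq_len, ready for a heatmap
```

The extra load comes from the count: every layer yields `num_heads * seq_len` vectors to extract and draw, on top of whatever is already being plotted.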