Nowadays, there are inference-time alignment techniques for language models that can be extended to multimodal models as well. The basic idea is that instead of doing alignment during pre-training or right after it, that's to say making sure the output fits a reward model, it can be done at inference time. That means it can run on your own GPU, maybe on your phone, tuning a video so that it's maximally, effectively polarizing for you, using a reward model that you helped train.
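As a concrete, if simplified, illustration of what "alignment at inference time" can mean, one common recipe is best-of-N sampling: draw several candidate outputs and keep the one a reward model scores highest, with no retraining of the base model. The Python sketch below is a minimal, hypothetical version under those assumptions; `generate` and `reward` are stand-in callables (here toy stubs), not any real model's API.

```python
import random
from typing import Callable, List

def best_of_n(
    generate: Callable[[str], str],       # candidate generator, e.g. a sampled LM call
    reward: Callable[[str, str], float],  # reward model scoring (prompt, candidate)
    prompt: str,
    n: int = 16,
) -> str:
    """Inference-time alignment via best-of-N sampling: draw n candidates,
    return the one the reward model scores highest. The base model is never
    updated; the reward model steers only the final selection."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))

# Toy stand-ins so the sketch runs end to end; a real system would call an
# LM for `generate` and a learned, possibly user-personalized, reward model.
if __name__ == "__main__":
    options = ["neutral summary", "mildly engaging take", "maximally engaging take"]
    gen = lambda _prompt: random.choice(options)
    rew = lambda _prompt, c: ("engaging" in c) + 2 * ("maximally" in c)  # crude proxy reward
    print(best_of_n(gen, rew, "Summarize today's news", n=8))
```

The point of the toy reward here is the one made above: if the reward model encodes what keeps a particular user engaged, the same selection loop that "aligns" outputs can just as easily be run on-device to pick the most polarizing candidate for that user.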