For i2t, the final input to the LLM is still a continuous feature representation

Hi, it seems that for visual understanding, although the model uses VQ to discretize the image encoding, the final input to the LLM is still a continuous feature representation.

In that case, I am not sure it can still be called a discrete tokenizer.
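
To make the question concrete, here is a minimal sketch of how I understand a generic VQ step. This is not taken from this repository: the function names (`quantize`, `lookup`), the sizes, and the random codebook are all hypothetical. The tokenizer output is a list of integer ids, but what would be fed onward is the looked-up codebook vector, which is a float vector.

```python
import random

def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""
    indices = []
    for f in features:
        # Squared Euclidean distance from this feature to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices

def lookup(indices, codebook):
    """Map discrete ids back to their (float-valued) codebook vectors."""
    return [list(codebook[i]) for i in indices]

random.seed(0)
K, D, N = 8, 4, 5  # hypothetical codebook size, feature dim, number of patches
codebook = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
features = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]

token_ids = quantize(features, codebook)   # discrete: integers in [0, K)
llm_inputs = lookup(token_ids, codebook)   # float vectors, but only K possible values

print(token_ids)
print(llm_inputs[0])
```

In this sketch the vectors are float-valued, yet each one is fully determined by its integer id, so at most `K` distinct vectors can occur. My question is whether the features given to the LLM here are of this kind (a pure lookup from the ids) or whether they carry information beyond the ids.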