In the cross-attention modules, the Q input (the latent array) passes through the Q projection matrix, which maps it to the actual queries. Likewise, the long input array passes through the K and V projection matrices to be mapped into the actual keys and values of the cross-attention module. These Q, K, V projection matrices (learned linear layers) ensure that the inner dimensions of Q and K are identical, which is required for computing the attention scores QKᵀ.
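The projection step above can be sketched in NumPy. This is a minimal illustration, not the actual implementation: the dimension sizes and the random projection matrices are hypothetical stand-ins for learned weights. Note that `W_q` and `W_k` both map into the same inner dimension `d_attn` even though the latent and input arrays have different widths, and that the output length follows the latent array, not the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration.
n_latents, d_latent = 8, 32    # latent array: small, learned
n_inputs,  d_input  = 128, 64  # input array: long, raw features
d_attn = 16                    # shared inner dimension of Q and K

latents = rng.normal(size=(n_latents, d_latent))
inputs  = rng.normal(size=(n_inputs, d_input))

# Projection matrices (learned in practice; random here for the sketch).
W_q = rng.normal(size=(d_latent, d_attn))
W_k = rng.normal(size=(d_input, d_attn))  # maps inputs to the same inner dim as Q
W_v = rng.normal(size=(d_input, d_attn))

Q = latents @ W_q   # (n_latents, d_attn)
K = inputs  @ W_k   # (n_inputs, d_attn)
V = inputs  @ W_v   # (n_inputs, d_attn)

# Scaled dot-product cross-attention: every latent attends over all inputs.
scores = Q @ K.T / np.sqrt(d_attn)                        # (n_latents, n_inputs)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)            # softmax over inputs
out = weights @ V                                          # (n_latents, d_attn)

print(out.shape)  # (8, 16): output size is set by the latent array
```

Because the attention map is `n_latents × n_inputs`, the cost scales linearly with the input length rather than quadratically, which is the point of using a small latent array as the query side.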