This is purely theoretical. But say you trained an NN to take the previous 120 frames of a video and predict the next 120, and you could run it 60 times a second. You could then have a video player that showed the next frame from the NN's prediction, then re-predicted as each real frame came in. My theory is that it would be good enough to predict the upcoming frames and essentially produce smooth (fake) video from a potentially choppy feed. If it could run fast enough.
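A minimal sketch of that playback loop, with the trained model stubbed out as simple linear extrapolation (the 120-frame window comes from the comment above; everything else here is made up for illustration, not a real implementation):

```python
from collections import deque
import numpy as np

WINDOW = 120  # past frames fed to the model, per the comment above

def predict_next(frames):
    """Stand-in for the trained NN: linear extrapolation from the last
    two frames. The real model would map 120 frames -> 120 future frames."""
    last, prev = frames[-1], frames[-2]
    return np.clip(2.0 * last - prev, 0.0, 1.0)

def playback(incoming):
    """Yield one frame per tick: the real frame if it arrived (None marks
    a dropped frame), otherwise the model's guess. Either way the sliding
    window is updated, so predictions can stack on predictions."""
    window = deque(maxlen=WINDOW)
    for real in incoming:
        if real is not None:
            window.append(real)
            yield real
        elif len(window) >= 2:
            guess = predict_next(window)
            window.append(guess)  # predict on top of predictions
            yield guess
```

The key design point is that the window never distinguishes real from predicted frames, which is what lets the player keep producing output through a gap in the feed.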
The cool thing about this is that you wouldn’t actually be watching a live person, you’d be watching the AI’s prediction of a live person, projected X frames into the future. This would effectively mean there is no lag, because the AI is good enough to predict what you’re going to say/do and is always adjusting based on what you actually said/did. We’re talking sub-second here, so it’s not predicting what you’re going to say; it’s predicting the changes in tone and pitch, and where your face can get to in the next second. That part is totally possible. The part I think is impossible is doing this fast enough to display the result. Maybe quantum computers.
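Back-of-envelope on that horizon: at 60 fps, a 120-frame prediction window covers 2 seconds of round-trip, so realistic network lag fits comfortably inside it. A throwaway helper (the names and numbers are mine, not from the comment):

```python
import math

def horizon_frames(rtt_ms, fps=60):
    """How many frames ahead the model must project to hide a given
    round-trip time, i.e. the lag the prediction has to paper over."""
    return math.ceil(rtt_ms / 1000 * fps)

horizon_frames(100)   # a typical cross-country RTT -> 6 frames ahead
horizon_frames(2000)  # even 2s of lag still fits a 120-frame horizon
```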
The version of this I liked was to reduce your face to vectors using OpenCV, then deepfake your own face back onto the vector skeleton at the other end. You’d need a chunk of data at the start to get things moving, but then absolutely minimal data to send the vector skeleton at a buttery smooth 120fps.
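The bandwidth claim checks out. Assuming a 68-point landmark model (the standard dlib-style face shape; the comment just says "vectors"), the skeleton stream is a few hundred kilobits per second even at 120 fps:

```python
import struct

NUM_LANDMARKS = 68  # dlib-style face shape; an assumption, not from the comment

def pack_skeleton(points):
    """Serialize one frame's landmarks as little-endian 16-bit pixel coords."""
    assert len(points) == NUM_LANDMARKS
    return struct.pack(f"<{NUM_LANDMARKS * 2}H", *(c for p in points for c in p))

frame_bytes = NUM_LANDMARKS * 2 * 2               # 272 bytes per frame
skeleton_kbps = frame_bytes * 120 * 8 / 1000      # ~261 kbit/s at 120 fps
raw_720p_mbps = 1280 * 720 * 3 * 120 * 8 / 1e6    # ~2654 Mbit/s uncompressed RGB
```

So the skeleton is roughly four orders of magnitude smaller than raw video, which is what makes "absolutely minimal data" plausible once the initial chunk for the deepfake model is across.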
Love it! Are you saying that this has been done? Sounds like something that could be implemented and sold to Zoom, Teams, or the highest bidder.
See my story on Zoom AI frame prediction stuff, BTW: