They probably also do some OCR on that and then let something other run over that to see if the text makes sense (basically letting another AI grade the output, commonly done to judge what’s a good dataset and what isn’t) and then just feed the ai again. Today you have a shortage of data since the internet is too small (yes I know it sounds crazy) so I wouldn’t wonder if they actually tried to use pictures and ocr to gather a bit more usable data
They probably also do some OCR on that and then let something other run over that to see if the text makes sense (basically letting another AI grade the output, commonly done to judge what’s a good dataset and what isn’t) and then just feed the ai again. Today you have a shortage of data since the internet is too small (yes I know it sounds crazy) so I wouldn’t wonder if they actually tried to use pictures and ocr to gather a bit more usable data