You shouldn't need a permission slip to read a webpage–whether you do it with your own eyes, or use software to help. AI is a category of general-purpose tools with myriad beneficial uses. Requiring developers to license the materials needed to create this technology threatens the development of...
You could always just do reverse search on the open dataset to see if it’s an exact copy (or over a threshold).
True, but “exact copy” almost certainly isn’t going to be what gets produced – and you can have a derivative work that isn’t an exact copy of the original, just generate something that looks a lot like part of the original. Like, you’d want to have a pretty good chance of finding a derivative work.
And that would mean that anyone who generates a model to would need to provide access their training corpus, which is gonna be huge – the models, which themselves are large, are a tiny fraction the size of the training set – and I’m sure that some people generating models aren’t gonna want to provide all of their training corpus.
You could always just do reverse search on the open dataset to see if it’s an exact copy (or over a threshold).
You MIGHT even be able to do that while masking the data using hashing.
True, but “exact copy” almost certainly isn’t going to be what gets produced – and you can have a derivative work that isn’t an exact copy of the original, just generate something that looks a lot like part of the original. Like, you’d want to have a pretty good chance of finding a derivative work.
And that would mean that anyone who generates a model to would need to provide access their training corpus, which is gonna be huge – the models, which themselves are large, are a tiny fraction the size of the training set – and I’m sure that some people generating models aren’t gonna want to provide all of their training corpus.
Minhash might be able to produce a similarity metric without needing exactness and without revealing the training data.