• tal@lemmy.today
    link
    fedilink
    English
    arrow-up
    1
    ·
    edit-2
    1 month ago

    You could always just do reverse search on the open dataset to see if it’s an exact copy (or over a threshold).

    True, but “exact copy” almost certainly isn’t going to be what gets produced – and you can have a derivative work that isn’t an exact copy of the original, just generate something that looks a lot like part of the original. Like, you’d want to have a pretty good chance of finding a derivative work.

    And that would mean that anyone who generates a model to would need to provide access their training corpus, which is gonna be huge – the models, which themselves are large, are a tiny fraction the size of the training set – and I’m sure that some people generating models aren’t gonna want to provide all of their training corpus.

    • artificialfish
      link
      fedilink
      English
      arrow-up
      1
      ·
      1 month ago

      Minhash might be able to produce a similarity metric without needing exactness and without revealing the training data.