SDKs/libs, especially open source SDKs, were never about gated knowledge. They were about the providing company making it as easy as possible for you to integrate. You would not need to know the idiosyncrasies behind API retries, paging, rate limits, auth flow, and on and on. Third-party developers needed a resource; they called a method and got it. Open source libraries especially are about pooling knowledge, not gating it. This is propaganda for pooling that knowledge inside a service you have to pay to use, and instead of developers all using and improving the same codebase together, they have to spend money to rewrite the same code repeatedly. This is AI companies further trying to undercut open source because it’s free.
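To make the "call a method and get it" point concrete, here is a rough sketch of the kind of plumbing an SDK typically hides. Everything in it is made up for illustration: `TransientError`, `call_with_retries`, `list_all`, and the `items` / `next_cursor` page shape are assumptions, not any real provider's API.

```python
import time
from typing import Any, Callable, Iterator, Optional


class TransientError(Exception):
    """Hypothetical error a transport raises for retryable failures (e.g. HTTP 429/503)."""


def call_with_retries(
    fetch_page: Callable[[Optional[str]], dict],
    cursor: Optional[str],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_page(cursor), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch_page(cursor)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("max_attempts must be at least 1")


def list_all(fetch_page: Callable[[Optional[str]], dict], **retry_opts: Any) -> Iterator[Any]:
    """Yield every item across all pages, following the assumed next_cursor field."""
    cursor = None
    while True:
        page = call_with_retries(fetch_page, cursor, **retry_opts)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if cursor is None:
            return
```

With something like this in the library, the third-party developer just writes `for item in list_all(fetch_page): ...` and never thinks about backoff or cursors, which is the whole point of shipping the SDK.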
Most corporations likely have zero data retention agreements with LLM providers, at least for API usage.
(Sure, you could be sceptical on whether the LLM provider is upholding that, but I personally do trust them. The trust betrayal if ZDR wasn't actually ZDR would be too great and commercially damaging for them to lie.)
> (Sure, you could be sceptical on whether the LLM provider is upholding that, but I personally do trust them. The trust betrayal if ZDR wasn't actually ZDR would be too great and commercially damaging for them to lie.)
Is actual ZDR verbiage in contracts more specific and limited in scope than what we see advertised publicly ("...except where needed to comply with law or combat misuse" in Anthropic's case)? Because those seem pretty damn vague and large enough holes to drive trucks through.
The problem is that there is an enormous, nearly unignorable incentive to work around it. So they will.
As the customer base becomes more and more corporate (which it will), they end up with disproportionately more customers whose experiences cannot be used to train the model to make it better for those customers.
Either way, corporate customers cannot leech off the training from consumers handing over their personal data forever; there aren't enough specialists in that training set to improve the models with no loss of corporate trust.
But those links are Googled after the model has started to answer; they are not links to the training data.
Imagine an artificial “librarian” that read all the books and spits out hallucinated quotes for you
But doesn’t let you enter the library, open a single book or even see the sources for those hallucinated quotes
But instead Googles some sources based on the hallucinations after generating them ;-)
It’s better than nothing, but you can Google them too, while the training data (the library) is completely hidden from you, even the public domain parts of it - zero attribution
There should be at least some correlation. When building the model they give more weight to some pages (e.g. Wikipedia) that are more trusted (PageRank?). And when they provide links in answers, the matches with a better PageRank for the query are listed first.
So if it sources something in Wikipedia, it is more likely to provide Wikipedia as a trusted source for it.
The problem is that when an answer is hallucinated, i.e. false, it may provide a source for it that contains the invalid info.
Now that any software/knowledge is copyable given sufficient cash and AIs, gating knowledge might be the only thing that protects your business. Otherwise you do not have a business.
Sadly it has been during most of human history. I think the establishment resents the masses becoming overeducated. The 1990s internet had a wealth of views and information on it. Now you can only access approved sources thanks to scaremongering.
Like, you are letting them data mine your business. Why are corporations not panicking over this?
Betrayal of their trust is inevitable.
Imagine Google search without any links or sources named
This is the “modern” AI chatbot:
It never mentions the training data it used, and in fact has no idea what it used (often FB, Reddit and partisan websites)
Update: I added the reply about the after-the-fact Googling that chatbots do - it’s different
Or at least some of the sites: if the same info is sourced from 100 pages then it only shows 2 or 3, maybe the ones with the biggest PageRanks.