With enterprises pouring billions of dollars into generative AI (genAI) initiatives, questions about future legal exposure are too often set aside.
The risks are practically endless. Although enterprises usually do extensive data fine-tuning before deploying large language models (LLMs), the massive underlying training corpus remains a black box. The major model makers — including OpenAI, Google, AWS, Anthropic, Meta, and Microsoft — provide no visibility into their training data. That includes how old or out-of-date it is, how reliable it is, what languages it was sourced from, and, critically, whether it violates privacy rules, copyright restrictions, trademarks, patents, or rules governing regulated sensitive data (healthcare records, financial data, PII, payment card details, security credentials, and so on).
Even when vendors provide source lists for the data used to train their models, those lists may offer little meaningful detail. For example, a source might be listed simply as "Visa transaction information." How old is it? Has it been verified? Has it been sufficiently sanitized for compliance?
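To make the sanitization question concrete, here is a minimal sketch of the kind of pre-fine-tuning compliance scan an enterprise might run on its own data before it ever reaches a model. The patterns and the `flag_sensitive` helper are hypothetical illustrations, not any vendor's actual pipeline; a production system would use a vetted PII-detection library and legal review rather than a handful of regexes.

```python
import re

# Hypothetical patterns for a pre-fine-tuning compliance scan.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_candidate": re.compile(r"\b(?:\d[ -]?){13,19}\b"),
}

def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum
    (used to filter card-number candidates from random digit runs)."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def flag_sensitive(record: str) -> list[str]:
    """Return the names of sensitive-data patterns found in a record."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(record):
            if name == "card_candidate":
                digits = re.sub(r"\D", "", match.group())
                if not luhn_valid(digits):
                    continue  # a digit run, but not a plausible card number
            hits.append(name)
            break  # one hit per pattern is enough to quarantine the record
    return hits

# Usage: quarantine any fine-tuning record that trips a pattern.
records = [
    "Customer paid with card 4111 1111 1111 1111 yesterday.",
    "Quarterly revenue grew 12% year over year.",
]
for r in records:
    hits = flag_sensitive(r)
    status = f"QUARANTINE ({', '.join(hits)})" if hits else "OK"
    print(f"{status}: {r}")
```

The point of the sketch is the asymmetry it exposes: an enterprise can run checks like this on data it controls, but it has no way to run them, or confirm that anyone did, on the opaque corpus the base model was trained on.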