AI, Security, and Value of Data
AI applications depend on data security and will change our approach to data security.
AI is all about data. AI research is quickly published, much development happens in the open-source space, computational resources are getting cheaper, and the gap in human expertise will eventually be overcome. Except for top research labs, it is no longer about the newest algorithms but more about the availability and quality of data that can be used in training or tuning specialized or private models. Data become the critical factor differentiating the effectiveness of solutions trying to solve specific problems with AI. Consequently, the value and importance of data are fundamentally changing, and we can expect they will only increase in the future. However, our current frameworks for data protection are not ready for AI on legal, technical, or ethical levels. The first actions were taken by news organizations and writers as they started to fight back against the use of their data for training big models. But the problem requires more attention, as the data-related challenges are much broader and apply to all regular users of AI applications, both individual and organizational.
Data have always been at the center of cybersecurity as the critical assets that must be protected. Core security properties apply to data: we talk about Confidentiality (data are not disclosed), Integrity (data are not tampered with), and Availability (data are there when needed). The threats around data have been evolving with technology, accelerating with the move from physical to digital data storage. Suddenly, data could be easily replicated (without a trace), removed, or modified (unless mitigations are in place). The next significant milestone was the migration of data online from isolated systems, which exposed them to threats no longer limited by geographical location. The data became available 24/7, and so did the attackers. Finally, the concepts of cloud and software-as-a-service resulted in most data being stored in third-party systems, which has benefits but also comes with shared responsibility for security, including data protection. These days, networked data flows are critical for all our activities, business operations, and interpersonal communication, which all consume and produce a lot of valuable data.
The value of digital data can be dramatically altered even without owners realizing it.
The value of digital data can be instrumental, e.g., granting a competitive advantage or directly leading to a financial gain, in both legal (e.g., building users' profiles for advertising) and illegal ways (see the value of personal health records). Digital data may also have intrinsic value that can be contextual or very subjective, e.g., recorded personal experiences or content with emotional value like photos (with ransomware attacks aimed at their availability). The value of data may not be obvious and can be difficult to estimate, as we can look at it only in the local context (missing others' points of view) or don't realize the value until something bad happens. Digital data can be fragile and lost without proper mitigations (e.g., backups). The value of digital data can also be dramatically altered even without owners realizing it, e.g., when details of a strategy or a project are obtained by competitors (valid in business, but also in a political context). For all these reasons, digital data need proper attention, management, and protection in individual or organizational scopes. And the related requirements are changing significantly with AI applications in the loop.
Value of Data and AI
AI applications are focused on automation and augmentation of our decisions and cognitive tasks and require even further migration of our workflows to the digital space. AI models need high-quality data during training, either used directly in the process or supporting it (e.g., feedback used in RLHF). AI solutions should also be expected to process a lot of data to deliver value when used, i.e., during inference. All the concerns related to Confidentiality, Integrity, and Availability of data (commonly referred to as CIA) still fully apply in AI scenarios, but there are also new ones related to the ability to learn. Training, fine-tuning, and customization of models are generally connected with extracting value from data and developing unique capabilities along the way. A successful model is expected to eventually deliver results of comparable quality (sometimes better, e.g., when data from multiple sources are combined), usually faster, with much less effort, and with reduced costs.
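To make the "extracting value from data" point more concrete, below is a minimal sketch of how internal know-how might be packaged as supervised fine-tuning records. The JSONL prompt/completion layout is a common convention, but the file name, fields, and example contents are purely illustrative and not tied to any specific provider.

```python
import json

# Illustrative internal Q&A pairs; in practice these might come from support
# tickets, design documents, or expert review notes.
internal_examples = [
    {
        "prompt": "How do we triage a suspected data-exfiltration alert?",
        "completion": "Isolate the affected account, preserve logs, notify the IR lead...",
    },
    {
        "prompt": "Which discount tiers apply to enterprise renewals?",
        "completion": "Tier A applies above 500 seats; exceptions require VP approval...",
    },
]

# Write the records as JSONL, a common layout for supervised fine-tuning.
# Every line encodes a piece of organizational know-how that the tuned model
# can later reproduce, with or without the original data owner involved.
with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for example in internal_examples:
        f.write(json.dumps(example) + "\n")
```

The format is not the point; what matters is what such records represent: once they leave the organization, the competency they encode can be reproduced elsewhere.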
Because of AI, the instrumental value of data significantly increases.
Through training AI, data can be used to recreate individual or organizational competencies without the otherwise required investments, domain-specific skills, and experience. It doesn't matter if the data were leaked, intentionally shared for training a model, or used with an external AI system without paying attention to the usage agreement or by selecting untrustworthy service providers. Irresponsible data sharing can have new consequences, often irrevocable, related to the potential loss of unique value built up through local history and experiences, including mistakes and the lessons learned from them. Previously, in the case of a data compromise, the results of work were at stake; now, it is about the skills and abilities to deliver similar results in the future. We can say that because of AI, the instrumental value of data significantly increases, and much more can go wrong if there is a security failure. And all that fully applies to individuals, organizations, and individuals in organizations.
Data that are commonly used in AI applications have unique characteristics that create challenges from the security point of view:
AI operates on complex data, usually unstructured (e.g., multimedia streams rather than database tables), which are more difficult and expensive to process and validate. That applies especially to the inputs and outputs of generative AI models, where core scenarios often include processing images, sound, and videos, working with text documents, or continuous interactions with users.
The nature of interactions with AI components is dynamically evolving. They started with request-like prompts (potentially replacing traditional search queries) and are quickly heading toward natural voice-based conversations, with visual avatars to follow soon. Interactions between humans and AI are not isolated but are becoming more like relationships with well-established contexts.
Many AI applications process sensitive data (financial, healthcare, biometric, etc.) in training and inference, with new challenges around their detection and classification. For example, a photo must be processed before sensitive elements can be identified (a strong requirement for PCI), and it is easier to include confidential data in interactions that feel natural.
Most current AI applications send data to remote servers, which may further interact with additional external data sources or AI components, making trust boundaries unclear and often hidden. Eventually, we can expect more solutions running locally (the AI PC concept), but detailed user data will still be shared externally when integration with more advanced systems is needed. A minimal sketch after this list shows one way to limit what crosses such a boundary.
Many data governance practices rely on anonymized data, removing sensitive fields, or using synthetic data. These practices reduce risks but may not fully address AI security concerns, as such data can still be useful for machine learning models (also a benefit!). In other words, anonymized or high-quality synthetic data may still be very valuable in the context of AI threats, even though their value is reduced for traditional attacks.
Data copyrights deserve a separate post; even though those are primarily legal and economic issues, they are also connected with data protection. The use of copyrighted materials is often considered necessary for training AI models. There are many creative arguments as to why copyrights should be disregarded in training, and at the same time, we have growing concerns about data laundering, which defines specific requirements for the security of data supply chains.
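Since several of these points come down to what actually leaves the local environment, here is the redaction sketch referenced above: a client-side filtering step applied before a prompt crosses the trust boundary. The regex patterns and the commented-out send_to_model() call are placeholders of my own; a real deployment would rely on dedicated PII/DLP classifiers rather than a handful of regular expressions.

```python
import re

# Rough patterns for a few common sensitive-data types; illustrative only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace detected sensitive values with type placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def safe_prompt(user_text: str) -> str:
    """Redact locally before the prompt ever crosses the trust boundary."""
    cleaned = redact(user_text)
    # return send_to_model(cleaned)  # placeholder for the actual remote AI call
    return cleaned

if __name__ == "__main__":
    print(safe_prompt("Contact jane.doe@example.com, card 4111 1111 1111 1111"))
```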
Organizational Context
Let's take a closer look at challenges related to data and AI, starting with organizations, as, perhaps surprisingly, that context should be a little easier than the individual one. Data protection is not a new problem for most organizations, as they usually have some structure and processes in place and awareness of their responsibilities (even if motivated only by applicable regulations). Organizations are focused on protecting their business value and mitigating traditional threats like data leaks or theft of intellectual property. Digital data flows are essential and closely integrated with their operations, which requires managing the risks of all external interactions and dependencies on third-party resources (like extensive use of the cloud). Integrations with AI components will require changes to that model, as it now needs to cover the risks related to sending data to AI models, controlling the use of obtained results, and managing the security of AI applications themselves, including vendors and supply chains. Again, the consequences of failure may be critical for an organization and its business value; just as an individual could be replaced with AI automation, so can a company when other entities, including competitors, gain the ability to deliver comparable results at a lower cost.
Models trained on local data should become the most protected assets for organizations.
Most attention is currently (and correctly) paid to integrations of organizations with external AI solutions. There are unclear rules, complex legal agreements, and all mitigations on the market (e.g., LLM firewalls) are new, untested, and often come with unrealistic promises (e.g., guaranteeing the security of code snippets from AI). With enterprise licenses, AI solutions are not supposed to store customer data or use them for training, and ownership of customized models should be clarified. Still, that means that data flows are crossing external trust boundaries, and conversation-based interactions are more likely to include sensitive data (even unintentionally), especially when we are not talking about well-defined sensitive data (including attempts to bypass established guardrails). A lot could be learned about organizational priorities based on their search queries and even more based on continuous daily interactions with AI systems (which should be captured for security purposes in most cases). Of course, there are also concerns for internal or hybrid AI solutions (a preferred option for high-risk or mission-critical applications), covering both inputs to such systems and the downstream use of results. It is easy to misuse new capabilities by working with inappropriate data or not paying enough attention to the quality of results and their impact (that can get very interesting when an AI system is exposed to external customers). We also need to remember that AI models can be stolen directly or through interactions with them. As a rule, models trained on local data should become the most protected assets for organizations throughout their lifecycle, including proper access control, continuous monitoring, and incident response. Eventually, well-trained models may become functional representations of unique organizational value, making them very attractive targets for external attacks and insider threats (a new internal attack surface). Balancing internal access to the most effective models will be one of the organizational and cultural challenges we still need to learn to manage.
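As a rough illustration of treating locally trained models as protected assets, the sketch below wraps an internal model with access checks and audit logging. The GuardedModel class, role names, and logging choices are hypothetical; a real deployment would integrate with the organization's identity provider, policy engine, and monitoring stack.

```python
import logging
from datetime import datetime, timezone

# Hypothetical allow-list of roles permitted to query the internal model.
AUTHORIZED_ROLES = {"finance-analyst", "ml-platform"}

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("model_audit")

class GuardedModel:
    """Wraps an internally trained model with access control and audit logging."""

    def __init__(self, model):
        self._model = model  # the protected asset

    def predict(self, caller_id: str, caller_role: str, prompt: str):
        timestamp = datetime.now(timezone.utc).isoformat()
        if caller_role not in AUTHORIZED_ROLES:
            audit_log.warning("DENIED %s (%s) at %s", caller_id, caller_role, timestamp)
            raise PermissionError("caller is not authorized to use this model")
        # Log enough for monitoring and incident response, but keep full prompts
        # out of logs in case they contain sensitive data themselves.
        audit_log.info("ALLOWED %s (%s) at %s, prompt_chars=%d",
                       caller_id, caller_role, timestamp, len(prompt))
        return self._model(prompt)

if __name__ == "__main__":
    guarded = GuardedModel(lambda p: f"echo: {p}")  # stand-in for a real model
    print(guarded.predict("u123", "finance-analyst", "Summarize Q3 revenue"))
```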
Individual Context
Data protection in AI scenarios may be even harder in the context of individual users. With the arrival of the Internet, we failed to protect individual privacy sufficiently, and we are still working to fix these problems. As a result, personal data are weakly protected, shared uncontrollably, commercially exchanged (e.g., problems with data brokers), and used in monetization scenarios, which are often legal but not always ethical. We are likely facing similar challenges with AI, except now there is even more at stake. Individual users are much more reliant than organizations on systemic solutions and regulations, which are, unsurprisingly, very limited, with minor exceptions like progress with the EU AI Act. At the same time, we can see a significant push for integrating AI with consumer products and individual workflows, fundamentally changing the ways of working with personal devices. Activities that until recently were assumed to be local and private suddenly become fully integrated with and dependent on remote systems. That is perfectly fine when such scenarios provide value for users, and they understand and accept the consequences of sharing data with AI. The problem begins when rules for sharing and processing individual data are unclear, misleading, or designed to benefit technology or service providers asymmetrically.
Improving an AI product means much more than finding and reporting a software bug.
Scenarios that AI can effectively support are much more complex than activities related to browsing habits or using social media. It is not only the final product of our work that could be shared: AI can be the most helpful in creative or problem-solving processes, including all intermediate steps. This type of individual insight from interactions with AI systems can be very useful for improving the quality of models, and we have already seen the impact on individual artists sharing their creations and discussing the need for Credit, Compensation, and Consent. In practice, individual data might already be used in training AI models, complex usage agreements don't help, and cases of deceptive design patterns are becoming increasingly common. That could be an excellent opportunity for improving communication around data usage, for example, with user-friendly nutrition labels for privacy, but such a change would have to be driven either by regulation or by customers. Instead, most advanced services collect user data (e.g., conversations) to improve products and services. It should be clear that improving an AI product means much more than finding and reporting a bug in traditional software. It can effectively mean contributing unique input and extending the capabilities of the AI model, which could later be used to produce results for different contexts or users, potentially affecting the value of the contributor's data or work. On top of that, we have service providers operating with limited transparency, lack of security maturity, or outright suspicious behavior, sometimes in scenarios that enable the collection of critical data about users. Unrestricted use of individual data can enable detailed and accurate individual profiles covering communication style, behavioral patterns, or emotional reactions, which would be very valuable and dangerous when used in malicious scenarios. Even in legitimate scenarios, serious discussions about the Right of Publicity (a.k.a. personality rights) exist in the context of deepfakes.
The success of AI applications depends on data security; at the same time, our experiences with AI will inevitably change our approach to data security. As the value of data is dynamically changing, it becomes essential that we have clear, fair, and usable rules in place for sharing data with AI, using results in specific contexts, transparency and accountability, acceptable business practices, and proper distribution of benefits. Solving these problems is necessary not only for broad commercial applications but also to implement some of AI's most exciting research opportunities, for example, in the healthcare space. The challenges related to data in AI will require progress in multiple domains, from legal and regulatory frameworks, through new economic models for data exchange, to addressing ethical considerations around individual contributions. Still, these new rules will eventually be translated into requirements, mechanisms, and practices for data protection, monitoring, and detecting misuse and abuse. That will require changes to existing methodologies, the development of new tools, and investments in focused threat modeling efforts (going beyond CIA) to help us understand the role and behavior of AI components and to ensure that the value of data is correctly identified, tracked, and protected. In the meantime, we should be careful about contributing to, or even using, AI systems in the scenarios we consider most important. We must avoid offers where costs are unclear, conditions sound too good to be true, or not all involved participants are verified as trustworthy. AI applications are still very new and not fully baked, and as we focus our attention on opportunities and benefits, we also need to prepare for potentially surprising consequences of security failures.
Updated on Feb 27th, 2024, with minor fixes based on received feedback.
Great read. What about who owns the output from general AI systems? If a developer uses Git AI to generate code (data), couldn't Git argue that because the dev used their AI models, which were trained on data they owned, the output is now owned by Git?
Also, who owns the copyright to photos that have been modified by AI?
Lots of questions around data and ownership.
Great points! I think the key issue is that AI needs data, and lots of it, to function effectively, and this is going to need a relook at how we manage data security. Attacks like data poisoning are only going to increase going forward, as will the privacy implications of storing so much information.