

NCSC TECHNICAL REPORT - 005
 

Volume 1/5

Library No. S-243,039

FOREWORD

This report is the first of five companion documents to the Trusted Database Management System
Interpretation of the Trusted Computer System Evaluation Criteria. The companion documents
address topics that are important to the design and development of secure database management
systems, and are written for database vendors, system designers, evaluators, and researchers. This
report addresses inference and aggregation issues in secure database management systems.

ACKNOWLEDGMENTS

The National Computer Security Center extends special recognition to the authors of this
document. The initial version was written by Victoria Ashby and Sushil Jajodia of the MITRE
Corporation. The final version was written by Gary Smith, Stan Wisseman, and David Wichers of
Arca Systems, Inc.

The documents in this series were produced under the guidance of Shawn P. O'Brien of the
National Security Agency, LouAnna Notargiacomo and Barbara Blaustein of the MITRE
Corporation, and David Wichers of Arca Systems, Inc.

We wish to thank the members of the information security community who enthusiastically gave
of their time and technical expertise in reviewing these documents and in providing valuable
comments and suggestions.

TABLE OF CONTENTS

1.0 INTRODUCTION

1.1 BACKGROUND AND PURPOSE

1.2 SCOPE

1.3 INTRODUCTION TO INFERENCE AND AGGREGATION

1.4 AUDIENCES OF THIS DOCUMENT

1.5 ORGANIZATION OF THIS DOCUMENT

2.0 BACKGROUND

2.1 INFERENCE AND AGGREGATION - DEFINED AND EXPLAINED

2.1.1 Inference

2.1.1.1 Inference Methods

2.1.1.2 Inference Information

2.1.1.3 Related Disciplines

2.1.2 Aggregation

2.1.2.1 Data Association Examples

2.1.2.2 Cardinal Aggregation Examples

2.2 SECURITY TERMINOLOGY

2.3 THE TRADITIONAL SECURITY ANALYSIS PARADIGM

2.4 THE DATABASE SECURITY ENGINEERING PROCESS

2.5 THE REQUIREMENTS FOR PROTECTING AGAINST INFERENCE
AND AGGREGATION

2.5.1 The Operational Requirement

2.5.2 The TCSEC Requirement

3.0 AN INFERENCE AND AGGREGATION FRAMEWORK

3.1 DETAILED EXAMPLE

3.2 THE FRAMEWORK

3.2.1 Information Protection Requirements and Resulting Vulnerabilities

3.2.2 Classification Rules

3.2.3 Vulnerabilities from Classification Rules

3.2.4 Database Design

3.2.5 Vulnerabilities from Database Design

3.2.6 Database Instantiation and Vulnerabilities

3.3 OVERVIEW OF RESEARCH

4.0 APPROACHES FOR INFERENCE CONTROL

4.1 FORMALIZATIONS

4.1.1 Database Partitioning

4.1.2 Set Theory

4.1.3 Classical Information Theory

4.1.4 Functional and Multivalued Dependencies

4.1.5 Sphere of Influence

4.1.6 The Fact Model

4.1.7 Inference Secure

4.1.8 An Inference Paradigm

4.1.9 Imprecise Inference Control

4.2 DATABASE DESIGN TECHNIQUES

4.2.1 Inference Prevention Methods

4.2.1.1 Appropriate Labeling of Data

4.2.1.2 Appropriate Labeling Integrity Constraints

4.2.2 Inference Detection Methods

4.3 RUN-TIME MECHANISMS

4.3.1 Query Response Modification

4.3.1.1 Constraint Processors

4.3.1.2 Database Inference Controller

4.3.1.3 Secondary Path Monitoring

4.3.1.4 MLS System Restrictions

4.3.2 Polyinstantiation

4.3.3 Auditing Facility

4.3.4 Snapshot Facility

4.4 DATABASE DESIGN TOOLS

4.4.1 DISSECT

4.4.2 AERIE

4.4.3 Security Constraint Design Tool

4.4.4 Hypersemantic Data Model and Language

5.0 APPROACHES FOR AGGREGATION

5.1 DATA ASSOCIATION

5.1.1 Formalizations

5.1.1.1 Lin's Work

5.1.1.2 Cuppens' Method

5.1.2 Techniques

5.2 CARDINAL AGGREGATION

5.2.1 Techniques

5.2.1.1 Single Level DBMS Techniques

5.2.1.1.1 Database Design and Use of DAC

5.2.1.1.2 Use of Views

5.2.1.1.3 Restricting Ad Hoc Queries

5.2.1.1.4 Use of Audit

5.2.1.1.5 Store all Data System-High

5.2.1.2 Multilevel DBMS Techniques

5.2.1.2.1 Use of Labeled Views

5.2.1.2.2 Start with All Data System-High

5.2.1.2.3 Make Some Data High

5.2.2 Mechanisms

5.2.2.1 Aggregation Constraints

5.2.2.2 Aggregation Detection

6.0 SUMMARY

REFERENCES

LIST OF FIGURES

2.1: TRADITIONAL SECURITY ANALYSIS PARADIGM

2.2: DATABASE SECURITY ENGINEERING PROCESS

3.1: ENTITIES AND ASSOCIATIONS

3.2: ATC ATTRIBUTES

3.3: INFERENCE AND AGGREGATION FRAMEWORK

3.4: ATC EXAMPLE DATABASE DESIGN

4.1: PARTITIONING OF THE DATABASE FOR A USER U

4.2: DATABASE WITHOUT AN INFERENCE PROBLEM

4.3: DATABASE WITH AN INFERENCE PROBLEM

4.4: QUERY PROCESSOR

4.5: DBIC SYSTEM CONFIGURATION

4.6: GENERAL-PURPOSE DATABASE MANAGEMENT SYSTEM MODEL

4.7: DATABASE MANAGEMENT SYSTEM MODEL AT THE
U. S. BUREAU OF THE CENSUS

5.1: ATC COMPANY DATA ASSOCIATION EXAMPLES

LIST OF TABLES

3.1: TAXONOMY OF PROTECTION REQUIREMENTS AND
CLASSIFICATION RULES

3.2: FORMAL DEFINITIONS

3.3: RESEARCH SUMMARY (PART 1)

3.3: RESEARCH SUMMARY (PART 2)

SECTION 1

INTRODUCTION

This document is the first volume in the series of companion documents to the Trusted Database
Management System Interpretation of the Trusted Computer System Evaluation Criteria [TDI 91;
DoD 85]. This document examines inference and aggregation issues in secure database
management systems and summarizes the research to date in these areas.

1.1 BACKGROUND AND PURPOSE

In 1991 the National Computer Security Center published the Trusted Database Management
System Interpretation (TDI) of the Trusted Computer System Evaluation Criteria (TCSEC). The
TDI, however, does not address many topics that are important to the design and development of
secure database management systems (DBMSs). These topics (such as inference, aggregation, and
database integrity) are being addressed by ongoing research and development. Since specific
techniques in these topic areas had not yet gained broad acceptance, the topics were considered
inappropriate for inclusion in the TDI.

The TDI is being supplemented by a series of companion documents to address these issues
specific to secure DBMSs. Each companion document focuses on one topic by describing the
problem, discussing the issues, and summarizing the research that has been done to date. The intent
of the series is to make it clear to DBMS vendors, system designers, evaluators, and researchers
what the issues are, the current approaches, their pros and cons, how they relate to a TCSEC/TDI
evaluation, and what specific areas require additional research. Although some guidance may be
presented, nothing contained within these documents should be interpreted as criteria.

These documents assume the reader understands basic DBMS concepts and relational database
terminology. A security background sufficient to use the TDI and TCSEC is also assumed;
however, fundamentals are discussed whenever a common understanding is important to the
discussion.

1.2 SCOPE

This document addresses inference and aggregation issues in secure DBMSs. It is the first of five
volumes in the series of TDI companion documents, which includes the following documents:

· Inference and Aggregation Issues in Secure Database Management Systems

· Entity and Referential Integrity Issues in Multilevel Secure Database Management
Systems [Entity 96]

· Polyinstantiation Issues in Multilevel Secure Database Management Systems [Poly 96]

· Auditing Issues in Secure Database Management Systems [Audit 96]

· Discretionary Access Control Issues in High Assurance Secure Database Management
Systems [DAC 96]

This series of documents uses terminology from the relational model to provide a common basis
for understanding the concepts presented. This does not mean that these concepts do not apply to
other database models and modeling paradigms.

1.3 INTRODUCTION TO INFERENCE AND AGGREGATION

Many definitions of inference and aggregation have been suggested in the literature. In fact, one of
the challenges to understanding inference and aggregation is that there are different (but sometimes
similar) notions of what they are. We will discuss the differences and similarities at length in later
sections. At this point, the following general definitions are sufficient:

Inference or the inference problem is that of users deducing (or inferring) higher level
information based upon lower, visible data [Morgenstern 88].

Aggregation or the aggregation problem is that of classifying and protecting collections
of data that have a higher security level than any of the elements that comprise the aggregate
[Denning 87].

Inference and aggregation problems are not new, nor are they only applicable to DBMSs.
However, the problems are exacerbated in multilevel secure (MLS) DBMSs that label and enforce
a mandatory access control (MAC) policy on DBMS objects [Denning 86a].

Inference and aggregation differ from other security problems since the leakage of information is
not a result of the penetration of a security mechanism but rather it is based on the very nature of
the information and the semantics of the application being automated [Morgenstern 88].

1.4 AUDIENCES OF THIS DOCUMENT

This document is targeted at four primary audiences: the security research community, database
application developers/system integrators, trusted product vendors, and product evaluators.
Inference and aggregation are problems based on the value and semantics of data for the part of the
enterprise being automated. Understanding inference and aggregation is important for all of the
audience categories, but the audience for whom this document is most relevant includes those
involved in designing and engineering automated information systems (AIS)--in particular, the
database application developers/system integrators. In general, this document is intended to
present a basis for understanding the issues surrounding inference and aggregation especially in
the context of MLS DBMSs. Implemented approaches and ongoing research are examined.
Members of the specific audiences should expect to get the following from this document:

Researcher

This document describes the basic issues associated with inference and aggregation. Important
research contributions are discussed as various topics are examined. By presenting current theory
and debate, this discussion will help the research community understand the scope of the issue and
highlight approaches and alternative solutions to inference and aggregation problems. For
additional relevant work, the researcher should consult two associated TDI companion documents:
Polyinstantiation Issues in Multilevel Secure Database Management Systems [Poly 96] and Entity
and Referential Integrity Issues in Multilevel Secure Database Management Systems [Entity 96].

Database Application Developer/System Integrator

This document highlights the need for analysis of the application-dependent semantics of an
application domain in order to understand the impact of inference and aggregation on MLS
applications. It describes the basic issues and current research into providing effective solutions.

Trusted Product Vendor

This document describes the types of mechanisms from the research literature that can be
incorporated into products to assist application developers with enforcing aggregation-based
security policies. Research on tools to support inference analysis provides the basis for additional
tools that might be provided with a DBMS product.

Evaluator

This document presents an understanding of inference and aggregation issues and how they relate
to the evaluation of a candidate MLS DBMS implementation.

1.5 ORGANIZATION OF THIS DOCUMENT

The organization of the remainder of this document is as follows:

· Section 2 discusses several first principles that form the foundation for reasoning about
inference and aggregation. Included in Section 2 are subsections that define and explain
many aspects of inference and aggregation, present relevant security terminology (e.g.,
comparing inference to direct disclosure), and describe two fundamental processes: the
traditional security analysis paradigm and the database security engineering process.

· Section 3 presents a framework for reasoning about inference and aggregation based on
the two processes described in Section 2. Section 3 also introduces a detailed example that
is used throughout the document to illustrate the material. Section 3 concludes with a
matrix that shows how each research effort discussed in Sections 4 and 5 relates to the
framework.

· Section 4 summarizes research efforts on inference.

· Section 5 summarizes research on aggregation.

· Section 6 summarizes the contents of this document.

SECTION 2

BACKGROUND

Inference and aggregation are different but related problems. In order to understand what they are,
what problems they embody, and their relationship, it is essential to revisit several first principles.
In Section 2.1, we give further descriptions and discuss inference and aggregation in some detail,
including examples. Some important terminology and security concepts are covered in Section 2.2.
The next two sections describe fundamental processes that form the basis for reasoning about
inference and aggregation: the traditional security analysis paradigm (Section 2.3) and the database
security engineering process (Section 2.4). The final section summarizes the operational and
TCSEC requirements for inference and aggregation.

Although much of what will be discussed in this section will be considered by some as intuitively
obvious, it is these principles that are often forgotten and lead to misunderstandings. Even
experienced researchers and practitioners should review this section.

2.1 INFERENCE AND AGGREGATION - DEFINED AND EXPLAINED

This section provides further definitions and explanations of inference and aggregation problems.
Many examples are given to illustrate the relationship and complexity of these problems.

2.1.1 Inference

Inference, the inferring of new information from known information, is an everyday activity of
most humans. In fact, inference is a topic of interest for several disciplines, each from its own
perspective. In the context of this document:

Inference addresses the problem of preventing authorized users of an AIS from inferring
unauthorized information.

Preventing unauthorized inferences is complicated both by the variety of ways in which humans
infer new information and by the wide variety of information they use to make inferences.

Some researchers characterize the inference problems in terms of inference channels [Morgenstern
88; Meadows 88a]. SRI's research identifies three types of inference channels based on the degree
to which HIGH data may be inferred from LOW data [Garvey 94]:

· Deductive Inference Channel. The most restrictive channel, requiring a formal deductive
proof (described by propositional or first-order logic) showing that HIGH data can be
derived from LOW data.

· Abductive Inference Channel. A less restrictive channel where a deductive proof could be
completed assuming certain LOW-level axioms.

· Probabilistic Inference Channel. This channel occurs when it is possible to determine that
the likelihood of an unauthorized inference is greater than an acceptable limit.
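
The difference between the deductive and abductive channels can be illustrated with a small sketch. It is not taken from [Garvey 94]; the fact names, the rules, and the functions `deductive_closure` and `inference_channels` are hypothetical and chosen only for illustration. Facts are treated as propositions and rules as simple "if all premises hold, then the conclusion holds" statements.

```python
# Illustrative sketch only: all fact names and rules below are hypothetical.

def deductive_closure(facts, rules):
    """Forward chaining: apply rules until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def inference_channels(low_facts, rules, high_facts, assumed_axioms=()):
    """Return the HIGH facts derivable from LOW facts plus any assumed LOW axioms.

    With no assumed axioms this models a deductive channel; with assumed
    axioms it models an abductive channel.
    """
    start = set(low_facts) | set(assumed_axioms)
    return deductive_closure(start, rules) & set(high_facts)


RULES = [
    ({"cargo_x_loaded_on_ship", "ship_sails_to_port_p"}, "cargo_x_arrives_at_port_p"),
    ({"cargo_x_arrives_at_port_p"}, "project_at_port_p_uses_cargo_x"),
]
HIGH_FACTS = {"project_at_port_p_uses_cargo_x"}

# Both premises are visible at LOW, so a deductive proof exists.
deductive = inference_channels(
    {"cargo_x_loaded_on_ship", "ship_sails_to_port_p"}, RULES, HIGH_FACTS
)

# Only one premise is visible at LOW, so no deductive proof exists...
incomplete = inference_channels({"cargo_x_loaded_on_ship"}, RULES, HIGH_FACTS)

# ...but the proof completes if the user assumes the missing LOW axiom.
abductive = inference_channels(
    {"cargo_x_loaded_on_ship"}, RULES, HIGH_FACTS,
    assumed_axioms={"ship_sails_to_port_p"},
)
```

In this sketch `deductive` and `abductive` both contain the HIGH fact, while `incomplete` is empty. A probabilistic channel would require likelihoods attached to facts and rules, which this sketch does not attempt to model.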

These descriptions of inference channels combine the method by which the inference is made and
the information used to make the inference. Both of these areas of complexity are discussed further
in the following paragraphs. Section 2.1.1.3 identifies related disciplines and how they address
inference.

2.1.1.1 Inference Methods

One of the factors that makes the inference problem so difficult is that there are many methods by
which a human can infer information. The following list [Ford 90] illustrates the variety of
inference strategies that can be used:

· Inference by Deductive Reasoning. New information is inferred using well-formed rules,
either through classical or non-classical logic deduction. Classical deductions are based on
basic rules of logic (e.g., from the assertions "A is true" and "if A then B," one deduces
"B is true"). Non-classical logic-based deduction includes probabilistic reasoning, fuzzy
reasoning, non-monotonic reasoning, default reasoning, temporal logic, dynamic logic, and
modal logic-based reasoning.

· Inference by Inductive Reasoning. Well-formed rules are utilized to infer hypotheses from
observed examples.

· Inference by Analogical Reasoning. Statements such as "X is like Y" are used to infer
properties of X given the properties of Y.

· Inference by Heuristic Reasoning. New information is deduced using various heuristics.
Heuristics are not well defined and may be a "rule of thumb."

· Inference by Semantic Association. The association between entities is inferred from the
knowledge of the entities themselves.

· Inferred Existence. One can infer the existence of entity Y from certain information (e.g.,
from the information, "John lives in Boston," one can infer that "there is some entity called
John").

· Statistical Inference. Information about an individual entity is inferred from various
statistics computed on a set of entities.
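
As an illustration of the last method in the list above, the following sketch shows one well-known form of statistical inference: recovering an individual value by differencing two totals. The employee names, salary figures, and function names are hypothetical, and the sketch assumes the user may only ask for totals over groups of employees.

```python
# Illustrative sketch only: the data and function names are hypothetical.

SALARIES = {"alice": 50000, "bob": 62000, "carol": 58000}


def total_salary(records, names):
    """The only query the user is allowed: a total over a group of employees."""
    return sum(records[name] for name in names)


def infer_individual_salary(records, group, target):
    """Infer one employee's salary from two permitted aggregate queries."""
    with_target = total_salary(records, group)
    without_target = total_salary(records, [n for n in group if n != target])
    return with_target - without_target


inferred = infer_individual_salary(SALARIES, ["alice", "bob", "carol"], "bob")
```

Neither total reveals an individual salary by itself, yet their difference is exactly the target's salary.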

2.1.1.2 Inference Information

The other factor that makes preventing unauthorized inferences within a DBMS a difficult problem
is the wide variety of information that can be used to make an inference. The following are some
types of information that can be used to make inferences:

· Schema metadata including relation and attribute names

· Other metadata such as constraints (e.g., value constraints or other constraints that may be
implemented through triggers or procedures)

· Data stored in the relations (i.e., the value of the data)

· Statistical data derived from the database

· Other derived data that can be determined from data in the database

· The existence or absence of data (as opposed to the value of the data)

· Data disappearing (e.g., upgraded) or changing value over time

· Semantics of the data that are not stored in the database but are known within the
application domain

· Specialized information about the application domain (both process and data) not stored
in the database

· Common knowledge, and

· Common sense

Denning [Denning 86a] and many other researchers adopt a so-called "closed-world assumption"
that limits inference to the database. With this limitation, both the information used to make the
inference and the inferred information must be represented in the database. The first seven types
of information described above encompass information that would be included using this closed-
world assumption. Unfortunately, in the real world in which a DBMS must operate, significant
amounts of the remaining types of information will be used by an adversary to attempt to infer
unauthorized information.

2.1.1.3 Related Disciplines

Inference in the context of this document focuses on the information contained in a DBMS and
preventing inference of unauthorized information by authorized users of the DBMS. Inference is
also a key issue in several other disciplines including:

· Operations Security (OPSEC). OPSEC is concerned with denying external adversaries the
ability to infer classified information by observing unclassified indicators. The key
difference is that the adversary is external to the organization, while this document
addresses individuals that are internal to the organization (i.e., DBMS users).

· Artificial Intelligence (AI). The ability to infer new facts from existing knowledge is a
positive goal of several of the sub-disciplines of AI. The AI research community is looking
for better ways to represent knowledge and approaches for inferring new information.

· Uncertainty Management. The inverse problem of preventing inferences is the need for
individuals, such as intelligence analysts, to infer information about an adversary based
upon evidence that can be collected. There are several approaches to evaluating evidence
in an effort to quantify one's confidence in the evidence and the resulting inferred
information [Schum 87].

2.1.2 Aggregation

Like the word inference, the word aggregation has multiple meanings in the information
technology community. An aggregation, or aggregate, is a collection of separate data items. In the
database context, data aggregation normally has a positive connotation. Aggregation is an
important and desirable side effect of organizing information into a database. Information has a
greater value when it is gathered together, made easily accessible, and structured in a way that
shows relationships between pieces of data [National Academy Press 83]. In some sense, every
database is an aggregation of data.

The term aggregation has a different, more common use in the commercial database management
community, where it usually refers to the use of aggregate operators. The five standard operations
are sum, average, count, maximum, and minimum. Others such as standard deviation or variance
may also be found in query languages [Ullman 88]. These operators are applied to columns of a
relation to obtain a single quantity. They allow conclusions to be drawn from the mass of data
present in a database.
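
The five standard aggregate operators can be illustrated with a short sketch using SQLite through Python's standard `sqlite3` module. The `employee` table, its contents, and the function name `aggregate_salaries` are hypothetical and serve only as an example.

```python
import sqlite3

# Illustrative sketch only: the table and rows are hypothetical.
ROWS = [("alice", 50000), ("bob", 62000), ("carol", 59000)]


def aggregate_salaries(rows):
    """Apply the five standard aggregate operators to the salary column."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
        conn.executemany("INSERT INTO employee VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT SUM(salary), AVG(salary), COUNT(*), MAX(salary), MIN(salary) "
            "FROM employee"
        )
        total, average, count, highest, lowest = cursor.fetchone()
    finally:
        conn.close()
    return {
        "sum": total,
        "average": average,
        "count": count,
        "maximum": highest,
        "minimum": lowest,
    }


summary = aggregate_salaries(ROWS)
```

Each operator reduces the salary column to a single quantity, which is the sense of "aggregation" used in the commercial database community.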

The aggregate operators in commercial DBMSs relate closely to the operations performed on
statistical databases. Statistical databases are used to answer queries dealing with sums, averages,
and medians, and use aggregation to hide the separate data items found in individual records. The
concern in a statistical database is for security in the sense of privacy, not classification. Security
in statistical databases has a large body of literature and will only be discussed briefly in this
document.

In the context of this document:

Aggregation involves the situation where a collection of data items is explicitly considered to
be classified at a higher level than the classification assigned to the individual data elements
[Lunt 89].

Denning has identified two classes of the aggregation problem [Denning 87]. We describe them
here in terms of entities and their attributes.

· Attribute Associations. Associations between instances of different data items, of which
there are two sub-classes:

    Associations between Entities. Associations between instances of attributes of different
    entities.

    Associations between Attributes of an Entity. Associations between instances of different
    attributes of an entity.

· Size-based Record Associations. Associations among multiple instances of the same entity.

We will use the current terminology to represent these classes of aggregation: Data Association
[Jajodia 92] for Attribute Associations (both Inter- and Intra-Entity) and Cardinal Aggregation
[Hinke 88] for Size-based Record Associations.

The difference between data association and cardinal aggregation is important. Data association
refers to aggregates of instances of different data objects; cardinal aggregation refers to aggregates
of instances of the same type of entity. Distinguishing between data association and cardinal
aggregation as two types of aggregation is also important since data association can be effectively
controlled using database design techniques with labeled data [Denning 87]. On the other hand,
effective mechanisms to enforce cardinal aggregation policies are still a research issue.

2.1.2.1 Data Association Examples

A traditional example of intra-entity data association involves salary: the identity of an employee
and the actual number that represents the employee's salary may both be unclassified, but the
association between a particular employee and the employee's salary is considered sensitive. This
is an instance of intra-entity data association as defined above: employee name (or other
identifying attributes such as employee number) and salary are both attributes of the employee
entity.
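
The salary example can be expressed as a small labeling sketch. It assumes a two-level policy (LOW and HIGH) and a hypothetical rule table in which an association is HIGH whenever all of its attributes appear together in a result; the attribute names and the function `label_of_projection` are illustrative and not drawn from any particular DBMS.

```python
# Illustrative sketch only: levels, attributes, and rules are hypothetical.

LEVELS = {"LOW": 0, "HIGH": 1}

# Each attribute is unclassified when viewed by itself.
ATTRIBUTE_LABELS = {"name": "LOW", "employee_number": "LOW", "salary": "LOW"}

# An association is sensitive when all of its attributes are returned together.
ASSOCIATION_LABELS = [
    (frozenset({"name", "salary"}), "HIGH"),
    (frozenset({"employee_number", "salary"}), "HIGH"),
]


def label_of_projection(attributes):
    """Return the label of a result containing the given attributes."""
    requested = set(attributes)
    labels = [ATTRIBUTE_LABELS[a] for a in requested]
    labels += [
        label for attrs, label in ASSOCIATION_LABELS if attrs <= requested
    ]
    return max(labels, key=LEVELS.__getitem__, default="LOW")
```

Under these assumptions, `label_of_projection(["name"])` and `label_of_projection(["salary"])` are both LOW, while `label_of_projection(["name", "salary"])` is HIGH because the result reveals the association.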

An example of inter-entity data association is the following: the identities of commercial companies
and government agencies are unclassified, but the association of a particular company as the
contractor for a particular government agency may be considered classified. In this case, the data
association is between two entities, commercial company and government agency.

2.1.2.2 Cardinal Aggregation Examples

The traditional example of cardinal aggregation is the intelligence agency telephone book:
individual entries are unclassified but the total book is classified [Lunt 89]. The challenge with this
example is understanding what information the agency is trying to protect with the policy. Is it the
name and number of staff personnel (and therefore capability) of each (or some) organizational
element? The total staffing size of the agency? Identification of special employees? Special skills?
A clear understanding of what information needs to be protected must precede a cardinal
aggregation policy.
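
To make the difficulty concrete, the following sketch shows a naive count-based monitor for a telephone-book style policy. It is only one conceivable reading of such a policy: the threshold value, the per-user tracking, and the class name `CardinalAggregationMonitor` are assumptions made for illustration.

```python
# Illustrative sketch only: the threshold and per-user tracking are assumptions.

class CardinalAggregationMonitor:
    """Refuse a request once a user would hold more than `threshold` distinct entries."""

    def __init__(self, threshold):
        self.threshold = threshold
        self._released = {}  # user -> set of entry identifiers already released

    def request(self, user, entry_ids):
        """Return True and record the release if the user stays within the threshold."""
        already = self._released.setdefault(user, set())
        proposed = already | set(entry_ids)
        if len(proposed) > self.threshold:
            return False
        self._released[user] = proposed
        return True


monitor = CardinalAggregationMonitor(threshold=3)
first = monitor.request("user_a", [1, 2])      # 2 distinct entries held
second = monitor.request("user_a", [2, 3])     # 3 distinct entries held
third = monitor.request("user_a", [4])         # would exceed the threshold
other = monitor.request("user_b", [4])         # a different user starts at zero
```

Even this simple sketch exposes the open questions: the threshold is arbitrary unless the information being protected is clearly understood, and two users who pool their results defeat per-user counting.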

Another example of a cardinal aggregation policy relates to an Army troop list, that is, a list of
Army organizations [Smith 91]. Information about each Army organization is unclassified (even
if the organization's mission is classified), but the list as a whole is classified. Determining
how many organizations can be aggregated while the result remains unclassified is
difficult. If the intent is to protect the Army's war fighting capability, then aggregating the combat
arms units (as opposed to the support and civilian-based organizations) will directly disclose that
information. Of course, one has to know about the war fighting capabilities of each type of combat
arms unit, which is unclassified information and can be obtained externally to the troop list.
Unfortunately, there may be other ways to infer the unauthorized information. For example, if one
can determine the tactical communications capability (i.e., aggregation of communications units),
can one then infer the war fighting capability? And what about any number of other aggregates of
units where there is a reasonable understanding of their relationship to war fighting capability? It
quickly becomes a difficult problem to describe, let alone to provide mechanisms to solve this type
of aggregation problem. Yet keeping information about individual units unclassified is essential
for operational efficiency.

A final example is network management data: information about outages of individual circuits is
not sensitive, but the aggregation of information about all circuit outages for a network is sensitive.
In this case, one can postulate that this cardinal aggregation policy is attempting to protect the
identification of vulnerabilities of the network from an adversary that might want to disrupt or
render the network inoperative. With information about only one or two circuit outages, an
adversary cannot mount an effective attack to bring the network down. On the other hand, if the
adversary knows all the circuit outages, the network may have a temporary single point of failure,
a vulnerability that the adversary can exploit. It is operationally impractical to treat information
about each circuit as sensitive, yet in the face of a credible threat, it is important to protect the
aggregate information.

All of these example cardinal aggregation policies were established as a trade-off between the need
to protect sensitive information versus the need for operational efficiency. To some researchers,
cardinal aggregation policies are considered fundamentally unsound [Meadows 91]. Thus, it
should not be surprising that effective mechanisms to implement what some consider an unsound
policy elude the community.

2.2 SECURITY TERMINOLOGY

This section discusses several security-relevant terms and concepts. We distinguish between two
ways in which information can be revealed:

· Direct Disclosure. Classified information is directly revealed to unauthorized individuals.

· Inference. Classified information may be deduced by unauthorized individuals with some
level of confidence, often less than 100%.

The primary distinction, then, between direct disclosure and inference is whether the information
is revealed (direct disclosure) or must be deduced (inference). For example, if a user at LOW can
execute a query to a DBMS and receive a response where the data displayed to the LOW user is
actually classified HIGH (i.e., based on an explicit classification rule), then this is a direct
disclosure problem not an inference problem. In most instances of inference, the adversary has
some uncertainty with respect to the accuracy of the inference of the unauthorized information.

For the purposes of this document, we do not distinguish between where the HIGH data that is
directly disclosed or inferred is located (i.e., in the database or not). The important consideration
is that HIGH data was directly disclosed or inferred. This consideration is independent of whether
the HIGH data is actually stored in the database or it exists somewhere else in the application
domain of discourse.

Data association policies are stated to prevent direct disclosure of classified information related to
an association between data elements. In most cases, cardinal aggregation policies are stated to
prevent inferences that could be made from the aggregate data.

The discussion above refers generically to "unauthorized" individuals. In fact, we must distinguish
between two types of unauthorized access to information:

· based on authorized access to classified information (i.e., a label-based policy); and

· based on informal "need-to-know" policies.

"Need-to-know" is a well-known concept and is defined in the DoD Information Security
Regulation [5100.1-R 86] as:

· "A determination made by a possessor of classified information that a prospective
recipient, in the interest of national security, has a requirement for access to, or knowledge,
or possession of the classified information in order to accomplish lawful and authorized
government purposes."

There are two distinct types of need-to-know: "informal" (sometimes called discretionary), and
"formal" need-to-know. With informal need-to-know, individuals may have the clearance (i.e., are
considered trustworthy to properly protect the information) but not the job-related need to actually
access the information. Informal need-to-know is enforced in the people and paper world by the
individual who physically controls access to a classified document. In an AIS, informal need-to-
know is normally enforced by discretionary access control (DAC) mechanisms as described in the
TCSEC. Formal need-to-know is enforced through compartments (or other release markings) to
which individuals are formally granted access. Mandatory access control (MAC) categories are
used in an AIS to enforce access based on formal need-to-know.
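
The way MAC categories enforce formal need-to-know can be sketched with the conventional dominance check used in label-based policies: a subject may read an object only if the subject's hierarchical level is at least the object's level and the subject holds every category on the object. The level names, the category names, and the `Label` and `dominates` definitions below are illustrative assumptions, not a description of any evaluated product.

```python
from dataclasses import dataclass

# Illustrative sketch only: level and category names are hypothetical.
LEVEL_ORDER = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP SECRET": 3}


@dataclass(frozen=True)
class Label:
    level: str
    categories: frozenset = frozenset()


def dominates(subject, obj):
    """True if the subject label dominates the object label."""
    return (
        LEVEL_ORDER[subject.level] >= LEVEL_ORDER[obj.level]
        and obj.categories <= subject.categories
    )


analyst = Label("TOP SECRET")
compartmented_analyst = Label("SECRET", frozenset({"PROJECT_A"}))
document = Label("SECRET", frozenset({"PROJECT_A"}))

cleared_but_no_compartment = dominates(analyst, document)
cleared_with_compartment = dominates(compartmented_analyst, document)
```

Here the TOP SECRET analyst is refused because the PROJECT_A category is missing, while the SECRET analyst who has been formally granted PROJECT_A access is allowed. Informal need-to-know, by contrast, would be handled by DAC and is not modeled in this sketch.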

Direct disclosure is normally only thought of as being a problem when classified information is
revealed to an individual without the appropriate clearance, including formal access to
compartments (i.e., formal need-to-know). Given that U.S. Government information is classified
based on laws or executive orders, direct disclosure of classified information is punishable by
imprisonment. On the other hand, disclosing information in violation of an informal need-to-know
policy is normally not considered a compromise of classified information, since the individual has
the appropriate clearance to see the information and is trusted to properly protect it. Thus, a
violation of an informal need-to-know policy may result in an employee being fired or sued, but
not imprisoned.

Although an authorized user can attempt to infer information for which the user does not have an
informal need-to-know, research efforts to date in inference and aggregation have dealt exclusively
with direct disclosure or inference based on compromise of classified information, not on
violations of informal need-to-know policies. As such, this document focuses almost exclusively
on protecting classified information against inference and aggregation threats from individuals
who are not cleared to receive this information.

2.3 THE TRADITIONAL SECURITY ANALYSIS PARADIGM

To better understand inference and aggregation and how they relate to each other, they need to be
discussed in the context of the traditional security analysis paradigm shown in Figure 2.1. A brief
description of the activities in the paradigm follows:

· Identify information protection requirements: The protection requirements traditionally
addressed by the security community relate to confidentiality. However, information
protection requirements for integrity and availability should also be specified. Although
the inference and aggregation problems strictly address confidentiality protection
requirements, solutions to these problems typically adversely impact the ability to meet the
integrity and availability requirements.

· Identify vulnerabilities: Given a set of protection requirements, one must identify the
vulnerabilities that exist which could be exploited to prevent those requirements from
being satisfied by the system. As the next section shows, vulnerabilities need to be
assessed continuously throughout the design and engineering process.

· Identify threats: Given identified vulnerabilities, one must identify credible threat agents
that can exploit those vulnerabilities. For inference and aggregation, the primary threat is
the insider: an authorized user of the system who is authorized to access some, but not all,
information in the system. Malicious code is a secondary threat.

· Conduct a risk analysis: The risk analysis must take into account the likelihood of a threat
agent actually exploiting a vulnerability. The risk analysis considers tradeoffs between
effectiveness, cost, and the ability to meet other requirements (e.g., integrity and
availability) to determine appropriate controls.

· Select controls: Based on the risk analysis, controls are selected to eliminate or reduce a
threat agent's ability to exploit the vulnerability. Note that some controls are selected to
prevent while others are selected to reduce or detect a threat agent's ability to exploit the
vulnerability. This distinction is important in that most cases dealing with inference
involve determining during the risk analysis the likelihood of an individual inferring
unauthorized information given a particular set of controls. The goal is to establish
controls that provide an acceptable level of risk at an affordable cost, where cost is cast in
terms of both money and adverse impact on functionality, ease of use, and performance.

Figure 2.1: Traditional Security Analysis Paradigm

Details of how this process is related to inference and aggregation are discussed in the next section.

2.4 THE DATABASE SECURITY ENGINEERING PROCESS

The process of implementing an AIS with a database that meets the specified security requirements
is shown in Figure 2.2. The process should begin with a clear understanding of what information
needs to be protected, in this case from unauthorized disclosure. Without a clear
understanding of what has to be protected, secure database design is difficult.
Unfortunately, many inference and aggregation problems are stated at the database design level
(e.g., with sensitivity labels on tables) without a description of what information is to be protected
from disclosure.

Given a precise statement of the information to protect, there are two distinct aspects of
establishing rules for implementing those confidentiality protection requirements:

· classification rules must be established that specify the confidentiality properties of the
information, and

· access control rules must be established that specify which authorized users can access
information based on the classification rules.

Figure 2.2: Database Security Engineering Process

Classification rules should address all information that needs to be protected, not just information
that must be protected for national security reasons (i.e., Confidential, Secret, and Top Secret)
based on Executive Order 12356. For example, the Computer Security Act of 1987 requires that
unclassified but sensitive information have confidentiality protection. Categories of unclassified
but sensitive information include, but are not limited to, personnel (i.e., subject to the Privacy Act
of 1974), financial, acquisition contracting, and investigations. Certain other information is
classified under the Atomic Energy Act of 1954 as Restricted Data or Formerly Restricted Data.

Confidentiality protection requirements are also applicable to non-government organizations. For
example, proprietary information of many types may be considered "bet your company" data and
needs to be protected to ensure the viability of the company.

Given a set of classification rules, the database designer must provide a database design which
labels data objects and correctly implements the classification rules. The database instantiation
represents the database as it is populated with data and used.

Access control rules are generically described as MAC for access based on the military
classification system (including formal need-to-know) and DAC for access based on informal
need-to-know. However, to effectively implement the protection requirements, one must explicitly
identify the application-dependent information which identifies which authorized users, or classes
of users (e.g., roles), are authorized access to each type of information identified in the
classification rules.

The access control rules must be incorporated into the application system design. Although the
access control rules can be implemented through mechanisms in the operating system or
application software, inference and aggregation concerns are primarily addressed by DBMS access
controls.

Classification rules are normally found in a Classification Guideline as required by the DoD
Information Security Program Regulation [DoD 5200.1-R]. Generic access control rules are
documented in the system security policy as required by the TCSEC. The instantiation of MAC
and DAC with application-dependent access control rules that identify specific users and data is
usually documented in a Security Concept of Operations.

The classification rules for an application domain need to be complete. That is, all information to
be contained in the support information system should be explicitly addressed.

2.5 THE REQUIREMENTS FOR PROTECTING AGAINST INFERENCE AND
AGGREGATION

In this section, we identify both operational and TCSEC requirements related to protecting against
inference and aggregation. Operational requirements, identified in Section 2.5.1, include not only
the need to protect against inference and aggregation but also operational considerations, or
tradeoffs, that impact decisions on how, or to what extent, inference and aggregation protections
are implemented. Section 2.5.2 discusses the lack of direct TCSEC requirements for protecting
against inference and aggregation.

2.5.1 The Operational Requirement

There are at least four operational requirements related to inference and aggregation protection.

1. The enterprise's information must be properly protected from disclosure either directly or
through inference.

2. Employees (i.e., users) must have access to the information they need to do their jobs.

3. The enterprise must conduct its operations in an efficient and cost-effective manner.

4. The enterprise must interact efficiently with external organizations, which may involve
mostly unclassified information.

Unfortunately, these operational requirements often are in conflict. Over-classifying data to meet
the first requirement may reduce availability of information to certain employees (conflicts with
second requirement). Cost-effective operations (the third requirement) may demand granting the
lowest clearances possible to employees, which may conflict with the established classification
rules to protect information (first requirement). Finally, the need to protect information (first
requirement) may conflict with the efficient interface with other organizations (fourth
requirement). This last conflict is the genesis of most cardinal aggregation policies.

2.5.2 The TCSEC Requirement

There are no requirements in the TCSEC, TDI, or Trusted Network Interpretation (TNI) [TNI 87]
that directly address inference and aggregation issues. Inference and aggregation are properties
intrinsic to data values and to the structure of a database (schema, metadata). Inference and
aggregation are based on an interpretation of the enterprise's security policy and on specific
requirements of the application domain being automated. Thus, inference and aggregation are
best addressed by the system developer, acquisition agency, and certification and accreditation
authorities for each system and should not be looked at during a TCSEC/TDI evaluation.

Since inference is often characterized as an inference channel, there is often some confusion as to
the relationship between inference and covert channels. Inference and aggregation are possible
because of the existence of a set of specific known data values and a set of logical implication rules.
Knowledgeable database administrators or information security officers may be able to apply
inferential engines to the analysis of a populated database to uncover a set of potential unauthorized
inferences which they can then protect.
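
The idea of applying an inferential engine to known data values and implication rules can be sketched as follows. This is a minimal forward-chaining illustration, not a description of any particular tool; the facts, the single rule, and the function names are all hypothetical.

```python
def derive(facts, rules):
    """Forward-chain over implication rules until no new fact is produced.

    Each rule is a pair (premises, conclusion), where premises is a set of
    facts that together imply the conclusion.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


def unauthorized_inferences(low_facts, rules, protected):
    """Return the protected facts that follow from LOW facts alone."""
    derived = derive(low_facts, rules)
    return (derived & set(protected)) - set(low_facts)


# Hypothetical data values visible at LOW.
low_facts = {
    "E1 works on P1",
    "E1 holds a degree in laser physics",
}
# Hypothetical domain knowledge expressed as an implication rule.
rules = [
    (frozenset({"E1 works on P1", "E1 holds a degree in laser physics"}),
     "P1 uses laser technology"),
]
# Hypothetical fact that the protection requirements place at HIGH.
protected = {"P1 uses laser technology"}

print(unauthorized_inferences(low_facts, rules, protected))
```

Note that the protected fact never has to be stored in the database for the sketch to report it; the result depends only on the LOW facts and the rules supplied. The sketch also finds only those inferences whose rules have been written down, which is why knowledge held outside the system remains a limiting factor.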

However, new inferences may be introduced through error on the part of users authorized to update
the database. It is important to note that the introduction of inference or aggregation is a data value-
based problem (including metadata that represents the database design), and not one of DBMS
(i.e., the product) design or implementation. Unlike Trojan horses or covert channels, values
needed to support an inferential attack are introduced consciously by benign actions of users, and
not clandestinely by malicious implanted code or by flaws in a DBMS trusted computing base
(TCB) design or implementation. In this sense, inference problems are different from covert
channel issues since all the data used to make the inference are at LOW, and there is no need for
an active agent at HIGH to signal information from HIGH to LOW [Jajodia 92]. Note also that
inference will still produce the same result, independent of whether the compromised (inferred)
data remains in (or ever existed in) the database at HIGH, as opposed to the need for HIGH data
to be present in order for it to be signaled to LOW by a covert channel.

SECTION 3

AN INFERENCE AND AGGREGATION FRAMEWORK

The purpose of this section is to present a framework to reason about inference and aggregation.
Section 3.1 begins with a description of a detailed example application domain that will be used to
illustrate parts of the framework. Section 3.2 describes the framework using instances from the
detailed example for illustration. Section 3.3 presents a matrix which summarizes how the research
efforts described in Sections 4 and 5 relate to the framework.

3.1 DETAILED EXAMPLE

This section introduces a detailed example used to illustrate the concepts presented in this
document. We use a mythical company, the Acme Technology Company or ATC. ATC contracts
with many government agencies and commercial companies to provide technology-oriented
products and services. ATC also conducts extensive applied research for its clients in several areas
of technology. ATC needs extensive support for its business operations by AISs that must represent
information about:

· entities of interest in the application domain, and

· functional or business processes used in the application domain.

Although there are other security considerations that should be addressed in designing functional
processes, for the purpose of this document, we will only address the process view as it relates to
access control.

We are most interested in understanding and modeling the information about entities. In
particular, we are concerned with identifying the entities, associations, and attributes of interest to
ATC.

Entities (real objects or concepts of interest):

· Employees: ATC employs a wide variety of personnel from clerks to very specialized
engineers.

· Education: Data about the education of each ATC employee is retained.

· Research Projects: Although ATC has thousands of projects that cover a wide variety of
purposes, research projects represent a special category of projects, especially from a
security standpoint.

· Critical Technologies: Certain technologies are particularly sensitive.

· Clients: Information about the organization that sponsors a project.

Associations (relationships between instances of the entities). Figure 3.1 shows the following
associations (in italics) between entities:

· Employees work-on Projects.

· Employees have Education.

· A Client sponsors Projects.

· Some Projects use Critical Technologies.

Figure 3.1: Entities and Associations

Figure 3.2: ATC Attributes

Attributes (characteristics or properties of an entity). The addition of attributes adds more detail
and, therefore, realistic complexity to the example. Figure 3.2 shows selected attributes for each
ATC entity.

3.2 THE FRAMEWORK

The framework, shown in Figure 3.3, is an extension to the database security engineering process
described in Section 2.4. The framework provides the context to discuss how inference and
aggregation are applicable at each step in the development process. Given that inference and
aggregation are inherently application-dependent, examples, based on the ATC example in the
previous section, are given to illustrate the concepts. This framework with examples provides a
basis to illustrate and compare the proposed approaches described in Sections 4 and 5.

3.2.1 Information Protection Requirements and Resulting Vulnerabilities

There are at least three types of information protection requirements: confidentiality, integrity, and
availability. Inference and aggregation relate exclusively to the confidentiality objective of
security. Therefore, although integrity and availability protection requirements are also important,
this document will only address confidentiality requirements.

Given a set of information confidentiality protection requirements, there are two vulnerabilities
that must be addressed by appropriate controls: direct disclosure and inference. Both of these
vulnerabilities were discussed in Section 2.2. Aggregation is not yet a problem, because the
protection requirements address specific facts or information that must be protected as opposed to
a protection requirement for an aggregate of data.

3.2.2 Classification Rules

Classification rules for information are stated such that direct disclosure will be prevented, and the
ability to infer unauthorized information is reduced to an acceptable level. Preventing direct
disclosure is fairly straightforward: state classification rules for entities, attributes, and/or
associations and then restrict access based on user clearance and data classification. Unfortunately,
in practice, protection requirements can be so complex that ensuring that a set of classification rules
is correct (i.e., consistent and complete) can be difficult.

Each of the following types of classification rules from Figure 3.3 is described in the following
paragraphs:

· entity;

· attribute;

· data association; and

· cardinal aggregation.

Figure 3.3: Inference and Aggregation Framework

Table 3.1 summarizes this section by showing a taxonomy of types of confidentiality protection
requirements along with examples of protection requirements and supporting classification rules
for the ATC example. The ATC classification rules from Table 3.1 are used to illustrate the types
of classification rules described in this section. The types of protection requirements are a synthesis
of those found in papers by Smith, Pernul, and Burns [Smith 90a; Pernul 93; Burns 92] and are
meant to be illustrative of the primary types of protection requirements but not necessarily an
exhaustive taxonomy.

Type  Protection          Example Protection Requirement              Example Classification Rules
----  ------------------  ------------------------------------------  ------------------------------------------

Entity Classification

1a    existence of all    · the fact that ATC is involved with        · the existence of all instances of the
      instances of an       Critical Technologies must be protected     Critical Technologies entity is
      entity                at LOW                                      classified LOW

1b    existence of        · the fact that a Research Project          · the existence of Research Projects that
      selected instances    involves Critical Technologies must be      use Critical Technologies is classified
      of an entity          protected at HIGH                           HIGH
                                                                      · the existence of instances of all other
                                                                        entities is UNCLASSIFIED

2a    identity of all     · the identity of Critical Technologies     · the value of all instances (i.e., all
      instances of an       must be protected at HIGH                   attributes) of the Critical Technologies
      entity              · the identity of Clients of Research         entity is classified HIGH
                            Projects must be protected at LOW (this   · the value of all instances (i.e., all
                            is a requirement of the client              attributes) of the Client entity is
                            organization)                               classified LOW

2b    identity of         · selected Research Projects will be        · all attributes of the Research Project
      selected instances    protected at LOW based on the contents      entity are classified LOW when the
      of an entity          of Project Description as determined by     Description attribute contains LOW
                            the classification authority                information
                                                                      · the identity of instances of all other
                                                                        entities is UNCLASSIFIED

Attribute Classification

3     existence of an     · the fact that specific Target Platforms   · the existence of the Target Platform
      attribute (could      are identified for Research Projects is     attribute of Research Project is
      also be of            to be protected at HIGH                     classified HIGH
      selected instances                                              · all instances of the Target Platform
      of an attribute)                                                  attribute are classified HIGH
                                                                      · the existence of instances of all other
                                                                        attributes is UNCLASSIFIED except
                                                                        privacy information

4     value of an         · privacy information for employees         · the existence of the privacy information
      attribute             (i.e., home address and phone number)       attributes of employees is classified
                            is to be protected at LOW                   LOW

Data Association

5     the association     · the association between a specific        · the association between the Project
      between two           project and the sponsoring client must      entity and the Client entity is HIGH
      entities or two       be protected at HIGH                      · the association between all other
      attributes of                                                     entities is UNCLASSIFIED
      different entities

6     the association     · the association between name and          · the association between the Salary
      between two           salary is to be protected at LOW            attribute of the Employee entity and
      attributes of the                                                 Employee identifying attributes (name,
      same entity                                                       number) is LOW

Cardinal Aggregation

7     a collection of     · the identity of Critical Technologies     · the education entity and all its
      instances of the      must be protected at HIGH                   attributes for all employees are
      same entity                                                       UNCLASSIFIED
                                                                      · the collection of the Degree and Major
                                                                        attributes for all employees is LOW

Table 3.1: Taxonomy of Protection Requirements and Classification Rules

Entity classification rules can address the need to protect both the identity as well as the existence
of instances of entities of interest. In some operational environments, the fact that an entity (or
some selected instances of an entity) exists would disclose, directly or through an inference,
classified information. In many other operational environments, knowing that something classified
exists is not a problem; but you must not be able to know the classified information (i.e., its identity
or value). A good example is a compendium of documents including some that are classified. The
reader can know that the documents exist without having the clearance to read them. Entity
classification rules from the ATC example include:

· The existence of all instances of the Critical Technologies entity is classified LOW.

· The values of all instances (i.e., all attributes) of the Critical Technologies entity are
classified HIGH.

· All attributes of the Research Project entity are classified LOW when the Description
attribute contains LOW information.

Attribute classification rules are similar to entity classification, except that they address specific
attributes of an entity. Attribute classification rules from the ATC example include:

· The existence of the Target Platform attribute of Research Project is classified HIGH.

· Instances of the Employee entity are UNCLASSIFIED, but the values (not existence) of
two attributes, Home Address and Home Phone#, are classified LOW.

Data Association classification rules are stated when a relationship between entities or attributes is
considered more sensitive than information about the entities or attributes. This type of
classification rule implements inter- and intra-entity aggregation as discussed in Section 2. The
ATC example includes these data association rules:

· The association between a specific project (classified LOW) and the sponsoring client
(classified LOW) must be protected at HIGH.

· The association between an employee and the employee's salary is to be protected at
LOW.

These three types of classification rules (entity, attribute, and data association), along with
appropriate access control rules (e.g., the military classification system) will define the basis for
preventing direct disclosure. Direct disclosure can be prevented if the classification rules correctly
implement the protection requirements (i.e., are consistent and complete) and if the AIS correctly
implements the rules.
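
As a minimal sketch of how entity and attribute classification rules combine with an access control rule to prevent direct disclosure, the rules can be held as data and consulted on every read. The rule table below encodes a few of the ATC rules; the attribute name "Name", the "*" wildcard, and the function names are assumptions made for illustration.

```python
# Assumed ordering of the three levels used in the ATC example.
LEVELS = {"UNCLASSIFIED": 0, "LOW": 1, "HIGH": 2}

# (entity, attribute) -> classification; "*" stands for every attribute.
RULES = {
    ("Critical Technologies", "*"): "HIGH",
    ("Employee", "Home Address"): "LOW",
    ("Employee", "Home Phone#"): "LOW",
}


def classify(entity, attribute):
    """Return the level of an attribute; the most specific rule wins and
    anything not covered by a rule defaults to UNCLASSIFIED."""
    for key in ((entity, attribute), (entity, "*")):
        if key in RULES:
            return RULES[key]
    return "UNCLASSIFIED"


def visible(entity, record, clearance):
    """Return only the attribute values the given clearance may read."""
    return {attr: value for attr, value in record.items()
            if LEVELS[classify(entity, attr)] <= LEVELS[clearance]}


# Hypothetical employee record.
employee = {"Name": "A. Smith", "Home Address": "1 Main St",
            "Home Phone#": "555-0100"}

print(visible("Employee", employee, "UNCLASSIFIED"))
print(visible("Employee", employee, "LOW"))
```

The default to UNCLASSIFIED in this sketch is exactly where an incomplete rule set would cause a direct disclosure, which is why the rules must be complete as well as consistent.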

On the other hand, eliminating inferences is difficult, if not impossible, in that an individual's
capability to infer information is based on their already acquired knowledge and the innovative
ways in which information can be combined [Morgenstern 88]. Much of the knowledge that may
be used to infer unauthorized information is normally outside of the system in the form of domain-
specific knowledge or even common sense. At one end of the spectrum, one can ensure that no
unauthorized inferences are possible by classifying all information at HIGH and then giving all
users a HIGH clearance. Although this approach solves the inference problem (since every user is
authorized to access all information in the system), we now have introduced new vulnerabilities
(e.g., users have clearances they should not have, and therefore access to information they should
not see) and potentially decreased the operational efficiency of the organization (e.g., having to
manually downgrade a large percentage of the information produced by the system).

Establishing classification rules that will reduce unwanted inferences involves a constant trade-off
between the desire to raise the classification of data to reduce inferences and the desire to maintain
or even lower the classification of data to improve operational efficiency without directly
disclosing classified data to those not cleared to see it. This process of evaluating the relative
advantages of over-classifying information to prevent inferences is part of the risk analysis activity
identified in Section 2.3.

At this point, a fourth type of classification rule, cardinal aggregation, is often added. Cardinal
aggregation rules address the classification of collections of instances of the same entity. Often,
these rules are established to reduce inference. At this point in the framework we can conclude the
high-level relationships between inference and aggregation are:

· inference is a vulnerability, e.g., authorized users can deduce information not authorized
by the protection requirements, and

· aggregation is a class of classification policies that are stated as one form of control, in
many cases, to reduce the inference vulnerability.

The ATC example includes a cardinal aggregation classification rule:

· the education entity and each of its attributes for all employees are individually
UNCLASSIFIED, but

· the collection of the Degree and Major attributes of all instances of the education entity is
HIGH.

The ATC cardinal aggregation classification rule illustrates the difficulty of enforcing the intent of
the rule, even though we have specified the information to be protected (i.e., the identity of Critical
Technologies must be protected at HIGH). The difficulty is to determine how many employees'
educational information has to be aggregated to allow an inference. Is it all employees on a
particular project? Ten employees? Ten percent of the employees? All employees with blue eyes?
To have an effective implementation of a cardinal aggregation classification rule, someone has to
state explicit criteria for which collections must be considered HIGH. Unfortunately, cardinal
aggregation classification rules introduce a new vulnerability, as discussed in the next section.
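
The need for explicit criteria can be made concrete with a small sketch. It assumes, purely for illustration, that the criterion is a fixed count of distinct Education records and that the system remembers what each user has already retrieved; the class name, the threshold value, and the cleared_for_aggregate flag are hypothetical.

```python
from collections import defaultdict


class AggregationMonitor:
    """Refuse a request that would bring a user's cumulative collection of
    records up to a stated threshold, unless the user is cleared to hold
    the aggregate."""

    def __init__(self, threshold):
        self.threshold = threshold       # assumed, explicitly stated criterion
        self.seen = defaultdict(set)     # user -> record ids already released

    def request(self, user, record_ids, cleared_for_aggregate=False):
        combined = self.seen[user] | set(record_ids)
        if len(combined) >= self.threshold and not cleared_for_aggregate:
            return False                 # the collection would be an aggregate
        self.seen[user] = combined
        return True


monitor = AggregationMonitor(threshold=3)
print(monitor.request("user1", [101, 102]))   # two records: released
print(monitor.request("user1", [102]))        # repeat of a known record: released
print(monitor.request("user1", [103]))        # third distinct record: refused
```

A counter of this kind only tracks retrievals made through the system under one user identity. It does nothing about records obtained by other means or pooled by cooperating users.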

At this point it is appropriate to distinguish between an aggregation policy, such as those discussed
above, and a Chinese Wall policy. Brewer and Nash introduced an important commercial security
policy called the Chinese Wall policy [Brewer 89]. They described the policy as follows:

It can most easily be visualized as the code of practice that must be followed by a market
analyst working for a financial institution providing corporate business services. Such an
analyst must uphold the confidentiality of information provided to him by his firm's clients;
this means that he cannot advise corporations where he has insider knowledge of the plans,
status, or standing of a competitor. However, the analyst is free to advise corporations which
are not in competition with each other, and also draw on general market information.

This type of policy seems to involve aggregates of information and therefore is sometimes
considered an "aggregation problem." In fact, the Chinese Wall policy does not meet the definition
of an aggregation problem; there is no notion of some information being sensitive with the
aggregate being more sensitive. The Chinese Wall policy is an access control policy where the
access control rule is not based just on the sensitivity of the information, but is based on the
information already accessed. The situation can be considered similar to having compartmented
data and subjects which can be cleared for exactly one compartment. The choice of which
compartment the subject is cleared for is effectively made when the first compartmented data is
viewed.
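
The history-based nature of the Chinese Wall access rule can be shown in a short sketch. The conflict-of-interest classes and company names are hypothetical, and only the read rule is modeled.

```python
class ChineseWall:
    """Access depends on what the analyst has already accessed: within a
    conflict-of-interest class, only the first company chosen stays open."""

    def __init__(self, conflict_classes):
        # company -> name of its conflict-of-interest class
        self.class_of = {company: cls
                         for cls, members in conflict_classes.items()
                         for company in members}
        self.chosen = {}   # analyst -> {conflict class: company accessed}

    def may_access(self, analyst, company):
        cls = self.class_of[company]
        previous = self.chosen.get(analyst, {}).get(cls)
        return previous is None or previous == company

    def access(self, analyst, company):
        if not self.may_access(analyst, company):
            return False
        self.chosen.setdefault(analyst, {})[self.class_of[company]] = company
        return True


# Hypothetical conflict-of-interest classes.
wall = ChineseWall({"banks": {"Bank A", "Bank B"}, "oil": {"Oil A", "Oil B"}})

print(wall.access("analyst1", "Bank A"))   # first access in the class
print(wall.access("analyst1", "Bank B"))   # competitor of Bank A
print(wall.access("analyst1", "Oil A"))    # different class
```

Nothing in the sketch assigns a higher sensitivity to a combination of data; the decision is driven entirely by the access history, which is the distinction drawn above.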

3.2.3 Vulnerabilities from Classification Rules

Figure 3.3 identifies three residual vulnerabilities given a set of classification rules designed to
prevent direct disclosure and reduce inference (i.e., the vulnerabilities of information
confidentiality protection requirements):

· Flawed Rules that allow Direct Disclosure: With complex operational environments, the
classification rules can be flawed with respect to how faithfully they implement the
information confidentiality protection requirements. They can be inconsistent, incomplete,
or just incorrect such that they allow direct disclosure.

· Flawed Rules that allow an Unintended Inference: Although the entire set of classification
rules is intended to prevent unwanted inferences, when faced with a complex set of
classification rules there may be opportunities for unintended inferences. Again, this may
be caused by rules that are inconsistent, incomplete, or incorrect.

· Cardinal Aggregation: If a cardinal aggregation rule has been stated that classifies an
aggregate as LOW but individual elements are UNCLASSIFIED, then a vulnerability has
been added since UNCLASSIFIED users may be able to assemble aggregates in an
unintended or undetectable way.

3.2.4 Database Design

Database designers and application developers attempt to implement classification rules by
accurately representing and attaching security labels to database objects. The Relational Model
[Codd 70], one of several data models, is the model of choice for existing commercial MLS
products. Figure 3.4 shows a relational model representation of the ATC example using tuple-level
labeling.

The following are examples of how each type of classification rule can be implemented in a
database design:

· The rule, "the existence of all instances of the Critical Technologies entity is classified
LOW," is implemented by classifying the Critical Technologies table metadata at LOW.
This is done by the database designer.

· The rule, "the value of all instances of the Critical Technologies entity is classified
HIGH," is implemented by labeling each tuple in the Critical Technologies table at HIGH. This
is enforced by the user entering the data.

· The rule, "all attributes of the Research Project entity are classified LOW when the
Description attribute contains LOW information," is implemented by having tuples
labeled UNCLASSIFIED or LOW based on the content of the Description attribute. This
is enforced by the user entering the data.

· The rule, "the existence of Research Projects that use Critical Technologies are classified
HIGH," is implemented by classifying each Research Project tuple at HIGH when the
project uses a Critical Technology. This is enforced by the user entering the data.

· The rule, "the existence of the Target Platform attribute of Research Project is classified
HIGH," is implemented by decomposing the Target Platform attribute (which would
normally be an attribute in the Research Project table) into a separate table, Project-Target.
All tuples in Project-Target are classified HIGH. Note that the metadata can also be
classified HIGH to prevent obtaining the knowledge that projects have a target platform
that could contribute to unauthorized inferences.

· The rule, "all instances of the Target Platform attribute are classified HIGH," from an
implementation standpoint, is redundant with the previous rule. Thus, this rule requires no
changes to the database design.

· The rules for data association between the Project and Client entities are, "the value of all
instances (i.e., all attributes) of the Client entity is classified LOW," "the value of all
instances of the Project entity (except those that use Critical Technologies) are classified
LOW," and "the association between the Project entity and the Client entity is HIGH."
These rules are implemented by labeling tuples (and metadata) in the Client table as LOW,
labeling tuples in the Project table as LOW (except those that are labeled HIGH because
they use Critical Technologies), and constructing an additional table, Client-Project, to
represent the data association between Clients and Projects. The Client-Project table
(metadata and tuples) is classified HIGH. Note that in the absence of a data association
rule, the association between Project and Client tables would be represented by a foreign
key from the Project table to the Client table.

Figure 3.4: ATC Example Database Design

· The data association rule, "the association between Salary attribute of the Employee entity
and Employee identifying attributes (name and number) is LOW," is implemented by
decomposing the Salary attribute into a separate table, Employee-Salary. All tuples in the
Employee-Salary table are labeled LOW.

· The cardinal aggregation rule established by the rules, "the education entity and all its
attributes for all employees are UNCLASSIFIED" and "the collection of the Degree and
Major attributes of the entity for all employees is LOW," is implemented by labeling all
tuples in the Employee table as UNCLASSIFIED and requiring some (unspecified)
mechanism to control access to multiple records.
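
The decomposition used for the Project and Client data association can be sketched with Python's standard sqlite3 module. The table names, column names, and sample rows are assumptions, since Figure 3.4 is not reproduced here. In an MLS DBMS the label comparison would be enforced by the TCB rather than by a WHERE clause written by the application.

```python
import sqlite3

# Assumed numeric encoding of the three levels in the ATC example.
LEVELS = {"UNCLASSIFIED": 0, "LOW": 1, "HIGH": 2}


def build_db():
    """Create a small in-memory database with tuple-level labels."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE project (id INTEGER PRIMARY KEY, title TEXT, label INTEGER);
        CREATE TABLE client  (id INTEGER PRIMARY KEY, name  TEXT, label INTEGER);
        -- The association is held in its own table, not as a foreign key in
        -- project, so that it can carry a higher label than either entity.
        CREATE TABLE client_project (client_id INTEGER, project_id INTEGER,
                                     label INTEGER);
    """)
    low, high = LEVELS["LOW"], LEVELS["HIGH"]
    db.execute("INSERT INTO project VALUES (1, 'Project P1', ?)", (low,))
    db.execute("INSERT INTO client VALUES (1, 'Client C1', ?)", (low,))
    db.execute("INSERT INTO client_project VALUES (1, 1, ?)", (high,))
    return db


def titles(db, clearance):
    """Project titles whose tuple label is dominated by the clearance."""
    rows = db.execute("SELECT title FROM project WHERE label <= ?",
                      (LEVELS[clearance],))
    return [title for (title,) in rows]


def sponsors(db, clearance):
    """(client, project) pairs visible at the given clearance."""
    level = LEVELS[clearance]
    rows = db.execute("""
        SELECT c.name, p.title
        FROM client_project cp
        JOIN client  c ON c.id = cp.client_id
        JOIN project p ON p.id = cp.project_id
        WHERE cp.label <= ? AND c.label <= ? AND p.label <= ?
    """, (level, level, level))
    return rows.fetchall()


db = build_db()
print(titles(db, "LOW"))     # the project itself is visible at LOW
print(sponsors(db, "LOW"))   # the association is not
print(sponsors(db, "HIGH"))
```

A LOW user in this sketch can read both the project and the client but cannot join them, which is the effect the data association rule is meant to achieve.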

3.2.5 Vulnerabilities from Database Design

Given a database design that attempts to faithfully represent the established classification rules,
four vulnerabilities are identified in Table 3.3:

· Faithful Representation of Flawed Rules A database design that faithfully implements a
flawed set of classification rules will not fix the flawed rules.

· Flawed Design Allowing Direct Disclosure A flawed design may allow a direct disclosure
at the schema level for uniformly classified data objects or at the content level for data
objects where instances may be classified at one of several levels.

· Flawed Design Allowing Unintended Inference The flawed design may allow unintended
inferences at the schema level for uniformly classified data objects, or at the content level
for data objects where instances may be classified at one of several levels.

· Cardinal Aggregation Labeled LOW If the data is labeled LOW, then a vulnerability
exists, since LOW users may be able to access aggregates in an unintended way.

These vulnerabilities are related to the values, application semantics, and the security labels
assigned to data objects, not to the functionality (beyond enforcing a MAC policy) or assurance
provided by the TCB. The TCSEC class of a TCB that enforces MAC has no impact on countering
these vulnerabilities. An A1 TCB is just as effective (or ineffective) as a B1 TCB in countering
these vulnerabilities.

3.2.6 Database Instantiation and Vulnerabilities

For completeness, Table 3.3 shows the database instantiation as the final activity. Databases are
populated with data, and users use the database to conduct their job tasks. The database
administrator or security officer is tasked with ensuring the system is operating in a secure manner.
Assuming clearances are properly assigned to users, two vulnerabilities are identified:

· Inappropriate Labeling Allowing Disclosure Current MLS systems, including DBMSs,
depend on the user being logged in at the access class that correctly represents the
classification of the data being added to the database. If a user with a HIGH clearance
enters HIGH data when the user's current access class is LOW, then HIGH data has been
directly disclosed, since users with only a LOW clearance can access the HIGH data that
is incorrectly marked LOW.

· Inappropriate Labeling Allowing Unintended Inference In a similar manner, a user with a
HIGH clearance may enter data at LOW that will allow other users with only a LOW
clearance to infer HIGH data.

The TCSEC class of a TCB that enforces MAC also has no impact on countering these
vulnerabilities.

3.3 OVERVIEW OF RESEARCH

This section provides a brief overview of the research that will be discussed in Sections 4 and 5.
Each of the papers and research efforts on inference and aggregation addresses one or more parts of
the framework. Table 3.2 lists formal approaches to defining the problems. Table 3.3 maps
inference and aggregation research efforts against the framework in three categories (techniques,
mechanisms, and analysis tools). As Table 3.3 shows, much of the research has been in developing
techniques and designing mechanisms. There has also been limited work in developing tools to
help address inference and aggregation. The remainder of the table shows which vulnerabilities
each of the analysis tools and mechanisms attempts to address.

Inference                                   Aggregation

Database Partitioning (Denning)             Security Algebra (Lin)
Set Theory (Goguen and Meseguer)            Modal Logic Framework (Cuppens)
INFER Function (Denning and Morgenstern)
FD and MVD (Su and Ozsoyoglu)
Sphere of Influence (Morgenstern)
Fact Model (Cordingley and Sowerbutts)
Inference Secure (Lin)
Inference Secure (Marks)
Imprecise Inference Control (Hale)

Table 3.2: Formal Definitions

Techniques

Mechanisms

Analysis Tools

Classification Rules

Entity and Attribute

Direct Association

Database Design
(Denning, Lunt)

Cardinal
Aggregation

Classification Rule Vulnerabilities

Direct Disclosure

Unintended
inference

Cardinal
Aggregation Rules

Faithfully
implements flawed
classification rules

Table 3.3: Research Summary (Part 1)

Techniques

Mechanisms

Analysis Tools

Data Design Vulnerabilities

Direct Disclosure at
schema level

Secondary Path Analysis
(Hinke, Binns)

Direct disclosure at
content level

Unintended
inference at schema
level

Basic Security Principle
(Meadows)

Semantic Data Models
(Buczkowski, Hinke,
Smith)

Classification
Constraints (Akl)

Constraint Satisfaction
(Morgenstern)

Integrity Constraints
(Denning)

Secondary Path Analysis
(Hinke, Binns)

DISSECT
(Garvey et al.)

AERIE
(Hinke and Delugach)

Constraint Processor
(Thuraisingham et al.)

Hyper Semantic Model
(Marks et al.)

Unintended
inference at content
level

Cardinal
Aggregation-LOW
Data

Separate Structures
(Lunt)

Use of Views (Wilson)

Restrict Ad hoc Queries

Use of Audit

Some or All Data HIGH

Aggregation Constraints
(Haigh and Stachour)

Aggregation Detection
(Motro et al)

Database Instantiation Vulnerabilities

Labeling allows
disclosure

Constraint Processor
(Thuraisingham et al.)

DB Inference Controller
(Buczkowski)

Secondary Path
Monitoring (Binns)

Polyinstantiation
(Denning)

Auditing Facility
(Haigh and Stachour)

Snapshot Facility
(Jajodia)

Labeling allows
unintended inference

Table 3.3: Research Summary (Part 2)

SECTION 4

APPROACHES FOR INFERENCE CONTROL

Providing a solution to the inference problem is beyond the capability of currently available MLS
database management systems. It is also recognized that a general solution to the inference
problem is not possible given the inability to account for external knowledge and human reasoning
[Binns 94a]. To provide partial solutions, the inference problem must be bounded.

Section 4.1 identifies several efforts by researchers to formally define the inference problem. These
approaches are summarized to provide different perspectives on the problem. Implementing
theoretical inference solutions has proven to be difficult. Section 4.2 describes database design
techniques that could be used to prevent inference problems from occurring. Mechanisms for
inference prevention and detection during run-time are described in Section 4.3. Finally, Section
4.4 describes several tools that assist the database designer to detect and avoid inference channels.

4.1 FORMALIZATIONS

Researchers have proposed numerous ways to characterize or define the inference problem. This
section presents several formal definitions from the ongoing research into the discovery of
fundamental laws that determine whether the potential for undesirable inferences exists.

4.1.1 Database Partitioning

Database partitioning is used by Denning to define the inference problem [Denning 86a]. For each
user U, the data in the database can be partitioned into two sets: a VISIBLE set and an INVISIBLE
set, as in Figure 4.1. The user U is allowed to access only elements from the VISIBLE set, and U
is not allowed to know what elements are in the INVISIBLE set. (Note that if VISIBLE is null, the
user has no access to database information and, therefore, the problem is outside the domain of the
DBMS. In what follows, we assume that VISIBLE is not null.) Let KNOWN denote the set of data
elements that is known to U. The set KNOWN is constructed by the user either as a result of the
previous queries or as a result of some external knowledge that U possesses. There is no inference
problem if the intersection of the two sets INVISIBLE and KNOWN is empty, as in Figure 4.2.
There is an inference problem if the two sets INVISIBLE and KNOWN intersect, as in Figure 4.3.

Figure 4.1: Partitioning of the Database for a User U

Figure 4.2: Database without an Inference Problem

Figure 4.3: Database with an Inference Problem
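
The partitioning test of Section 4.1.1 reduces to a set intersection. The following is a minimal sketch under the assumption that data elements can be enumerated as simple identifiers; the element names and the function has_inference_problem are hypothetical.

```python
def has_inference_problem(invisible, known):
    """True when the user knows at least one element of the INVISIBLE set."""
    return bool(set(invisible) & set(known))

# Hypothetical database contents, partitioned for one user.
database = {"d1", "d2", "d3", "d4"}
visible = {"d1", "d2"}
invisible = database - visible

# KNOWN may include facts from outside the database.
known_safe = {"d1", "external_fact"}
known_leaky = {"d1", "d3"}  # d3 was learned although it is INVISIBLE
```

Here known_safe corresponds to the situation of Figure 4.2 and known_leaky to that of Figure 4.3. In practice the difficulty lies in determining KNOWN, not in computing the intersection.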

4.1.2 Set Theory

Another formulation of the inference problem has been given by Goguen and Meseguer [Goguen
84]. Consider a database in which each data item is given an access class, and suppose that the set
of access classes is partially ordered. Define the relation > as follows: Given data items x and y,
relation x > y is said to hold if it is possible to infer y from x. The relation > is reflexive and
transitive. A set S is said to be inferentially closed if whenever x is in S and x > y holds, then y
belongs to S as well. Now, for an access class L, let E(L) denote the set consisting of all possible
responses that are classified at access classes dominated by L. There is an inference problem if E(L)
is not inferentially closed.

Goguen and Meseguer do not set forth any one candidate for the relation >; they merely require
that it be reflexive and transitive, and say that it will probably be generated according to some set
of rules of inference (for example, first order logic, statistical inference, monotonic logic, and
knowledge-based inference). They do note, however, that for most inference systems of interest,
determining that A > b (where A is a set of facts and b is a fact) is at best semidecidable; that is,
there is an algorithm that will give the answer in a finite amount of time if A > b, but otherwise
may never halt.
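
As an illustration of inferential closure, the sketch below assumes the inference relation is given as a finite, explicitly enumerated mapping, which sidesteps the semidecidability noted above. The data item names and both function names are hypothetical.

```python
def inferential_closure(items, infers):
    """All items reachable from `items` by repeatedly applying the inference relation."""
    closure = set(items)
    frontier = list(items)
    while frontier:
        x = frontier.pop()
        for y in infers.get(x, ()):
            if y not in closure:
                closure.add(y)
                frontier.append(y)
    return closure

def is_inferentially_closed(items, infers):
    """True when nothing outside `items` can be inferred from it."""
    return inferential_closure(items, infers) == set(items)

# Hypothetical relation: x maps to the items that can be inferred directly from x.
infers = {
    "flight_schedule": {"cargo_weight"},
    "cargo_weight": {"mission_type"},
}

# Candidate sets E(L) for two access classes.
E_low = {"flight_schedule", "cargo_weight"}
E_high = {"flight_schedule", "cargo_weight", "mission_type"}
```

In this example E_low is not inferentially closed, because mission_type can be inferred from it but lies outside it, so an inference problem exists at that access class.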

4.1.3 Classical Information Theory

Yet another definition that uses classical information theory has been given by Denning and
Morgenstern [Denning 86a]. Given two data items x and y, let H(y) denote the uncertainty of y, and
let Hx (y) denote the uncertainty of y given x (where uncertainty is defined in the usual information-
theoretic way). Then, the reduction in uncertainty of y given x is defined as follows:

INFER (x > y) = (H(y) - Hx (y)) / H(y)

The value of INFER (x > y) is between 0 and 1. If the value is 0, then it is impossible to infer any
information about y from x. If the value is between 0 and 1, then y becomes somewhat more likely
given x. If the value is 1, then y can be inferred given x.
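
For small discrete examples the measure can be computed directly. The sketch below assumes the usual relative form, (H(y) - Hx (y)) / H(y), with Shannon entropy in bits, and assumes the joint distribution of x and y is known, which is rarely the case in practice. The function names and the two sample distributions are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a {value: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def infer(joint):
    """Relative reduction in uncertainty of y given x, for a joint {(x, y): p}.

    Assumes every probability is positive and that H(y) is not zero.
    """
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    h_y = entropy(p_y)
    # Conditional entropy: sum over x of p(x) * H(y | x).
    h_y_given_x = 0.0
    for x, px in p_x.items():
        conditional = {y: p / px for (x2, y), p in joint.items() if x2 == x}
        h_y_given_x += px * entropy(conditional)
    return (h_y - h_y_given_x) / h_y

# x tells nothing about y.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# x determines y completely.
determined = {(0, 0): 0.5, (1, 1): 0.5}
```

With these inputs the independent case yields 0 and the determined case yields 1, matching the two endpoints described above.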

This formulation is especially useful since it shows that inference is not an absolute problem, but
a relative one. It provides a way to quantify the bandwidth of the illegal information flow. On the
other hand, Denning and Morgenstern point out its serious drawbacks [Denning 86a]:

1. It is difficult, if not impossible, to determine the value of Hx (y).

2. It does not take into account the computational complexity that is required to draw the
inference.

To illustrate the second point, they give the following example from cryptography: With few
exceptions, the original text can be inferred from the encrypted text by trying all possible keys (so
INFER (x > y) = 1 in this case); however, it is hardly practical to do so.

4.1.4 Functional and Multivalued Dependencies

Su and Ozsoyoglu have studied inference problems that arise because of the functional and
multivalued dependencies that are constraints over the attributes of a relation [Su 86,87,90].
Functional dependencies are defined below; the reader is referred to Date for a definition of
multivalued dependencies [Date 86].

Let R be a relation scheme defined over a set of attributes U, and let X and Y be subsets of U. A
functional dependency X > Y is said to hold in R if no relation r that is a current instance of R
contains two tuples with the same values for X but different values for Y; that is, given any pair of
tuples t and s of r, t[X] = s[X] implies t[Y] = s[Y].

Su and Ozsoyoglu illustrate how inference problems can arise if certain functional dependencies
are known to the low-level users. If attributes are assigned security labels in a manner that is
consistent with the functional dependencies, then these inference threats can be eliminated. This
process is formalized by Su and Ozsoyoglu.
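
The functional dependency condition can be checked against a relation instance as follows. This is only a sketch of the definition, not the labeling procedure of Su and Ozsoyoglu; the function name fd_holds and the sample Employee tuples are hypothetical.

```python
def fd_holds(rows, X, Y):
    """True if no two tuples agree on the attributes X but differ on the attributes Y."""
    seen = {}
    for row in rows:
        x_val = tuple(row[a] for a in X)
        y_val = tuple(row[a] for a in Y)
        # Remember the first Y value seen for each X value and compare later tuples to it.
        if seen.setdefault(x_val, y_val) != y_val:
            return False
    return True

# Hypothetical relation instance.
employees = [
    {"name": "Ann", "rank": "senior", "salary": 90},
    {"name": "Bob", "rank": "senior", "salary": 90},
    {"name": "Cy", "rank": "junior", "salary": 60},
]
```

If rank > salary holds and is known to low-level users, then labeling rank LOW and salary HIGH leaves an inference threat: a LOW user who sees an employee's rank and knows one salary for that rank can infer the others.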

4.1.5 Sphere of Influence

Morgenstern expands upon the INFER function (described in Section 4.1.3) to propose a
theoretical foundation for inference [Morgenstern 87,88]. He proposes a framework for the
analysis of logical inferences and for the determination of logical inference channels. Morgenstern
introduces the concept of a sphere of influence (SOI) and inference channel.

The SOI delimits the scope of possible inferences given some base data, called the core, which may
consist of entities, attributes, relationships, and constraints. Specifically, the sphere of influence
relative to some core, i.e., SOI (core), is the set of all information that can be inferred from the core.
The SOI is defined in terms of the INFER function. The SOI models the process by which a user's
knowledge of an application can give rise to inferences about additional information. The SOI
utilizes a forward chaining inference process from the given core to determine the scope of such
inferable data.

Morgenstern's concept of inference channel serves to isolate the lower level data which could give
rise to inferences about higher level data H. The computation of an inference channel is a
backward-chaining inference process from some resultant data H to determine all information
which contributes to upward inferences about H. An inference channel exists if information about
some set of data H may be inferred from data in another set C which is at a lower level than H, or
is incomparable relative to H in the classification lattice.
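
The two chaining directions can be sketched as follows. This is a simplification: Morgenstern defines the SOI in terms of the INFER function, whereas the sketch assumes purely logical rules of the form "premises imply conclusion". The rule set, the data item names, and both function names are hypothetical.

```python
def sphere_of_influence(core, rules):
    """Forward chaining: everything inferable from `core` with the given rules."""
    known = set(core)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known

def inference_channel(target, rules):
    """Backward chaining: every item that contributes to inferring `target`."""
    contributors = set()
    frontier = [target]
    while frontier:
        goal = frontier.pop()
        for premises, conclusion in rules:
            if conclusion == goal:
                for item in premises:
                    if item not in contributors:
                        contributors.add(item)
                        frontier.append(item)
    return contributors

# Hypothetical rules as (set of premises, conclusion) pairs.
RULES = [
    (frozenset({"destination", "cargo"}), "mission"),
    (frozenset({"flight_no"}), "destination"),
]
```

If mission is HIGH and flight_no, destination, and cargo are all LOW, the set returned by inference_channel("mission", RULES) identifies the lower level data that would have to be examined for upward inference.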

4.1.6 The Fact Model

Sowerbutts and Cordingley provide a formal definition of inference based on an abstract data
model called the Fact Model [Cordingley 89, 90; Sowerbutts 90]. They identify two aspects of the
inference problem: static inference and dynamic inference. Static inference is where knowledge of the
database, together with authorized facts, can be used to determine unauthorized facts. Dynamic
inference is where database operations can be used together with knowledge of the structure of the
database to determine unauthorized facts. The Fact Model addresses static inference.

The Fact Model is a model of knowledge described as a set of facts and a set of constraints. The
authors provide a language for specifying application-dependent facts and constraints. The Fact
Model provides a formal statement of the standard constraints, such as functional and multivalued
dependencies, and defines explicitly the inference threat associated with each [Cordingley 89]. The
relationship between facts and subfacts is defined by construction rules. An INFER function is
defined on facts, constraints, and construction rules. Given an arbitrary subset of facts and
constraints, the INFER function identifies all the facts that are implicit in the given subset, and
hence can be inferred by a user. At this point the classification attribute can be added for each fact.

This approach forms the basis for assigning classifications such that any facts inferable from a
given subset of facts whose classifications are dominated by a classification level k also have a
classification level dominated by k. The authors call this a Static Inferentially Secure Database.

The Fact Model also is extended to represent operations such that a Dynamic Inferentially Secure
Database is defined. The Fact Model was developed as an abstraction of the relational model but
can be used to describe any conventional database structure such as network or hierarchical.

4.1.7 Inference Secure

Lin introduces the concept of navigational inference as the process of accessing HIGH data via
navigating through legally accessible LOW data [Lin 92]. Lin states that inference problems exist
if the security classifications of data are inconsistent with some database structures. He identifies
three such inference problems:

· Logical Inference Problems The security classification of a theorem (a derivable formula)
in a formal theory should be consistent with its proof. If not, then logical inference
problems arise.

· Algebraic Inference Problems The security classification of "algebraically derivable data"
should be consistent with its relational algebraic structure. If not, algebraic inference
problems arise.

· Navigational Inference Problems The security classification of data should be consistent
with its navigational operators (which generate navigational paths). If not, navigational
inference problems arise.

Lin defines a multilevel data model as inference secure if:

(1) it is a Bell-LaPadula Model (secure under MAC), and

(2) it is navigational inference free.

Lin provides further descriptions and theorems to support the definitions [Lin 92].

4.1.8 An Inference Paradigm

Marks defines database inference as [Marks 94b]:

Inference in a database is said to occur if, by retrieving a set of tuples {T} having attributes
{A} from the database, it is possible to specify a set of tuples {T'}, having attributes {A'},
where {T'} is not a subset of {T} or {A'} is not equal to {A}.

The definition may be stated as an inference rule: IF ({T}, {A}) THEN ({T'}, {A'}), which may
be denoted as ({T}, {A}) > ({T'}, {A'}). However, it may be possible to retrieve a set of tuples
{T} and the associated attributes {A} from a database and to reason, using data and attributes
outside the database, to arrive at a set of tuples {T'} and attributes {A'} that are again within the
database. That is, it is possible to form a chain of reasoning, ({T}, {A}) > ({T1}, {A1}) > ({T2},
{A2}) >... > ({T'}, {A'}) where some of the tuples and/or attributes are outside the database.
When some of the data or attributes are outside of the database, the chain of reasoning cannot be
followed by the database system. Fortunately it is not necessary to actually follow such a chain of
reasoning in order to control the inference threat. If the database contains the endpoints of the chain
(({T}, {A}) and ({T'}, {A'})) there will exist what is referred to in logic systems as a material
implication relating the two sets.

If a material implication exists such that ({T}, {A}) > ({T'}, {A'}) where ({T}, {A}) is classified
LOW and ({T'}, {A'}) is classified HIGH, then the database can offer no assurance that there does
not exist some chain of inference, using outside knowledge, that can connect the two, enabling a
LOW user to infer HIGH information. If, however, ({T}, {A}) not > ({T'}, {A'}) within the
database, then it can be guaranteed that no chain of inference, using outside knowledge or not,
exists which connects the two sets. That is, the absence of a material implication between two sets
of data is sufficient to guarantee the absence of any chain of reasoning between the sets of data. It
is not necessary for inference control, however, since material implications may be coincidental,
and not related to any reasoning process. These arguments may be reduced to:

Limitations on Database Inference: ({T}, {A}) > ({T'}, {A'}) is an inference rule capable
of being controlled by the database if and only if all the tuples {T} and {T'} are in the database
and all the properties {A} and {A'} are attributes in the database.

4.1.9 Imprecise Inference Control

Hale, Threet, and Shenoi introduce a powerful, yet practical, formalism for modeling and
controlling imprecise functional dependency (FD) based inference in relational database systems
[Hale 94]. The existence of an imprecise FD implies that if some tuple components satisfy certain
equivalencies, then other tuple components must exist and their values must be equivalent.
Imprecise FDs can specify constraints on precise and imprecise data. Examples of imprecise FDs
are: "engineers have starting salaries about 40K" and "approximately equal qualifications and
more or less equal experience demand similar salary."

The authors formally define an imprecise FD and, from it, an imprecise inference channel.
Informally, an imprecise inference channel is a chain of imprecise FDs. To control and
ultimately eliminate compromising imprecise inferences, it is
necessary to specify a set of imprecise inference channels considered to be secure. This
compromise specification set is defined by the database administrator. Suspect channels in a
database are compared with these secure channels. A potential security compromise exists when a
suspect channel allows the inference of information which is not coarser than information inferred
by any secure channel. Potentially compromising imprecise inference channels are eliminated by
hiding relation attributes or by "clouding" data manifested by imprecise FDs in the compromising
channels.

4.2 DATABASE DESIGN TECHNIQUES

The problem of deciding how to label multilevel database objects - data, schemata, and constraints
- should not be a problem for cleared database designers and users [Millen 91]. They ought to know
the classification of whatever they are entering into the database. Unfortunately, many inferences
are not simple ones, and the number and complexity of potential inferences can be quite large,
which raises the question of how well a person can anticipate such inferences in order to classify
the data [Morgenstern 88]. The following subsections examine techniques for avoiding inference
problems in the first place and for detecting inference channels during database design.

4.2.1 Inference Prevention Methods

Ideally, if all unauthorized disclosures are to be prevented, the Basic Security Principle [Meadows
88a, 88b] should be followed:

Basic Security Principle for Multilevel Databases - The security class of a data item should
dominate the security classes of all data affecting it.

The reason for the Basic Security Principle is clear: if the value of a data item can be affected by
data at levels not dominated by its own level, information can flow into the data item from data at
other levels.
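
A check of the Basic Security Principle can be sketched as follows, assuming that the "affects" relationships between data items have already been enumerated and that the security classes form a simple total order (a real classification lattice is only partially ordered). The item names and the function violations are hypothetical.

```python
LEVELS = {"LOW": 0, "HIGH": 1}

def violations(labels, affects):
    """Return (item, source) pairs where the item's class fails to dominate the source's."""
    bad = []
    for item, sources in affects.items():
        for src in sources:
            if LEVELS[labels[item]] < LEVELS[labels[src]]:
                bad.append((item, src))
    return bad

# Hypothetical labels and dependencies: total_budget is computed from the other two.
labels = {"total_budget": "LOW", "overt_budget": "LOW", "covert_budget": "HIGH"}
affects = {"total_budget": ["overt_budget", "covert_budget"]}
```

Here total_budget is labeled LOW although it is affected by HIGH data, so the check reports the pair (total_budget, covert_budget). Relabeling total_budget as HIGH removes the violation.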

The task of predicting or detecting all inference problems appears to be very difficult. However,
many of these problems can easily be prevented by careful consideration of the data items on which
a data item is predicated. In practice, unfortunately, two problems with the Basic Security Principle
have been identified:

1. The number of potential inferences can be quite large.

2. All possible inferences cannot be anticipated.

The combination of these problems makes the task of appropriately classifying data complex.
However, several techniques that have been developed for dealing with inferences are discussed
below. If information x permits disclosure of information y, one way to prevent this disclosure is
to reclassify all or part of information x such that it is no longer possible to derive y from the
disclosed subset of x. There are two approaches to dealing with violations of this type: reclassify
either the data or the constraints appropriately. These approaches are discussed next.

4.2.1.1 Appropriate Labeling of Data

One of the approaches suggested to handle the inference problem is to design the multilevel
database in such a way that the Basic Security Principle is maintained [Binns 92a; Burns 92; Hinke
92; Garvey 93; Smith 90b; Thuraisingham 90c]. Security constraints are processed during
multilevel database design and subsequently the schemas are assigned appropriate security levels.
Several researchers have proposed that semantic database models be used to detect (and then
prevent) some inference problems [Buczkowski 90; Hinke 88; Smith 90b, 91]. Conventional data
models (such as hierarchical, network, and relational data models) use overly simple data
structures (such as trees, graphs, or tables) to model an application environment. Semantic
database models, on the other hand, attempt to capture more of the meaning of the data by
providing a richer set of modeling constructs. Since integrity and secrecy constraints can be
expressed naturally in semantic database models [Smith 90b], they can be used to detect inference
problems during the database design phase.

Classification (or secrecy) constraints can also be used to eliminate inference problems [Akl 87].
Classification constraints are rules that are used to assign security levels to data as they are entered
into the database. Classification constraints are required to be consistent (meaning that each value
is assigned a unique security level) and complete (meaning that rules assign each value a security
level). Inconsistent classification constraints indicate potential inference problems, and incomplete
classification constraints point to incomplete labeling rules. Two different methods have been
proposed for determining the consistency and completeness of the classification constraints. One
is based on logic programming [Denning 86b], and the other is based on computational geometry
[Akl 87].
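
The notions of consistency and completeness can be illustrated by a brute-force check over sample values. This is neither the logic programming method nor the computational geometry method cited above; it is a sketch that assumes constraints are given as (predicate, level) pairs over a single numeric attribute. The function check_constraints and both rule sets are hypothetical.

```python
def check_constraints(values, constraints):
    """Return (values assigned more than one level, values assigned no level)."""
    inconsistent, unlabeled = [], []
    for v in values:
        levels = {level for predicate, level in constraints if predicate(v)}
        if len(levels) > 1:
            inconsistent.append(v)   # violates consistency
        elif not levels:
            unlabeled.append(v)      # violates completeness
    return inconsistent, unlabeled

# Hypothetical salary rules (in thousands) whose ranges overlap between 40 and 50.
overlapping = [
    (lambda salary: salary < 50, "LOW"),
    (lambda salary: salary >= 40, "HIGH"),
]

# Hypothetical salary rules that leave the range from 40 to 60 unlabeled.
gappy = [
    (lambda salary: salary < 40, "LOW"),
    (lambda salary: salary > 60, "HIGH"),
]

sample = [30, 45, 70]
```

With this sample, the overlapping rules are inconsistent for the value 45, and the gappy rules are incomplete for the same value. A check over sample values can only find counterexamples; it cannot establish consistency or completeness over the whole domain.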

Morgenstern views the overall process of classifying a database as a constraint satisfaction
problem of the following form [Morgenstern 88]. Each potential inference which involves one or
more relations or objects (data) from the database serves as a constraint. It limits the classifications
which can be assigned to an object given the classification(s) of the other object(s). In some cases,
the constraint may uniquely determine the classification of the remaining data object. Morgenstern
defines safety for inference: a data object O, and its assigned classification label, are said to be safe
for inference if there is no upward inference possible from O. That is, object O cannot be used to
infer information about other data objects at higher or incomparable levels of the classification
lattice. A classification level is safe for inference if all the data objects having that classification
are safe.

The techniques described in this section rely on increasing the classification of one or more
elements in an offending path following its detection. Note that in practice, however, it is not
always possible to solve inference problems by raising the classification level of data that cause
inference violations. Some data may be pub