NCSC TECHNICAL REPORT - 005
Volume 1/5
Library No. S-243,039
FOREWORD
This report is the first of five companion documents to the Trusted Database
Management System
Interpretation of the Trusted Computer System Evaluation Criteria. The
companion documents
address topics that are important to the design and development of secure
database management
systems, and are written for database vendors, system designers, evaluators,
and researchers. This
report addresses inference and aggregation issues in secure database management
systems.
ACKNOWLEDGMENTS
The National Computer Security Center extends special recognition to
the authors of this
document. The initial version was written by Victoria Ashby and Sushil
Jajodia of the MITRE
Corporation. The final version was written by Gary Smith, Stan Wisseman,
and David Wichers of
Arca Systems, Inc.
The documents in this series were produced under the guidance of Shawn
P. O'Brien of the
National Security Agency, LouAnna Notargiacomo and Barbara Blaustein of
the MITRE
Corporation, and David Wichers of Arca Systems, Inc.
We wish to thank the members of the information security community who
enthusiastically gave
of their time and technical expertise in reviewing these documents and
in providing valuable
comments and suggestions.
TABLE OF CONTENTS
SECTION PAGE
1.0 INTRODUCTION 1
1.1 BACKGROUND AND PURPOSE 1
1.2 SCOPE 1
1.3 INTRODUCTION TO INFERENCE AND AGGREGATION 2
1.4 AUDIENCES OF THIS DOCUMENT 3
1.5 ORGANIZATION OF THIS DOCUMENT 4
2.0 BACKGROUND 5
2.1 INFERENCE AND AGGREGATION - DEFINED AND EXPLAINED 5
2.1.1 Inference 5
2.1.1.1 Inference Methods 6
2.1.1.2 Inference Information 7
2.1.1.3 Related Disciplines 7
2.1.2 Aggregation 8
2.1.2.1 Data Association Examples 9
2.1.2.2 Cardinal Aggregation Examples 9
2.2 SECURITY TERMINOLOGY 10
2.3 THE TRADITIONAL SECURITY ANALYSIS PARADIGM 12
2.4 THE DATABASE SECURITY ENGINEERING PROCESS 14
2.5 THE REQUIREMENTS FOR PROTECTING AGAINST INFERENCE AND AGGREGATION 15
2.5.1 The Operational Requirement 16
2.5.2 The TCSEC Requirement 16
3.0 AN INFERENCE AND AGGREGATION FRAMEWORK 18
3.1 DETAILED EXAMPLE 18
3.2 THE FRAMEWORK 20
3.2.1 Information Protection Requirements and Resulting Vulnerabilities 20
3.2.2 Classification Rules 20
3.2.3 Vulnerabilities from Classification Rules 25
3.2.4 Database Design 26
3.2.5 Vulnerabilities from Database Design 27
3.2.6 Database Instantiation and Vulnerabilities 28
3.3 OVERVIEW OF RESEARCH 28
4.0 APPROACHES FOR INFERENCE CONTROL 31
4.1 FORMALIZATIONS 31
4.1.1 Database Partitioning 31
4.1.2 Set Theory 32
4.1.3 Classical Information Theory 33
4.1.4 Functional and Multivalued Dependencies 33
4.1.5 Sphere of Influence 34
4.1.6 The Fact Model 34
4.1.7 Inference Secure 35
4.1.8 An Inference Paradigm 35
4.1.9 Imprecise Inference Control 36
4.2 DATABASE DESIGN TECHNIQUES 37
4.2.1 Inference Prevention Methods 37
4.2.1.1 Appropriate Labeling of Data 37
4.2.1.2 Appropriate Labeling Integrity Constraints 38
4.2.2 Inference Detection Methods 39
4.3 RUN-TIME MECHANISMS 40
4.3.1 Query Response Modification 41
4.3.1.1 Constraint Processors 41
4.3.1.2 Database Inference Controller 42
4.3.1.3 Secondary Path Monitoring 43
4.3.1.4 MLS System Restrictions 43
4.3.2 Polyinstantiation 44
4.3.3 Auditing Facility 44
4.3.4 Snapshot Facility 45
4.4 DATABASE DESIGN TOOLS 47
4.4.1 DISSECT 47
4.4.2 AERIE 48
4.4.3 Security Constraint Design Tool 49
4.4.4 Hypersemantic Data Model and Language 49
5.0 APPROACHES FOR AGGREGATION 51
5.1 DATA ASSOCIATION 51
5.1.1 Formalizations 52
5.1.1.1 Lin's Work 52
5.1.1.2 Cuppens' Method 54
5.1.2 Techniques 54
5.2 CARDINAL AGGREGATION 56
5.2.1 Techniques 56
5.2.1.1 Single Level DBMS Techniques 56
5.2.1.1.1 Database Design and Use of DAC 56
5.2.1.1.2 Use of Views 57
5.2.1.1.3 Restricting Ad Hoc Queries 58
5.2.1.1.4 Use of Audit 58
5.2.1.1.5 Store all Data System-High 59
5.2.1.2 Multilevel DBMS Techniques 59
5.2.1.2.1 Use of Labeled Views 59
5.2.1.2.2 Start with All Data System-High 60
5.2.1.2.3 Make Some Data High 60
5.2.2 Mechanisms 61
5.2.2.1 Aggregation Constraints 61
5.2.2.2 Aggregation Detection 61
6.0 SUMMARY 63
REFERENCES 65
LIST OF FIGURES
FIGURE PAGE
2.1: TRADITIONAL SECURITY ANALYSIS PARADIGM 13
2.2: DATABASE SECURITY ENGINEERING PROCESS 14
3.1: ENTITIES AND ASSOCIATIONS 19
3.2: ATC ATTRIBUTES 19
3.3: INFERENCE AND AGGREGATION FRAMEWORK 21
3.4: ATC EXAMPLE DATABASE DESIGN 27
4.1: PARTITIONING OF THE DATABASE FOR A USER U 32
4.2: DATABASE WITHOUT AN INFERENCE PROBLEM 32
4.3: DATABASE WITH AN INFERENCE PROBLEM 32
4.4: QUERY PROCESSOR 41
4.5: DBIC SYSTEM CONFIGURATION 42
4.6: GENERAL-PURPOSE DATABASE MANAGEMENT SYSTEM MODEL 46
4.7: DATABASE MANAGEMENT SYSTEM MODEL AT THE
U. S. BUREAU OF THE CENSUS 46
5.1: ATC COMPANY DATA ASSOCIATION EXAMPLES 55
LIST OF TABLES
TABLE PAGE
3.1: TAXONOMY OF PROTECTION REQUIREMENTS AND
CLASSIFICATION RULES 22
3.2: FORMAL DEFINITIONS 29
3.3: RESEARCH SUMMARY (PART 1) 29
3.3: RESEARCH SUMMARY (PART 2) 30
SECTION 1
INTRODUCTION
This document is the first volume in the series of companion documents
to the Trusted Database
Management System Interpretation of the Trusted Computer System Evaluation
Criteria [TDI 91;
DoD 85]. This document examines inference and aggregation issues in secure
database
management systems and summarizes the research to date in these areas.
1.1 BACKGROUND AND PURPOSE
In 1991 the National Computer Security Center published the Trusted Database
Management
System Interpretation (TDI) of the Trusted Computer System Evaluation
Criteria (TCSEC). The
TDI, however, does not address many topics that are important to the design
and development of
secure database management systems (DBMSs). These topics (such as inference,
aggregation, and
database integrity) are being addressed by ongoing research and development.
Since specific
techniques in these topic areas had not yet gained broad acceptance, the
topics were considered
inappropriate for inclusion in the TDI.
The TDI is being supplemented by a series of companion documents to address
these issues
specific to secure DBMSs. Each companion document focuses on one topic
by describing the
problem, discussing the issues, and summarizing the research that has
been done to date. The intent
of the series is to make it clear to DBMS vendors, system designers, evaluators,
and researchers
what the issues are, the current approaches, their pros and cons, how
they relate to a TCSEC/TDI
evaluation, and what specific areas require additional research. Although
some guidance may be
presented, nothing contained within these documents should be interpreted
as criteria.
These documents assume the reader understands basic DBMS concepts and
relational database
terminology. A security background sufficient to use the TDI and TCSEC
is also assumed;
however, fundamentals are discussed whenever a common understanding is
important to the
discussion.
1.2 SCOPE
This document addresses inference and aggregation issues in secure DBMSs.
It is the first of five
volumes in the series of TDI companion documents, which includes the following
documents:
· Inference and Aggregation Issues in Secure Database Management
Systems
· Entity and Referential Integrity Issues in Multilevel Secure
Database Management
Systems [Entity 96]
· Polyinstantiation Issues in Multilevel Secure Database Management
Systems [Poly 96]
· Auditing Issues in Secure Database Management Systems [Audit
96]
· Discretionary Access Control Issues in High Assurance Secure
Database Management
Systems [DAC 96]
This series of documents uses terminology from the relational model to
provide a common basis
for understanding the concepts presented. This does not mean that these
concepts do not apply to
other database models and modeling paradigms.
1.3 INTRODUCTION TO INFERENCE AND AGGREGATION
Many definitions of inference and aggregation have been suggested in
the literature. In fact, one of
the challenges to understanding inference and aggregation is that there
are different (but sometimes
similar) notions of what they are. We will discuss the differences and
similarities at length in later
sections. At this point, the following general definitions are sufficient:
Inference or the inference problem is that of users deducing (or inferring)
higher level
information based upon lower, visible data [Morgenstern 88].
Aggregation or the aggregation problem occurs when classifying and protecting
collections
of data that have a higher security level than any of the elements that
comprise the aggregate
[Denning 87].
Inference and aggregation problems are not new, nor are they only applicable
to DBMSs.
However, the problems are exacerbated in multilevel secure (MLS) DBMSs
that label and enforce
a mandatory access control (MAC) policy on DBMS objects [Denning 86a].
Inference and aggregation differ from other security problems since the
leakage of information is
not a result of the penetration of a security mechanism but rather it
is based on the very nature of
the information and the semantics of the application being automated [Morgenstern
88].
1.4 AUDIENCES OF THIS DOCUMENT
This document is targeted at four primary audiences: the security research
community, database
application developers/system integrators, trusted product vendors, and
product evaluators.
Inference and aggregation are problems based on the value and semantics
of data for the part of the
enterprise being automated. Understanding inference and aggregation is
important for all of the
audience categories, but the audience for whom this document is most relevant
includes those
involved in designing and engineering automated information systems (AIS)--in
particular, the
database application developers/system integrators. In general, this document
is intended to
present a basis for understanding the issues surrounding inference and
aggregation especially in
the context of MLS DBMSs. Implemented approaches and ongoing research
are examined.
Members of the specific audiences should expect to get the following from
this document:
Researcher
This document describes the basic issues associated with inference and
aggregation. Important
research contributions are discussed as various topics are examined. By
presenting current theory
and debate, this discussion will help the research community understand
the scope of the issue and
highlight approaches and alternative solutions to inference and aggregation
problems. For
additional relevant work, the researcher should consult two associated
TDI companion documents:
Polyinstantiation Issues in Multilevel Secure Database Management Systems
[Poly 96] and Entity
and Referential Integrity Issues in Multilevel Secure Database Management
Systems [Entity 96].
Database Application Developer/System Integrator
This document highlights the need for analysis of the application-dependent
semantics of an
application domain in order to understand the impact of inference and
aggregation on MLS
applications. It describes the basic issues and current research into
providing effective solutions.
Trusted Product Vendor
This document describes the types of mechanisms from the research literature
that can be
incorporated into products to assist application developers with enforcing
aggregation-based
security policies. Research on tools to support inference analysis provides
the basis for additional
tools that might be provided with a DBMS product.
Evaluator
This document presents an understanding of inference and aggregation
issues and how they relate
to the evaluation of a candidate MLS DBMS implementation.
1.5 ORGANIZATION OF THIS DOCUMENT
The organization of the remainder of this document is as follows:
· Section 2 discusses several first principles that form the foundation
for reasoning about
inference and aggregation. Included in Section 2 are subsections that
define and explain
many aspects of inference and aggregation, present relevant security terminology
(e.g.,
comparing inference to direct disclosure), and describe two fundamental
processes: the
traditional security analysis paradigm and the database security engineering
process.
· Section 3 presents a framework for reasoning about inference
and aggregation based on
the two processes described in Section 2. Section 3 also introduces a
detailed example that
is used throughout the document to illustrate the material. Section 3
concludes with a
matrix that shows how each research effort discussed in Sections 4 and 5 relates to the
framework.
· Section 4 summarizes research efforts on inference.
· Section 5 summarizes research on aggregation.
· Section 6 summarizes the contents of this document.
SECTION 2
BACKGROUND
Inference and aggregation are different but related problems. In order
to understand what they are,
what problems they embody, and their relationship, it is essential to
revisit several first principles.
In Section 2.1, we give further descriptions and discuss inference and
aggregation in some detail,
including examples. Some important terminology and security concepts are
covered in Section 2.2.
The next two sections describe fundamental processes that form the basis
for reasoning about
inference and aggregation: the traditional security analysis paradigm
(Section 2.3) and the database
security engineering process (Section 2.4). The final section summarizes
the operational and
TCSEC requirements for inference and aggregation.
Although some will consider much of this section intuitively obvious, these are the
principles that are most often forgotten, and forgetting them leads to misunderstandings.
Even experienced researchers and practitioners should review this section.
2.1 INFERENCE AND AGGREGATION - DEFINED AND EXPLAINED
This section provides further definitions and explanations of inference
and aggregation problems.
Many examples are given to illustrate the relationship and complexity
of these problems.
2.1.1 Inference
Inference, the inferring of new information from known information, is
an everyday activity of
most humans. In fact, inference is a topic of interest for several disciplines,
each from its own
perspective. In the context of this document:
Inference addresses the problem of preventing authorized users of an
AIS from inferring
unauthorized information.
Preventing unauthorized inferences is complicated both by the variety
of ways in which humans
infer new information and by the wide variety of information they use
to make inferences.
Some researchers characterize the inference problems in terms of inference
channels [Morgenstern
88; Meadows 88a]. SRI's research identifies three types of inference channels
based on the degree
to which HIGH data may be inferred from LOW data [Garvey 94]:
· Deductive Inference Channel. The most restrictive channel, requiring
a formal deductive
proof (described by propositional or first-order logic) showing that HIGH
data can be
derived from LOW data.
· Abductive Inference Channel. A less restrictive channel where
a deductive proof could be
completed assuming certain LOW-level axioms.
· Probabilistic Inference Channel. This channel occurs when it
is possible to determine that
the likelihood of an unauthorized inference is greater than an acceptable
limit.
These descriptions of inference channels combine the method by which
the inference is made and
the information used to make the inference. Both of these areas of complexity
are discussed further
in the following paragraphs. Section 2.1.1.3 identifies related disciplines
and how they address
inference.
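As a rough illustration of the deductive channel, the following Python sketch applies forward chaining over simple propositional rules to test whether any HIGH fact can be derived from LOW facts alone. The fact names, the rules, and the helper functions derive_closure and deductive_channels are hypothetical and are not taken from the cited research; an actual analysis would need a far richer logic and a model of the application domain.

```python
# Hypothetical sketch: detect a deductive inference channel by forward chaining.
# All facts, rules, and labels below are invented for illustration only.

def derive_closure(facts, rules):
    """Return every fact derivable from `facts` using Horn-style rules.

    Each rule is a pair (premises, conclusion); premises is a frozenset of facts.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Fire the rule when all premises are known and it adds something new.
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def deductive_channels(low_facts, rules, high_facts):
    """Return the HIGH facts that can be derived from LOW facts alone."""
    return derive_closure(low_facts, rules) & set(high_facts)


low_facts = {"flight_departs_base_x", "cargo_weight_is_heavy"}
rules = [
    (frozenset({"flight_departs_base_x", "cargo_weight_is_heavy"}), "cargo_is_munitions"),
    (frozenset({"cargo_is_munitions"}), "mission_is_resupply"),
]
high_facts = {"mission_is_resupply", "pilot_identity"}

leaked = deductive_channels(low_facts, rules, high_facts)
print(leaked)
```

In this sketch only mission_is_resupply is reachable from the LOW facts, so it is the only HIGH fact reported. An abductive channel could be approximated in the same way by adding assumed LOW-level axioms to low_facts before computing the closure.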
2.1.1.1 Inference Methods
One of the factors that makes the inference problem so difficult is that
there are many methods by
which a human can infer information. The following list [Ford 90] illustrates
the variety of
inference strategies that can be used:
· Inference by Deductive Reasoning. New information is inferred using well-formed rules,
either through classical or non-classical logic deduction. Classical deductions are based on
basic rules of logic (e.g., from the assertions "A is true" and "if A then B," one deduces
"B is true"). Non-classical logic-based deduction includes probabilistic reasoning, fuzzy
reasoning, non-monotonic reasoning, default reasoning, temporal logic, dynamic logic, and
modal logic-based reasoning.
· Inference by Inductive Reasoning. Well-formed rules are used to infer hypotheses from
observed examples.
· Inference by Analogical Reasoning. Statements such as "X is like Y" are used to infer
properties of X given the properties of Y.
· Inference by Heuristic Reasoning. New information is deduced using various heuristics.
Heuristics are not well defined and may be a "rule of thumb."
· Inference by Semantic Association. The association between entities is inferred from the
knowledge of the entities themselves.
· Inferred Existence. One can infer the existence of entity Y from certain information (e.g.,
from the information, "John lives in Boston," one can infer that "there is some entity called
John").
· Statistical Inference. Information about an individual entity is inferred from various
statistics computed on a set of entities.
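To make the last item concrete, the following Python sketch shows how an individual value might be recovered from two permitted aggregate queries. The salary figures, the sum_query function, and the minimum query-set size of 3 are assumptions chosen for illustration; they do not describe any particular DBMS.

```python
# Hypothetical sketch: statistical inference from two permitted aggregates.
# The data and the query-set-size rule are invented for illustration only.

salaries = {"alice": 52000, "bob": 61000, "carol": 58000, "dave": 75000}


def sum_query(names):
    """Answer a SUM over the named employees, refusing very small query sets."""
    if len(names) < 3:
        raise PermissionError("query set too small")
    return sum(salaries[name] for name in names)


# A direct query on one employee is refused, but two larger queries are allowed.
everyone = sum_query(["alice", "bob", "carol", "dave"])
all_but_dave = sum_query(["alice", "bob", "carol"])

# The difference of the two permitted answers is one individual's salary.
inferred_dave = everyone - all_but_dave
print(inferred_dave)
```

The sketch suggests why a simple limit on query-set size is not sufficient by itself: each answer is acceptable in isolation, yet the pair discloses an individual value.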
2.1.1.2 Inference Information
The other factor that makes preventing unauthorized inferences within
a DBMS a difficult problem
is the wide variety of information that can be used to make an inference.
The following are some
types of information that can be used to make inferences:
· Schema metadata including relation and attribute names
· Other metadata such as constraints (e.g., value constraints
or other constraints that may be
implemented through triggers or procedures)
· Data stored in the relations (i.e., the value of the data)
· Statistical data derived from the database
· Other derived data that can be determined from data in the database
· The existence or absence of data (as opposed to the value of
the data)
· Data disappearing (e.g., upgraded) or changing value over time
· Semantics of the data that are not stored in the database but
are known within the
application domain
· Specialized information about the application domain (both process
and data) not stored
in the database
· Common knowledge, and
· Common sense
Denning [Denning 86a] and many other researchers adopt a so-called "closed-world
assumption"
that limits inference to the database. With this limitation, both the
information used to make the
inference and the inferred information must be represented in the database.
The first seven types
of information described above encompass information that would be included
using this closed-
world assumption. Unfortunately, in the real world in which a DBMS must
operate, significant
amounts of the remaining types of information will be used by an adversary
to attempt to infer
unauthorized information.
2.1.1.3 Related Disciplines
Inference in the context of this document focuses on the information
contained in a DBMS and
preventing inference of unauthorized information by authorized users of
the DBMS. Inference is
also a key issue in several other disciplines including:
· Operations Security (OPSEC). OPSEC is concerned with denying external adversaries the
ability to infer classified information by observing unclassified indicators. The key
difference is that the adversary is external to the organization, while this document
addresses individuals that are internal to the organization (i.e., DBMS users).
· Artificial Intelligence (AI). The ability to infer new facts from existing knowledge is a
positive goal of several of the sub-disciplines of AI. The AI research community is looking
for better ways to represent knowledge and approaches for inferring new information.
· Uncertainty Management. The inverse problem of preventing inferences is the need for
individuals, such as intelligence analysts, to infer information about an adversary based
upon evidence that can be collected. There are several approaches to evaluating evidence
in an effort to quantify one's confidence in the evidence and the resulting inferred
information [Schum 87].
2.1.2 Aggregation
Like the word inference, the word aggregation has multiple meanings in
the information
technology community. An aggregation, or aggregate, is a collection of
separate data items. In the
database context, data aggregation normally has a positive connotation.
Aggregation is an
important and desirable side effect of organizing information into a database.
Information has a
greater value when it is gathered together, made easily accessible, and
structured in a way that
shows relationships between pieces of data [National Academy Press 83].
In some sense, every
database is an aggregation of data.
The term aggregation has a different, more common use in the commercial
database management
community, where it usually refers to the use of aggregate operators.
The five standard operations
are sum, average, count, maximum, and minimum. Others such as standard
deviation or variance
may also be found in query languages [Ullman 88]. These operators are
applied to columns of a
relation to obtain a single quantity. They allow conclusions to be drawn
from the mass of data
present in a database.
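For readers less familiar with aggregate operators, the short Python example below applies the five standard operators to one column of a relation using the standard sqlite3 module. The employee relation and its contents are hypothetical.

```python
import sqlite3

# Hypothetical relation used only to demonstrate the five standard operators.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("alice", 50000), ("bob", 60000), ("carol", 70000)],
)

# Each operator reduces the salary column to a single quantity.
row = conn.execute(
    "SELECT SUM(salary), AVG(salary), COUNT(*), MAX(salary), MIN(salary) "
    "FROM employee"
).fetchone()
total, average, count, highest, lowest = row
conn.close()

print(total, average, count, highest, lowest)
```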
The aggregate operators in commercial DBMSs relate closely to the operations
performed on
statistical databases. Statistical databases are used to answer queries
dealing with sums, averages,
and medians, and use aggregation to hide the separate data items found
in individual records. The
concern in a statistical database is for security in the sense of privacy,
not classification. Security
in statistical databases has a large body of literature and will only
be discussed briefly in this
document.
In the context of this document:
Aggregation involves the situation where a collection of data items is
explicitly considered to
be classified at a higher level than the classification assigned to the
individual data elements
[Lunt 89].
Denning has identified two classes of the aggregation problem [Denning
87]. We describe them
here in terms of entities and their attributes.
· Attribute Associations. Associations between instances of different data items, of which
there are two sub-classes:
    · Associations between Entities. Associations between instances of attributes of
    different entities.
    · Associations between Attributes of an Entity. Associations between instances of
    different attributes of an entity.
· Size-based Record Associations. Associations among multiple instances of the same entity.
We will use the current terminology to represent these classes of aggregation:
Data Association
[Jajodia 92] for Attribute Associations (both Inter- and Intra-Entity)
and Cardinal Aggregation
[Hinke 88] for Size-based Record Associations.
The difference between data association and cardinal aggregation is important.
Data association
refers to aggregates of instances of different data objects; cardinal
aggregation refers to aggregates
of instances of the same type of entity. Distinguishing between data association
and cardinal
aggregation as two types of aggregation is also important since data association
can be effectively
controlled using database design techniques with labeled data [Denning
87]. On the other hand,
effective mechanisms to enforce cardinal aggregation policies are still
a research issue.
2.1.2.1 Data Association Examples
A traditional example of intra-entity data association involves salary:
the identity of an employee
and the actual number that represents the employee's salary may both be
unclassified, but the
association between a particular employee and the employee's salary is
considered sensitive. This
is an instance of intra-entity data association as defined above: employee
name (or other
identifying attributes such as employee number) and salary are both attributes
of the employee
entity.
An example of inter-entity data association is the following: the identities of commercial
companies and government agencies are unclassified, but the association of a particular
company as the contractor for a particular government agency may be considered classified.
In this case, the data association is between two entities, commercial company and
government agency.
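The salary example can be sketched with labeled data as follows. This is only one possible design, shown under stated assumptions: the two-level ordering, the three relations, and the visible function are hypothetical, and the sketch ignores the remaining inference paths discussed elsewhere in this document.

```python
# Hypothetical sketch: protecting a data association with labeled relations.
# Names and salary values are individually LOW; only their pairing is HIGH.

LEVELS = {"LOW": 0, "HIGH": 1}

# Each relation is a list of (label, row) pairs.
employee_names = [("LOW", {"name": "Smith"}), ("LOW", {"name": "Jones"})]

# Salary values are stored in an order unrelated to the order of the names,
# so that position alone does not reveal the pairing.
salary_values = [("LOW", {"salary": 40000}), ("LOW", {"salary": 55000})]

# The association between a name and a salary carries the HIGH label.
name_salary = [
    ("HIGH", {"name": "Smith", "salary": 55000}),
    ("HIGH", {"name": "Jones", "salary": 40000}),
]


def visible(relation, clearance):
    """Return the rows whose label is at or below the user's clearance."""
    return [row for label, row in relation if LEVELS[label] <= LEVELS[clearance]]


print(visible(employee_names, "LOW"))
print(visible(salary_values, "LOW"))
print(visible(name_salary, "LOW"))   # the association is hidden from LOW users
print(visible(name_salary, "HIGH"))
```

A LOW user in this sketch sees every name and every salary value but none of the pairings, which is the intent of a data association policy.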
2.1.2.2 Cardinal Aggregation Examples
The traditional example of cardinal aggregation is the intelligence agency
telephone book:
individual entries are unclassified but the total book is classified [Lunt
89]. The challenge with this
example is understanding what information the agency is trying to protect
with the policy. Is it the
name and number of staff personnel (and therefore capability) of each
(or some) organizational
element? The total staffing size of the agency? Identification of special
employees? Special skills?
A clear understanding of what information needs to be protected must precede
a cardinal
aggregation policy.
Another example of a cardinal aggregation policy relates to an Army troop
list, that is, a list of
Army organizations [Smith 91]. Information about each Army organization
is unclassified (even
if the organization's mission is classified), but the list as a whole
is classified. It is difficult to get anyone to determine how many organizations can be
aggregated while the result remains unclassified. If the intent is to protect the Army's
war fighting capability,
then aggregating the combat
arms units (as opposed to the support and civilian-based organizations)
will directly disclose that
information. Of course, one has to know about the war fighting capabilities
of each type of combat
arms unit, which is unclassified information and can be obtained externally
to the troop list.
Unfortunately, there may be other ways to infer the unauthorized information.
For example, if one
can determine the tactical communications capability (i.e., aggregation
of communications units),
can one then infer the war fighting capability? And what about any number
of other aggregates of
units where there is a reasonable understanding of their relationship
to war fighting capability? It
quickly becomes a difficult problem to describe, let alone to provide
mechanisms to solve this type
of aggregation problem. Yet keeping information about individual units
unclassified is essential
for operational efficiency.
A final example is network management data: information about outages
of individual circuits is
not sensitive, but the aggregation of information about all circuit outages
for a network is sensitive.
In this case, one can postulate that this cardinal aggregation policy
is attempting to protect the
identification of vulnerabilities of the network from an adversary that
might want to disrupt or
render the network inoperative. With information about only one or two
circuit outages, an
adversary cannot mount an effective attack to bring the network down.
On the other hand, if the
adversary knows all the circuit outages, the network may have a temporary
single point of failure,
a vulnerability that the adversary can exploit. It is operationally impractical
to treat information
about each circuit as sensitive, yet in the face of a credible threat,
it is important to protect the
aggregate information.
All of these example cardinal aggregation policies were established as a trade-off between
the need to protect sensitive information and the need for operational efficiency. Some
researchers consider cardinal aggregation policies fundamentally unsound [Meadows 91].
Thus, it should not be surprising that effective mechanisms to implement what some
consider an unsound policy elude the community.
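To show what a cardinal aggregation policy asks of a mechanism, the following Python sketch keeps a per-user count of distinct records released and refuses requests that would exceed a threshold. It is a deliberately naive illustration, not a recommended or published mechanism: the AggregationMonitor class, the threshold value, and the record keys are all hypothetical.

```python
# Hypothetical sketch: a naive per-user threshold on distinct records released.
# This is not an effective control; it only illustrates the policy's shape.

class AggregationMonitor:
    """Track distinct records released to each user against a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.released = {}  # user -> set of record keys already released

    def request(self, user, record_key):
        """Return True if the record may be released to the user."""
        seen = self.released.setdefault(user, set())
        if record_key in seen:
            return True  # re-reading a record does not grow the aggregate
        if len(seen) + 1 > self.threshold:
            return False  # one more distinct record would exceed the limit
        seen.add(record_key)
        return True


monitor = AggregationMonitor(threshold=3)
answers = [monitor.request("analyst", key) for key in ["a", "b", "c", "a", "d"]]
print(answers)
```

Even this simple counter exposes the difficulties noted above: the threshold must be chosen by someone, the history must be retained indefinitely, and users who pool their results are not detected.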
2.2 SECURITY TERMINOLOGY
This section discusses several security-relevant terms and concepts.
We distinguish between two
ways in which information can be revealed:
· Direct Disclosure. Classified information is directly revealed
to unauthorized individuals.
· Inference. Classified information may be deduced by unauthorized
individuals with some
level of confidence often less than 100%.
The primary distinction, then, between direct disclosure and inference is whether the
information is revealed (direct disclosure) or must be deduced (inference). For example,
if a user at LOW can
execute a query to a DBMS and receive a response where the data displayed
to the LOW user is
actually classified HIGH (i.e., based on an explicit classification rule),
then this is a direct
disclosure problem not an inference problem. In most instances of inference,
the adversary has
some uncertainty with respect to the accuracy of the inference of the
unauthorized information.
For the purposes of this document, we do not distinguish between where
the HIGH data that is
directly disclosed or inferred is located (i.e., in the database or not).
The important consideration
is that HIGH data was directly disclosed or inferred. This consideration
is independent of whether
the HIGH data is actually stored in the database or it exists somewhere
else in the application
domain of discourse.
Data association policies are stated to prevent direct disclosure of
classified information related to
an association between data elements. In most cases, cardinal aggregation
policies are stated to
prevent inferences that could be made from the aggregate data.
The discussion above refers generically to "unauthorized" individuals.
In fact, we must distinguish
between two types of unauthorized access to information:
· based on authorized access to classified information (i.e.,
a label-based policy); and
· based on informal "need-to-know" policies.
"Need-to-know" is a well-known concept and is defined in the
DoD Information Security
Regulation [5100.1-R 86] as:
· "A determination made by a possessor of classified information
that a prospective
recipient, in the interest of national security, has a requirement for
access to, or knowledge,
or possession of the classified information in order to accomplish lawful
and authorized
government purposes."
There are two distinct types of need-to-know: "informal" (sometimes
called discretionary), and
"formal" need-to-know. With informal need-to-know, individuals
may have the clearance (i.e., are
considered trustworthy to properly protect the information) but not the
job-related need to actually
access the information. Informal need-to-know is enforced in the people
and paper world by the
individual who physically controls access to a classified document. In
an AIS, informal need-to-
know is normally enforced by discretionary access control (DAC) mechanisms
as described in the
TCSEC. Formal need-to-know is enforced through compartments (or other
release markings) to
which individuals are formally granted access. Mandatory access control
(MAC) categories are
used in an AIS to enforce access based on formal need-to-know.
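The two forms of need-to-know can be pictured with the following Python sketch, in which formal need-to-know is modeled as MAC categories checked by a dominance test and informal need-to-know is modeled as a simple access control list. The level names, the category names ALPHA and BRAVO, and the functions dominates, dac_allows, and can_read are assumptions made for illustration; they are not definitions from the TCSEC.

```python
from dataclasses import dataclass

# Hypothetical hierarchical levels, lowest to highest.
LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP SECRET": 3}


@dataclass(frozen=True)
class Label:
    """A MAC label: a hierarchical level plus a set of categories."""
    level: str
    categories: frozenset = frozenset()


def dominates(subject, obj):
    """Formal need-to-know: level is at least as high and categories are a superset."""
    return (LEVELS[subject.level] >= LEVELS[obj.level]
            and subject.categories >= obj.categories)


def dac_allows(user, acl):
    """Informal need-to-know: the user appears on the object's access list."""
    return user in acl


def can_read(user, clearance, obj_label, acl):
    """Read access requires both the MAC check and the DAC check to pass."""
    return dominates(clearance, obj_label) and dac_allows(user, acl)


analyst = Label("SECRET", frozenset({"ALPHA"}))
report = Label("CONFIDENTIAL", frozenset({"ALPHA"}))
plan = Label("CONFIDENTIAL", frozenset({"BRAVO"}))

print(can_read("jones", analyst, report, {"jones"}))  # cleared and on the list
print(can_read("smith", analyst, report, {"jones"}))  # cleared, no informal need-to-know
print(can_read("jones", analyst, plan, {"jones"}))    # lacks the BRAVO category
```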
Direct disclosure is normally only thought of as being a problem when
classified information is
revealed to an individual without the appropriate clearance, including
formal access to
compartments (i.e., formal need-to-know). Given that U.S. Government information
is classified
based on laws or executive orders, direct disclosure of classified information
is punishable by
imprisonment. On the other hand, disclosing information in violation of
an informal need-to-know
policy is normally not considered a compromise of classified information,
since the individual has
the appropriate clearance to see the information and is trusted to properly
protect it. Thus, a
violation of an informal need-to-know policy may result in an employee
being fired or sued, but
not imprisoned.
Although an authorized user can attempt to infer information for which
the user does not have an
informal need-to-know, research efforts to date in inference and aggregation
have dealt exclusively
with direct disclosure or inference based on compromise of classified
information, not on
violations of informal need-to-know policies. As such, this document focuses
almost exclusively
on protecting classified information against inference and aggregation
threats from individuals
who are not cleared to receive this information.
2.3 THE TRADITIONAL SECURITY ANALYSIS PARADIGM
To better understand inference and aggregation and how they relate to
each other, they need to be
discussed in the context of the traditional security analysis paradigm
shown in Figure 2.1. A brief
description of the activities in the paradigm follows:
· Identify information protection requirements: The protection
requirements traditionally addressed by the security community relate to
confidentiality. However, information protection requirements for
integrity and availability should also be specified. Although the
inference and aggregation problems strictly address confidentiality
protection requirements, solutions to these problems typically adversely
impact the ability to meet the integrity and availability requirements.
· Identify vulnerabilities: Given a set of protection requirements,
one must identify the vulnerabilities that could be exploited to
prevent those requirements from being satisfied by the system. As the
next section shows, vulnerabilities need to be assessed continuously
throughout the design and engineering process.
· Identify threats: Given identified vulnerabilities, one must
identify credible threat agents that can exploit those vulnerabilities.
For inference and aggregation, the primary threat is the insider: an
authorized user of the system who is authorized access to some, but not
all, information in the system. Malicious code is a secondary threat.
· Conduct a risk analysis: The risk analysis must take into account
the likelihood of a threat agent actually exploiting a vulnerability.
The risk analysis considers tradeoffs between effectiveness, cost, and
the ability to meet other requirements (e.g., integrity and
availability) to determine appropriate controls.
· Select controls: Based on the risk analysis, controls are selected
to eliminate or reduce a threat agent's ability to exploit the
vulnerability. Note that some controls are selected to prevent
exploitation, while others are selected to reduce or detect a threat
agent's ability to exploit the vulnerability. This distinction is
important in that most cases dealing with inference involve determining,
during the risk analysis, the likelihood of an individual inferring
unauthorized information given a particular set of controls. The goal
is to establish controls that provide an acceptable level of risk at an
affordable cost, where cost is cast in terms of both money and adverse
impact on functionality, ease of use, and performance.
Figure 2.1: Traditional Security Analysis Paradigm
Details of how this process is related to inference and aggregation are
discussed in the next section.
2.4 THE DATABASE SECURITY ENGINEERING PROCESS
The process of implementing an AIS with a database that meets the specified
security requirements
is shown in Figure 2.2. The process should begin with a clear understanding
of what information
needs to be protected, in this case from unauthorized disclosure. Without
a clear
understanding of what has to be protected, secure database
design is difficult.
Unfortunately, many inference and aggregation problems are stated at the
database design level
(e.g., with sensitivity labels on tables) without a description of what
information is to be protected
from disclosure.
Given a precise statement of the information to protect, there are two
distinct aspects of
establishing rules for implementing those confidentiality protection requirements:
· classification rules must be established that specify the confidentiality
properties of the
information, and
· access control rules must be established that specify which
authorized users can access
information based on the classification rules.
Figure 2.2: Database Security Engineering Process
Classification rules should address all information that needs to be
protected, not just information
that must be protected for national security reasons (i.e., Confidential,
Secret, and Top Secret)
based on Executive Order 12356. For example, the Computer Security Act
of 1987 requires that
unclassified but sensitive information have confidentiality protection.
Categories of unclassified
but sensitive information include, but are not limited to, personnel (i.e.,
subject to the Privacy Act
of 1974), financial, acquisition contracting, and investigations. Certain
other information is
classified under the Atomic Energy Act of 1954 as Restricted Data or Formerly
Restricted Data.
Confidentiality protection requirements are also applicable to non-government
organizations. For
example, proprietary information of many types may be considered "bet
your company" data and
needs to be protected to ensure the viability of the company.
Given a set of classification rules, the database designer must provide
a database design which
labels data objects and correctly implements the classification rules.
The database instantiation
represents the database as it is populated with data and used.
Access control rules are generically described as MAC for access based
on the military
classification system (including formal need-to-know) and DAC for access
based on informal
need-to-know. However, to effectively implement the protection requirements,
one must explicitly
identify, for each application, which authorized users, or classes
of users (e.g., roles), are authorized access to each type of information
identified in the
classification rules.
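As a rough illustration of how a generic MAC rule and an
application-dependent rule combine, the following Python sketch checks
both conditions before granting access. The numeric ordering of the
levels, the role names, and the information types are hypothetical and
are not taken from any particular system; compartments (formal
need-to-know) are not modeled.

```python
from enum import IntEnum


class Level(IntEnum):
    # Assumed total ordering of sensitivity levels, lowest to highest.
    UNCLASSIFIED = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3


# Hypothetical application-dependent rules: the roles authorized for
# each type of information named in the classification rules.
ROLE_RULES = {
    "salary": {"payroll_clerk"},
    "project_description": {"engineer", "project_manager"},
}


def may_access(clearance, role, info_type, label):
    """Grant access only if both the MAC rule and the role rule hold."""
    mac_ok = clearance >= label  # clearance dominates the data label
    role_ok = role in ROLE_RULES.get(info_type, set())
    return mac_ok and role_ok
```

In this sketch an engineer cleared to SECRET can read a CONFIDENTIAL
project description but not salary data, because the role rule fails
even though the MAC rule is satisfied.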
The access control rules must be incorporated into the application system
design. Although the
access control rules can be implemented through mechanisms in the operating
system or
application software, inference and aggregation concerns are primarily
addressed by DBMS access
controls.
Classification rules are normally found in a Classification Guideline
as required by the DoD
Information Security Program Regulation [DoD 5100.1-R]. Generic access
control rules are
documented in the system security policy as required by the TCSEC. The
instantiation of MAC
and DAC with application-dependent access control rules that identify
specific users and data is
usually documented in a Security Concept of Operations.
The classification rules for an application domain need to be complete.
That is, all information to
be contained in the support information system should be explicitly addressed.
2.5 THE REQUIREMENTS FOR PROTECTING AGAINST INFERENCE AND
AGGREGATION
In this section, we identify both operational and TCSEC requirements
related to protecting against
inference and aggregation. Operational requirements, identified in Section
2.5.1, include not only
the need to protect against inference and aggregation but also operational
considerations, or
tradeoffs, that impact decisions on how, or to what extent, inference
and aggregation protections
are implemented. Section 2.5.2 discusses the lack of direct TCSEC requirements
for protecting
against inference and aggregation.
2.5.1 The Operational Requirement
There are at least four operational requirements related to inference
and aggregation protection.
1. The enterprise's information must be properly protected from disclosure
either directly or
through inference.
2. Employees (i.e., users) must have access to the information they need
to do their jobs.
3. The enterprise must conduct its operations in an efficient and cost-effective
manner.
4. The enterprise must interact efficiently with external organizations,
which may involve
mostly unclassified information.
Unfortunately, these operational requirements often are in conflict.
Over-classifying data to meet
the first requirement may reduce availability of information to certain
employees (conflicts with
second requirement). Cost-effective operations (the third requirement)
may demand granting the
lowest clearances possible to employees, which may conflict with the established
classification
rules to protect information (first requirement). Finally, the need to
protect information (first
requirement) may conflict with the efficient interface with other organizations
(fourth
requirement). This last conflict is the genesis of most cardinal aggregation
policies.
2.5.2 The TCSEC Requirement
There are no requirements in the TCSEC, TDI, or Trusted Network Interpretation
(TNI) [TNI 87]
that directly address inference and aggregation issues. Inference and
aggregation are properties
intrinsic to data values and to the structure of a database (schema, metadata).
Inference and
aggregation are based on an interpretation of the enterprise's security
policy and on specific
requirements of the application domain being automated. Thus, inference
and aggregation are
best addressed by the system developer, acquisition agency, and certification
and accreditation
authorities for each system and should not be looked at during a
TCSEC/TDI evaluation.
Since inference is often characterized as an inference channel, there
is often some confusion as to
the relationship between inference and covert channels. Inference and
aggregation are possible
because of the existence of a set of specific known data values and a
set of logical implication rules.
Knowledgeable database administrators or information security officers
may be able to apply
inferential engines to the analysis of a populated database to uncover
a set of potential unauthorized
inferences which they can then protect.
However, new inferences may be introduced through error on the part of
users authorized to update
the database. It is important to note that the introduction of inference
or aggregation is a data value-
based problem (including metadata that represents the database design),
and not one of DBMS
(i.e., the product) design or implementation. Unlike Trojan horses or
covert channels, values
needed to support an inferential attack are introduced consciously by
benign actions of users, and
not clandestinely by malicious implanted code or by flaws in a DBMS trusted
computing base
(TCB) design or implementation. In this sense, inference problems are
different from covert
channel issues since all the data used to make the inference are at LOW,
and there is no need for
an active agent at HIGH to signal information from HIGH to LOW [Jajodia
92]. Note also that
inference will still produce the same result, independent of whether the
compromised (inferred)
data remains in (or ever existed in) the database at HIGH, as opposed
to the need for HIGH data
to be present in order for it to be signaled to LOW by a covert channel.
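The idea that an inference needs only a set of known data values and a
set of logical implication rules can be sketched in Python as a small
forward-chaining check. This is a minimal sketch, not the method of any
tool cited in this document; the facts, the rule, and the function names
are hypothetical. Note that every input fact is at LOW and nothing at
HIGH needs to exist in the database for the inference to succeed.

```python
def closure(facts, rules):
    """Forward-chain over rules given as (premises, conclusion) pairs."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


def unauthorized_inferences(low_facts, rules, protected):
    """Return protected facts that are derivable from LOW facts alone."""
    derived = closure(low_facts, rules) - set(low_facts)
    return derived & set(protected)


# Hypothetical LOW data values and one implication rule.
LOW_FACTS = {
    "smith works on project P",
    "smith has a degree in cryptography",
}
RULES = [
    (("smith works on project P", "smith has a degree in cryptography"),
     "project P uses cryptography"),
]
PROTECTED = {"project P uses cryptography"}
```

Calling unauthorized_inferences(LOW_FACTS, RULES, PROTECTED) reports the
protected fact as inferable. Removing either LOW fact from the input
removes the inference, which is the kind of result an administrator
could use to decide what to protect.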
SECTION 3
AN INFERENCE AND AGGREGATION FRAMEWORK
The purpose of this section is to present a framework to reason about
inference and aggregation.
Section 3.1 begins with a description of a detailed example application
domain that will be used to
illustrate parts of the framework. Section 3.2 describes the framework
using instances from the
detailed example for illustration. Section 3.3 presents a matrix which
summarizes how the research
efforts described in Sections 4 and 5 relate to the framework.
3.1 DETAILED EXAMPLE
This section introduces a detailed example used to illustrate the concepts
presented in this
document. We use a mythical company, the Acme Technology Company or ATC.
ATC contracts
with many government agencies and commercial companies to provide technology-oriented
products and services. ATC also conducts extensive applied research for
its clients in several areas
of technology. ATC needs extensive support for its business operations
by AISs that must represent
information about:
· entities of interest in the application domain, and
· functional or business processes used in the application domain.
Although there are other security considerations that should be addressed
in designing functional
processes, for the purpose of this document, we will only address the
process view as it relates to
access control.
We are most interested in understanding and modeling the information
about entities. In
particular, we are concerned with identifying the entities, associations,
and attributes of interest to
ATC.
Entities (real objects or concepts of interest):
· Employees: ATC employs a wide variety of personnel, from clerks
to very specialized engineers.
· Education: Data about the education of each ATC employee is retained.
· Research Projects: Although ATC has thousands of projects that
cover a wide variety of purposes, research projects represent a special
category of projects, especially from a security standpoint.
· Critical Technologies: Certain technologies are particularly
sensitive.
· Clients: Information about the organization that sponsors a project.
Associations (relationships between instances of the entities). Figure
3.1 shows the following
associations (in italics) between entities:
· Employees work-on Projects.
· Employees have Education.
· A Client sponsors Projects.
· Some Projects use Critical Technologies.
Figure 3.1: Entities and Associations
Figure 3.2: ATC Attributes
Attributes (characteristics or properties of an entity). The addition
of attributes adds more detail
and, therefore, realistic complexity to the example. Figure 3.2 shows
selected attributes for each
ATC entity.
3.2 THE FRAMEWORK
The framework, shown in Figure 3.3, is an extension to the database security
engineering process
described in Section 2.4. The framework provides the context to discuss
how inference and
aggregation are applicable at each step in the development process. Given
that inference and
aggregation are inherently application-dependent, examples, based on the
ATC example in the
previous section, are given to illustrate the concepts. This framework
with examples provides a
basis to illustrate and compare the proposed approaches described in Sections
4 and 5.
3.2.1 Information Protection Requirements and Resulting Vulnerabilities
There are at least three types of information protection requirements:
confidentiality, integrity, and
availability. Inference and aggregation relate exclusively to the confidentiality
objective of
security. Therefore, although integrity and availability protection requirements
are also important,
this document will only address confidentiality requirements.
Given a set of information confidentiality protection requirements, there
are two vulnerabilities
that must be addressed by appropriate controls: direct disclosure and
inference. Both of these
vulnerabilities were discussed in Section 2.2. Aggregation is not yet
a problem, because the
protection requirements address specific facts or information that must
be protected as opposed to
a protection requirement for an aggregate of data.
3.2.2 Classification Rules
Classification rules for information are stated such that direct disclosure
will be prevented, and the
ability to infer unauthorized information is reduced to an acceptable
level. Preventing direct
disclosure is fairly straightforward: state classification rules for entities,
attributes, and/or
associations and then restrict access based on user clearance and data
classification. Unfortunately,
in practice, protection requirements can be so complex that ensuring that
a set of classification rules
is correct (i.e., consistent and complete) can be difficult.
Each of the following types of classification rules from Figure 3.3 is
described in the following
paragraphs:
· entity;
· attribute;
· data association; and
· cardinal aggregation.
Figure 3.3: Inference and Aggregation Framework
Table 3.1 summarizes this section by showing a taxonomy of types of confidentiality
protection
requirements along with examples of protection requirements and supporting
classification rules
for the ATC example. The ATC classification rules from Table 3.1 are used
to illustrate the types
of classification rules described in this section. The types of protection
requirements are a synthesis
of those found in papers by Smith, Pernul, and Burns [Smith 90a; Pernul
93; Burns 92] and are
meant to be illustrative of the primary types of protection requirements
but not necessarily an
exhaustive taxonomy.
Each entry below gives: Type; Protection; Example Protection
Requirement; Example Classification Rules.

Entity Classification

Type 1a
Protection: existence of all instances of an entity
Example Protection Requirement:
· the fact that ATC is involved with Critical Technologies must be
protected at LOW
Example Classification Rules:
· the existence of all instances of Critical Technologies entity are
classified LOW

Type 1b
Protection: existence of selected instances of an entity
Example Protection Requirement:
· the fact that a Research Project involves Critical Technologies must
be protected at HIGH
Example Classification Rules:
· the existence of Research Projects that use Critical Technologies are
classified HIGH
· the existence of instances of all other entities is UNCLASSIFIED

Type 2a
Protection: identity of all instances of an entity
Example Protection Requirement:
· the identity of Critical Technologies must be protected at HIGH
· the identity of Clients of Research Projects must be protected at LOW
(this is a requirement of the client organization)
Example Classification Rules:
· the value of all instances (i.e., all attributes) of the Critical
Technologies entity is classified HIGH
· the value of all instances (i.e., all attributes) of the Client entity
is classified LOW

Type 2b
Protection: identity of selected instances of an entity
Example Protection Requirement:
· selected Research Projects will be protected at LOW based on the
contents of Project Description as determined by the classification
authority
Example Classification Rules:
· all attributes of the Research Project entity are classified LOW when
the Description attribute contains LOW information
· the identity of instances of all other entities is UNCLASSIFIED

Attribute Classification

Type 3
Protection: existence of an attribute (could also be of selected
instances of an attribute)
Example Protection Requirement:
· the fact that specific Target Platforms are identified for Research
Projects is to be protected at HIGH
Example Classification Rules:
· the existence of the Target Platform attribute of Research Project is
classified HIGH
· all instances of the Target Platform attribute are classified HIGH
· the existence of instances of all other attributes is UNCLASSIFIED
except privacy information

Type 4
Protection: value of an attribute
Example Protection Requirement:
· privacy information for employees (i.e., home address and phone
number) is to be protected at LOW
Example Classification Rules:
· the existence of the privacy information attributes of employees is
classified LOW

Data Association

Type 5
Protection: the association between two entities or two attributes of
different entities
Example Protection Requirement:
· the association between a specific project and the sponsoring client
must be protected at HIGH
Example Classification Rules:
· the association between the Project entity and the Client entity is
HIGH
· the association between all other entities is UNCLASSIFIED

Type 6
Protection: the association between two attributes of the same entity
Example Protection Requirement:
· the association between name and salary is to be protected at LOW
Example Classification Rules:
· the association between the Salary attribute of the Employee entity
and Employee identifying attributes (name, number) is LOW

Cardinal Aggregation

Type 7
Protection: a collection of instances of the same entity
Example Protection Requirement:
· the identity of Critical Technologies must be protected at HIGH
Example Classification Rules:
· the education entity and all its attributes for all employees are
UNCLASSIFIED
· the collection of the Degree and Major attributes for all employees is
LOW
Table 3.1: Taxonomy of Protection Requirements and Classification Rules
Entity classification rules can address the need to protect both the
identity as well as the existence
of instances of entities of interest. In some operational environments,
the fact that an entity (or
some selected instances of an entity) exists would disclose, directly
or through an inference,
classified information. In many other operational environments, knowing
that something classified
exists is not a problem; but you must not be able to know the classified
information (i.e., its identity
or value). A good example is a compendium of documents including some
that are classified. The
reader can know that the documents exist without having the clearance
to read them. Entity
classification rules from the ATC example include:
· The existence of all instances of the Critical Technologies
entity is classified LOW.
· The values of all instances (i.e., all attributes) of the Critical
Technologies entity are
classified HIGH.
· All attributes of the Research Project entity are classified
LOW when the Description
attribute contains LOW information.
Attribute classification rules are similar to entity classification,
except that they address specific
attributes of an entity. Attribute classification rules from the ATC example
include:
· The existence of the Target Platform attribute of Research Project
is classified HIGH.
· Instances of the Employee entity are UNCLASSIFIED, but the values
(not existence) of
two attributes, Home Address and Home Phone#, are classified LOW.
Data Association classification rules are stated when a relationship
between entities or attributes is
considered more sensitive than information about the entities or attributes.
This type of
classification rule implements inter- and intra-entity aggregation as
discussed in Section 2. The
ATC example includes these data association rules:
· The association between a specific project (classified LOW)
and the sponsoring client
(classified LOW) must be protected at HIGH.
· The association between an employee and the employee's salary
is to be protected at
LOW.
These three types of classification rules (entity, attribute, and data
association), along with
appropriate access control rules (e.g., the military classification system)
will define the basis for
preventing direct disclosure. Direct disclosure can be prevented if the
classification rules correctly
implement the protection requirements (i.e., are consistent and complete)
and if the AIS correctly
implements the rules.
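One way to picture how these three types of rules prevent direct
disclosure is to treat them as a lookup table that assigns a label to
each item, with a simple clearance comparison as the access control
rule. The Python sketch below is illustrative only: the rule keys, the
naming of entities and attributes, the UNCLASSIFIED default, and the
ordering UNCLASSIFIED < LOW < HIGH are assumptions made for this
example.

```python
from enum import IntEnum


class Level(IntEnum):
    # Assumed ordering of the labels used in the ATC example.
    UNCLASSIFIED = 0
    LOW = 1
    HIGH = 2


# A subset of the ATC classification rules, keyed by (rule kind, target).
# The key spelling is hypothetical.
RULES = {
    ("entity_existence", "CriticalTechnology"): Level.LOW,
    ("entity_value", "CriticalTechnology"): Level.HIGH,
    ("attribute_existence", "ResearchProject.TargetPlatform"): Level.HIGH,
    ("attribute_value", "Employee.HomeAddress"): Level.LOW,
    ("attribute_value", "Employee.HomePhone"): Level.LOW,
    ("association", ("Project", "Client")): Level.HIGH,
    ("association", ("Employee.Name", "Employee.Salary")): Level.LOW,
}


def classify(kind, target):
    """Return the label for an item; anything not listed is UNCLASSIFIED."""
    return RULES.get((kind, target), Level.UNCLASSIFIED)


def visible(clearance, kind, target):
    """Direct disclosure is prevented when clearance must dominate the label."""
    return clearance >= classify(kind, target)
```

A user cleared only to LOW can learn that Critical Technologies exist
but cannot see their values, and cannot see which client sponsors which
project. The sketch says nothing about inference, which depends on
knowledge outside any such table.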
On the other hand, eliminating inferences is difficult, if not impossible,
in that an individual's
capability to infer information is based on their already acquired knowledge
and the innovative
ways in which information can be combined [Morgenstern 88]. Much of the
knowledge that may
be used to infer unauthorized information is normally outside of the system
in the form of domain-
specific knowledge or even common sense. At one end of the spectrum, one
can ensure that no
unauthorized inferences are possible by classifying all information at
HIGH and then giving all
users a HIGH clearance. Although this approach solves the inference problem
(since every user is
authorized to access all information in the system), we now have introduced
new vulnerabilities
(e.g., users have clearances they should not have, and therefore access
to information they should
not see) and potentially decreased the operational efficiency of the organization
(e.g., having to
manually downgrade a large percentage of the information produced by the
system).
Establishing classification rules that will reduce unwanted inferences
involves a constant trade-off
between the desire to raise the classification of data to reduce inferences
and the desire to maintain
or even lower the classification of data to improve operational efficiency
without directly
disclosing classified data to those not cleared to see it. This process
of evaluating the relative
advantages of over-classifying information to prevent inferences is part
of the risk analysis activity
identified in Section 2.3.
At this point, a fourth type of classification rule, cardinal aggregation,
is often added. Cardinal
aggregation rules address the classification of collections of instances
of the same entity. Often,
these rules are established to reduce inference. At this point in the
framework we can conclude that the
high-level relationships between inference and aggregation are:
· inference is a vulnerability, e.g., authorized users can deduce
information not authorized
by the protection requirements, and
· aggregation is a class of classification policies that are stated
as one form of control, in
many cases, to reduce the inference vulnerability.
The ATC example includes a cardinal aggregation classification rule:
· the education entity and each of its attributes for all employees
are individually
UNCLASSIFIED, but
· the collection of the Degree and Major attributes of all instances
of the education entity is
HIGH.
The ATC cardinal aggregation classification rule illustrates the difficulty
of enforcing the intent of
the rule, even though we have specified the information to be protected
(i.e., the identity of Critical
Technologies must be protected at HIGH). The difficulty is to determine
how many employees'
educational information has to be aggregated to allow an inference. Is
it all employees on a
particular project? Ten employees? Ten percent of the employees? All employees
with blue eyes?
To have an effective implementation of a cardinal aggregation classification
rule, someone has to
state explicit criteria for which collections must be considered HIGH.
Unfortunately, cardinal
aggregation classification rules introduce a new vulnerability, as discussed
in the next section.
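To make the need for explicit criteria concrete, the following Python
sketch tracks how many distinct records each user has retrieved and
refuses a request once the accumulated collection would reach a stated
size. The class name, the per-user history, and the threshold value are
all hypothetical. Choosing the threshold is exactly the policy decision
discussed above, and this sketch does not address users who pool their
results.

```python
class AggregationMonitor:
    """Deny requests that would let one user accumulate a sensitive aggregate."""

    def __init__(self, threshold):
        # threshold: assumed number of distinct records at which the
        # collection is considered more sensitive than its elements.
        self.threshold = threshold
        self.released = {}  # user -> set of record ids already released

    def request(self, user, record_ids):
        seen = self.released.setdefault(user, set())
        combined = seen | set(record_ids)
        if len(combined) >= self.threshold:
            return False  # releasing these would complete the aggregate
        seen.update(record_ids)
        return True
```

With a threshold of 3, a user may retrieve records 1 and 2 in separate
requests, but a later request for record 3 is denied because the user's
accumulated collection would then contain three records.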
At this point it is appropriate to distinguish between an aggregation
policy, such as those discussed
above, and a Chinese Wall policy. Brewer and Nash introduced an important
commercial security
policy called the Chinese Wall policy [Brewer 89]. They described the
policy as follows:
It can most easily be visualized as the code of practice that must be
followed by a market
analyst working for a financial institution providing corporate business
services. Such an
analyst must uphold the confidentiality of information provided to him
by his firm's clients;
this means that he cannot advise corporations where he has insider knowledge
of the plans,
status, or standing of a competitor. However, the analyst is free to advise
corporations which
are not in competition with each other, and also draw on general market
information.
This type of policy seems to involve aggregates of information and therefore
is sometimes
considered an "aggregation problem." In fact, the Chinese Wall
policy does not meet the definition
of an aggregation problem; there is no notion of some information being
sensitive with the
aggregate being more sensitive. The Chinese Wall policy is an access control
policy where the
access control rule is not based just on the sensitivity of the information,
but is based on the
information already accessed. The situation can be considered similar
to having compartmented
data and subjects which can be cleared for exactly one compartment. The
choice of which
compartment the subject is cleared for is effectively made when the first
compartmented data is
viewed.
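The history-dependent nature of the Chinese Wall policy can be sketched
in a few lines of Python. This is a simplified reading of the access
rule described above, not a full model of [Brewer 89]; the company
names and conflict-of-interest classes are hypothetical. The decision
depends only on what the analyst has already accessed, not on a
sensitivity label.

```python
class ChineseWall:
    """Access is decided from the subject's access history."""

    def __init__(self, coi_class_of):
        self.coi_class_of = dict(coi_class_of)  # company -> conflict class
        self.history = {}  # analyst -> set of companies already accessed

    def may_read(self, analyst, company):
        cls = self.coi_class_of[company]
        for prior in self.history.get(analyst, set()):
            # A different company in the same conflict class is a competitor.
            if prior != company and self.coi_class_of[prior] == cls:
                return False
        return True

    def read(self, analyst, company):
        if not self.may_read(analyst, company):
            return False
        self.history.setdefault(analyst, set()).add(company)
        return True
```

An analyst who first reads data for one bank can still read data for an
oil company, but is then denied access to a second bank. A different
analyst with no history may read the second bank freely, which mirrors
the observation that the compartment is effectively chosen by the first
access.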
3.2.3 Vulnerabilities from Classification Rules
Figure 3.3 identifies three residual vulnerabilities given a set of classification
rules designed to
prevent direct disclosure and reduce inference (i.e., the vulnerabilities
of information
confidentiality protection requirements):
· Flawed Rules that allow Direct Disclosure: With complex operational
environments, the classification rules can be flawed with respect to
how faithfully they implement the information confidentiality
protection requirements. They can be inconsistent, incomplete, or just
incorrect such that they allow direct disclosure.
· Flawed Rules that allow an Unintended Inference: Although the
entire set of classification rules is intended to prevent unwanted
inferences, when faced with a complex set of classification rules there
may be opportunities for unintended inferences. Again, this may be
caused by rules that are inconsistent, incomplete, or incorrect.
· Cardinal Aggregation: If a cardinal aggregation rule has been
stated that classifies an aggregate as LOW but individual elements are
UNCLASSIFIED, then a vulnerability has been added since UNCLASSIFIED
users may be able to assemble aggregates in an unintended or
undetectable way.
3.2.4 Database Design
Database designers and application developers attempt to implement classification
rules by
accurately representing and attaching security labels to database objects.
The Relational Model
[Codd 70], one of several data models, is the model of choice for existing
commercial MLS
products. Figure 3.4 shows a relational model representation of the ATC
example using tuple-level
labeling.
The following are examples of how each type of classification rule can
be implemented in a
database design:
· The rule, "the existence of all instances of the Critical
Technologies entity is classified
LOW," is implemented by classifying the Critical Technologies table
metadata at LOW.
This is done by the database designer.
· The rule, "the value of all instances of the Critical Technologies
entity is classified
HIGH," is implemented by labeling each tuple in the Critical Technologies
table at HIGH. This
is enforced by the user entering the data.
· The rule, "all attributes of the Research Project entity
are classified LOW when the
Description attribute contains LOW information," is implemented by
having tuples
labeled UNCLASSIFIED or LOW based on the content of the Description attribute.
This
is enforced by the user entering the data.
· The rule, "the existence of Research Projects that use
Critical Technologies are classified
HIGH," is implemented by classifying each Research Project tuple
at HIGH when the
project uses a Critical Technology. This is enforced by the user entering
the data.
· The rule, "the existence of the Target Platform attribute
of Research Project is classified
HIGH," is implemented by decomposing the Target Platform attribute
(which would
normally be an attribute in the Research Project table) into a separate
table, Project-Target.
All tuples in Project-Target are classified HIGH. Note that the metadata
can also be
classified HIGH to prevent obtaining the knowledge that projects have
a target platform
that could contribute to unauthorized inferences.
· The rule, "all instances of the Target Platform attribute
are classified HIGH," from an
implementation standpoint, is redundant with the previous rule. Thus,
this rule requires no
changes to the database design.
· The rules for data association between the Project and Client
entities are, "the value of all
instances (i.e., all attributes) of the Client entity is classified LOW,"
"the value of all
instances of the Project entity (except those that use Critical Technologies)
are classified
LOW," and "the association between the Project entity and the
Client entity is HIGH."
These rules are implemented by labeling tuples (and metadata) in the Client
table as LOW,
labeling tuples in the Project table as LOW (except those that are labeled
HIGH because
they use Critical Technologies), and constructing an additional table,
Client-Project, to
represent the data association between Clients and Projects. The Client-Project
table
(metadata and tuples) is classified HIGH. Note that in the absence of
a data association
rule, the association between Project and Client tables would be represented
by a foreign
key from the Project table to the Client table.
Figure 3.4: ATC Example Database Design
· The data association rule, "the association between the Salary
attribute of the Employee entity
and Employee identifying attributes (name and number) is LOW," is
implemented by
decomposing the Salary attribute into a separate table, Employee-Salary.
All tuples in the
Employee-Salary table are labeled LOW.
· The cardinal aggregation rule established by the rules, "the
education entity and all its
attributes for all employees are UNCLASSIFIED" and "the collection
of the Degree and
Major attributes of the entity for all employees is LOW," is implemented
by labeling all
tuples in the Employee table as UNCLASSIFIED and requiring some (unspecified)
mechanism to control access to multiple records.
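The decomposition used for the Project and Client data association can
be sketched with Python's standard sqlite3 module. This is only an
illustration under stated assumptions: the table and column names, the
sample rows, and the integer encoding of labels are hypothetical, and
the label filter is written in application code here, whereas an MLS
DBMS would enforce tuple labels within its TCB.

```python
import sqlite3

UNCLASSIFIED, LOW, HIGH = 0, 1, 2  # assumed ordering of labels


def build_db():
    """Create Client, Project, and a separate association table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE client (client_id INTEGER PRIMARY KEY,
                             name TEXT, label INTEGER);
        CREATE TABLE project (project_id INTEGER PRIMARY KEY,
                              title TEXT, label INTEGER);
        CREATE TABLE client_project (client_id INTEGER,
                                     project_id INTEGER, label INTEGER);
    """)
    db.execute("INSERT INTO client VALUES (1, 'Client A', ?)", (LOW,))
    db.execute("INSERT INTO project VALUES (10, 'Project X', ?)", (LOW,))
    # The association itself is labeled HIGH, not the two entities.
    db.execute("INSERT INTO client_project VALUES (1, 10, ?)", (HIGH,))
    return db


def sponsors(db, clearance):
    """Return (client, project) pairs visible at the given clearance."""
    rows = db.execute("""
        SELECT c.name, p.title
        FROM client_project cp
        JOIN client c ON c.client_id = cp.client_id
        JOIN project p ON p.project_id = cp.project_id
        WHERE cp.label <= ? AND c.label <= ? AND p.label <= ?
    """, (clearance, clearance, clearance))
    return rows.fetchall()
```

A LOW user can list clients and projects separately but gets no rows
from sponsors(db, LOW), because the only link between the two tables
lives in the HIGH association table. Had the association been kept as a
foreign key in the Project table, it would have carried the LOW label of
the project tuple.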
3.2.5 Vulnerabilities from Database Design
Given a database design that attempts to faithfully represent the established
classification rules,
four vulnerabilities are identified in Figure 3.3:
· Faithful Representation of Flawed Rules: A database design that
faithfully implements a flawed set of classification rules will not fix
the flawed rules.
· Flawed Design Allowing Direct Disclosure: A flawed design may
allow a direct disclosure at the schema level for uniformly classified
data objects or at the content level for data objects where instances
may be classified at one of several levels.
· Flawed Design Allowing Unintended Inference: A flawed design
may allow unintended inferences at the schema level for uniformly
classified data objects, or at the content level for data objects where
instances may be classified at one of several levels.
· Cardinal Aggregation Labeled LOW: If the data is labeled LOW,
then a vulnerability exists, since LOW users may be able to access
aggregates in an unintended way.
These vulnerabilities are related to the values, application semantics,
and the security labels
assigned to data objects, not to the functionality (beyond enforcing a
MAC policy) or assurance
provided by the TCB. The TCSEC class of a TCB that enforces MAC has no
impact on countering
these vulnerabilities. An A1 TCB is just as effective (or ineffective)
as a B1 TCB in countering
these vulnerabilities.
3.2.6 Database Instantiation and Vulnerabilities
For completeness, Table 3.3 shows the database instantiation as the final
activity. Databases are
populated with data, and users use the database to conduct their job tasks.
The database
administrator or security officer is tasked with ensuring the system is
operating in a secure manner.
Assuming clearances are properly assigned to users, two vulnerabilities
are identified:
· Inappropriate Labeling Allowing Disclosure Current MLS systems,
including DBMSs,
depend on the user being logged in at the access class that correctly
represents the
classification of the data being added to the database. If a user with
a HIGH clearance
enters HIGH data when the user's current access class is LOW, then HIGH
data has been
directly disclosed, since users with only a LOW clearance can access the
HIGH data that
is incorrectly marked LOW.
· Inappropriate Labeling Allowing Unintended Inference In a similar
manner, a user with a
HIGH clearance may enter data at LOW that will allow other users with
only a LOW
clearance to infer HIGH data.
The TCSEC class of a TCB that enforces MAC also has no impact on countering
these
vulnerabilities.
3.3 OVERVIEW OF RESEARCH
This section provides a brief overview of the research that will be discussed
in Sections 4 and 5.
Each of the papers and research efforts on inference and aggregation addresses
one or more parts of
the framework. Table 3.2 lists formal approaches to defining the problems.
Table 3.3 maps
inference and aggregation research efforts against the framework in three
categories (techniques,
mechanisms, and analysis tools). As Table 3.3 shows, much of the research
has been in developing
techniques and designing mechanisms. There has also been limited work
in developing tools to
help address inference and aggregation. The remainder of the table shows
which vulnerabilities
each of the analysis tools and mechanisms attempts to address.
Inference                                   Aggregation
------------------------------------------  -------------------------------
Database Partitioning (Denning)             Security Algebra (Lin)
Set Theory (Goguen and Meseguer)            Modal Logic Framework (Cuppens)
INFER Function (Denning and Morgenstern)
FD and MVD (Su and Ozsoyoglu)
Sphere of Influence (Morgenstern)
Fact Model (Cordingley and Sowerbutts)
Inference Secure (Lin)
Inference Secure (Marks)
Imprecise Inference Control (Hale)
Table 3.2: Formal Definitions
Techniques
Mechanisms
Analysis Tools
Classification Rules
Entity and Attribute
Direct Association
Database Design
(Denning, Lunt)
Cardinal
Aggregation
Classification Rule Vulnerabilities
Direct Disclosure
Unintended
inference
Cardinal
Aggregation Rules
Faithfully
implements flawed
classification rules
Table 3.3: Research Summary (Part 1)
Techniques
Mechanisms
Analysis Tools
Data Design Vulnerabilities
Direct Disclosure at
schema level
Secondary Path Analysis
(Hinke, Binns)
Direct disclosure at
content level
Unintended
inference at schema
level
Basic Security Principle
(Meadows)
Semantic Data Models
(Buczkowski, Hinke,
Smith)
Classification
Constraints (Akl)
Constraint Satisfaction
(Morgenstern)
Integrity Constraints
(Denning)
Secondary Path Analysis
(Hinke, Binns)
DISSECT
(Garvey et al.)
AERIE
(Hinke and Delugach)
Constraint Processor
(Thuraisingham et al.)
Hyper Semantic Model
(Marks et al.)
Unintended
inference at content
level
Cardinal
Aggregation-LOW
Data
Separate Structures
(Lunt)
Use of Views (Wilson)
Restrict Ad hoc Queries
Use of Audit
Some or All Data HIGH
Aggregation Constraints
(Haigh and Stachour)
Aggregation Detection
(Motro et al.)
Database Instantiation Vulnerabilities
Labeling allows
disclosure
Constraint Processor
(Thuraisingham et al.)
DB Inference Controller
(Buczkowski)
Secondary Path
Monitoring (Binns)
Polyinstantiation
(Denning)
Auditing Facility
(Haigh and Stachour)
Snapshot Facility
(Jajodia)
Labeling allows
unintended inference
Table 3.3: Research Summary (Part 2)
SECTION 4
APPROACHES FOR INFERENCE CONTROL
Providing a solution to the inference problem is beyond the capability
of currently available MLS
database management systems. It is also recognized that a general solution
to the inference
problem is not possible given the inability to account for external knowledge
and human reasoning
[Binns 94a]. To provide partial solutions, the inference problem must
be bounded.
Section 4.1 identifies several efforts by researchers to formally define
the inference problem. These
approaches are summarized to provide different perspectives on the problem.
Implementing
theoretical inference solutions has proven to be difficult. Section 4.2
describes database design
techniques that could be used to prevent inference problems from occurring.
Mechanisms for
inference prevention and detection during run-time are described in Section
4.3. Finally, Section
4.4 describes several tools that assist the database designer to detect
and avoid inference channels.
4.1 FORMALIZATIONS
Researchers have proposed numerous ways to characterize or define the
inference problem. This
section presents several formal definitions from the ongoing research
into the discovery of
fundamental laws that determine whether the potential for undesirable
inferences exists.
4.1.1 Database Partitioning
Database partitioning is used by Denning to define the inference problem
[Denning 86a]. For each
user U, the data in the database can be partitioned into two sets: a VISIBLE
set and an INVISIBLE
set, as in Figure 4.1. The user U is allowed to access only elements from
the VISIBLE set, and U
is not allowed to know what elements are in the INVISIBLE set. (Note that
if VISIBLE is null, the
user has no access to database information and, therefore, the problem
is outside the domain of the
DBMS. In what follows, we assume that VISIBLE is not null.) Let KNOWN
denote the set of data
elements that is known to U. The set KNOWN is constructed by the user
either as a result of
previous queries or as a result of some external knowledge that U possesses.
There is no inference
problem if the intersection of the two sets INVISIBLE and KNOWN is empty,
as in Figure 4.2.
There is an inference problem if the two sets INVISIBLE and KNOWN intersect,
as in Figure 4.3.
Figure 4.1: Partitioning of the Database for a User U
Figure 4.2: Database without an Inference Problem
Figure 4.3: Database with an Inference Problem
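The partitioning test can be expressed directly with sets. The sketch below is illustrative only; the data element names and the function `has_inference_problem` are assumptions chosen for the example and are not taken from [Denning 86a].

```python
def has_inference_problem(invisible, known):
    """An inference problem exists when INVISIBLE and KNOWN intersect."""
    return bool(set(invisible) & set(known))


# Hypothetical data elements for one user.
database = {"flight", "cargo_weight", "destination", "cargo_type"}
visible = {"flight", "cargo_weight", "destination"}
invisible = database - visible

# KNOWN built only from previous queries against the VISIBLE set.
known_from_queries = {"flight", "destination"}
# KNOWN after the user also applies external knowledge.
known_with_outside = known_from_queries | {"cargo_type"}
```

Here `has_inference_problem(invisible, known_from_queries)` is false, while adding external knowledge of `cargo_type` makes the two sets intersect. The difficulty in practice is that the DBMS cannot observe the external part of KNOWN.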
4.1.2 Set Theory
Another formulation of the inference problem has been given by Goguen
and Meseguer [Goguen
84]. Consider a database in which each data item is given an access class,
and suppose that the set
of access classes is partially ordered. Define the relation > as follows:
Given data items x and y,
relation x > y is said to hold if it is possible to infer y from x.
The relation > is reflexive and
transitive. A set S is said to be inferentially closed if whenever x is
in S and x > y holds, then y
belongs to S as well. Now, for an access class L, let E(L) denote the
set consisting of all possible
responses that are classified at access classes dominated by L. There
is an inference problem if E(L)
is not inferentially closed.
Goguen and Meseguer do not set forth any one candidate for the relation
>; they merely require
that it be reflexive and transitive, and say that it will probably be
generated according to some set
of rules of inference (for example, first order logic, statistical inference,
monotonic logic, and
knowledge-based inference). They do note, however, that for most inference
systems of interest,
determining that A > b (where A is a set of facts and b is a fact)
is at best semidecidable; that is,
there is an algorithm that will give the answer in a finite amount of
time if A > b, but otherwise
may never halt.
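For a finite set of data items and an explicitly listed inference relation, the closure test is easy to compute, even though the general problem is only semidecidable. The following sketch assumes a hypothetical two-level ordering and hypothetical item names. Only direct inference steps are listed in `infers`; a set that is closed under the direct steps is also closed under their reflexive and transitive closure.

```python
# Pairs (a, b) meaning "a is dominated by b"; a hypothetical two-level ordering.
DOMINATED_BY = {("LOW", "LOW"), ("LOW", "HIGH"), ("HIGH", "HIGH")}


def responses_at(labels, level):
    """E(L): the items whose access class is dominated by `level`."""
    return {x for x, lab in labels.items() if (lab, level) in DOMINATED_BY}


def closure_violations(items, infers):
    """Pairs (x, y) with x in the set, x > y, and y outside the set."""
    return sorted(
        (x, y) for x in items for y in infers.get(x, ()) if y not in items
    )


# Hypothetical labels and direct inference steps (x > y).
labels = {"rank": "LOW", "unit": "LOW", "mission": "HIGH"}
infers = {"unit": {"mission"}}

low_violations = closure_violations(responses_at(labels, "LOW"), infers)
high_violations = closure_violations(responses_at(labels, "HIGH"), infers)
```

In this example E(LOW) is not inferentially closed because the LOW item `unit` allows the HIGH item `mission` to be inferred, so there is an inference problem at LOW. E(HIGH) contains every item and is closed.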
4.1.3 Classical Information Theory
Yet another definition that uses classical information theory has been
given by Denning and
Morgenstern [Denning 86a]. Given two data items x and y, let H(y) denote
the uncertainty of y, and
let Hx (y) denote the uncertainty of y given x (where uncertainty is defined
in the usual information-theoretic way). Then, the reduction in uncertainty
of y given x is defined
as follows:
$$\mathrm{INFER}(x > y) = \frac{H(y) - H_x(y)}{H(y)}$$
The value of INFER (x > y) is between 0 and 1. If the value is 0,
then it is impossible to infer any
information about y from x. If the value is between 0 and 1, then y becomes
somewhat more likely
given x. If the value is 1, then y can be inferred given x.
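A small numerical sketch may make the three cases concrete. It assumes that the reduction in uncertainty is normalized as (H(y) - Hx (y)) / H(y), which is consistent with the stated range of 0 to 1, and that a joint probability distribution over x and y is available, which is rarely true in practice. The function names and the sample distributions are hypothetical.

```python
from math import log2


def entropy(dist):
    """Shannon entropy (in bits) of a mapping from outcome to probability."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def infer(joint):
    """Return (H(y) - Hx(y)) / H(y) for a joint distribution {(x, y): p}.

    Assumes H(y) > 0, that is, y is not already certain.
    """
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    h_y = entropy(p_y)
    # Conditional entropy Hx(y): the average over x of H(y | x).
    h_y_given_x = 0.0
    for x0, px in p_x.items():
        if px == 0:
            continue
        cond = {y: p / px for (x, y), p in joint.items() if x == x0}
        h_y_given_x += px * entropy(cond)
    return (h_y - h_y_given_x) / h_y


# Hypothetical joint distributions over binary x and y.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
determined = {(0, 0): 0.5, (1, 1): 0.5}
partial = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
```

With these inputs, `infer(independent)` is 0 (x says nothing about y), `infer(determined)` is 1 (y follows from x), and `infer(partial)` lies strictly between the two.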
This formulation is especially useful since it shows that inference is
not an absolute problem, but
a relative one. It provides a way to quantify the bandwidth of the illegal
information flow. On the
other hand, Denning and Morgenstern point out its serious drawbacks [Denning
86a]:
1. It is difficult, if not impossible, to determine the value of Hx (y).
2. It does not take into account the computational complexity that is
required to draw the
inference.
To illustrate the second point, they give the following example from
cryptography: With few
exceptions, the original text can be inferred from the encrypted text
by trying all possible keys (so
Hx (y) = 0 and INFER (x > y) = 1 in this case); however, it is hardly practical to do so.
4.1.4 Functional and Multivalued Dependencies
Su and Ozsoyoglu have studied inference problems that arise because of
the functional and
multivalued dependencies that are constraints over the attributes of a
relation [Su 86,87,90].
Functional dependencies are defined below; the reader is referred to Date
for a definition of
multivalued dependencies [Date 86].
Let R be a relation scheme defined over attributes U, and let X and Y
be subsets of U. A functional
dependency X > Y is said to hold in R if for any relation r, the current
instance for R, r does not
contain two tuples with the same values for X, but different values for
Y; that is, given any pair of
tuples t and s of r, t[X] = s[X] implies t[Y] = s[Y].
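The definition can be checked mechanically against a relation instance. The sketch below is a straightforward test of the definition, not the labeling procedure of Su and Ozsoyoglu; the function `fd_holds` and the sample rows are hypothetical.

```python
def fd_holds(rows, lhs, rhs):
    """Check X > Y: any two tuples that agree on lhs must agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # Remember the first rhs value seen for this lhs value and
        # compare every later tuple against it.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical relation instance.
rows = [
    {"name": "Ames", "rank": "E5", "salary": 40},
    {"name": "Baker", "rank": "E5", "salary": 40},
    {"name": "Cole", "rank": "E7", "salary": 55},
]
```

In this instance `fd_holds(rows, ["rank"], ["salary"])` is true. If a low-level user knows that dependency and can read `rank`, then seeing one salary for a rank reveals the salary of everyone else with that rank, which is the kind of threat discussed next.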
Su and Ozsoyoglu illustrate how inference problems can arise if certain
functional dependencies
are known to the low-level users. If attributes are assigned security
labels in a manner that is
consistent with the functional dependencies, then these inference threats
can be eliminated. This
process is formalized by Su and Ozsoyoglu.
4.1.5 Sphere of Influence
Morgenstern expands upon the INFER function (described in Section 4.1.3)
to propose a
theoretical foundation for inference [Morgenstern 87,88]. He proposes
a framework for the
analysis of logical inferences and for the determination of logical inference
channels. Morgenstern
introduces the concept of a sphere of influence (SOI) and inference channel.
The SOI delimits the scope of possible inferences given some base data,
called the core, which may
consist of entities, attributes, relationships, and constraints. Specifically,
the sphere of influence
relative to some core, i.e., SOI (core), is the set of all information
that can be inferred from the core.
The SOI is defined in terms of the INFER function. The SOI models the
process by which a user's
knowledge of an application can give rise to inferences about additional
information. The SOI
utilizes a forward chaining inference process from the given core to determine
the scope of such
inferable data.
Morgenstern's concept of inference channel serves to isolate the lower
level data which could give
rise to inferences about higher level data H. The computation of an inference
channel is a
backward-chaining inference process from some resultant data H to determine
all information
which contributes to upward inferences about H. An inference channel exists
if information about
some set of data H may be inferred from data in another set C which is
at a lower level than H, or
is incomparable relative to H in the classification lattice.
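The two chaining directions can be sketched with simple rules. This is a simplification and not Morgenstern's formulation: it uses all-or-nothing rules of the form (premises, conclusion) in place of the graded INFER function, and the rule set, item names, and function names are hypothetical.

```python
def sphere_of_influence(core, rules):
    """Forward chaining: every item inferable from `core`.

    `rules` is a list of (premises, conclusion) pairs; a rule fires when
    all of its premises are already known.
    """
    known = set(core)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


def contributors(target, rules):
    """Backward chaining: every item that can contribute to inferring target."""
    found, stack = set(), [target]
    while stack:
        goal = stack.pop()
        for premises, conclusion in rules:
            if conclusion == goal:
                for item in premises:
                    if item not in found:
                        found.add(item)
                        stack.append(item)
    return found


# Hypothetical application rules.
rules = [
    (("flight", "date"), "route"),
    (("route", "cargo_weight"), "mission"),
]
soi = sphere_of_influence({"flight", "date", "cargo_weight"}, rules)
channel_sources = contributors("mission", rules)
```

If `mission` is HIGH and every item in `channel_sources` that a LOW user needs is LOW, the backward search has located a candidate inference channel.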
4.1.6 The Fact Model
Sowerbutts and Cordingley provide a formal definition of inference based
on an abstract data
model called the Fact Model [Cordingley 89, 90; Sowerbutts 90]. They identify
two aspects of the
inference problem: static inference and dynamic inference. Static inference
is where knowledge of the
database, together with authorized facts, can be used to determine
unauthorized facts. Dynamic
inference is where database operations can be used together with knowledge
of the structure of the
database to determine unauthorized facts. The Fact Model addresses static
inference.
The Fact Model is a model of knowledge described as a set of facts and
a set of constraints. The
authors provide a language for specifying application-dependent facts
and constraints. The Fact
Model provides a formal statement of the standard constraints, such as
functional and multivalued
dependencies, and defines explicitly the inference threat associated with
each [Cordingley 89]. The
relationship between facts and subfacts is defined by construction rules.
An INFER function is
defined on facts, constraints, and construction rules. Given an arbitrary
subset of facts and
constraints, the INFER function identifies all the facts that are implicit
in the given subset, and
hence can be inferred by a user. At this point the classification attribute
can be added for each fact.
This approach forms the basis for assigning classifications such that
any facts inferable from a
given subset of facts whose classifications are dominated by a classification
level k also have a
classification level dominated by k. The authors call this a Static Inferentially
Secure Database.
The Fact Model also is extended to represent operations such that a Dynamic
Inferentially Secure
Database is defined. The Fact Model was developed as an abstraction of
the relational model but
can be used to describe any conventional database structure such as network
or hierarchical.
4.1.7 Inference Secure
Lin introduces the concept of navigational inference as the process of
accessing HIGH data via
navigating through legally accessible LOW data [Lin 92]. Lin states that
inference problems exist
if the security classifications of data are inconsistent with some database
structures. He identifies
three such inference problems:
· Logical Inference Problems The security classification of a
theorem (a derivable formula)
in a formal theory should be consistent with its proof. If not, then logical
inference
problems arise.
· Algebraic Inference Problems The security classification of
"algebraically derivable data"
should be consistent with its relational algebraic structure. If not,
algebraic inference
problems arise.
· Navigational Inference Problems The security classification
of data should be consistent
with its navigational operators (which generate navigational paths). If
not, navigational
inference problems arise.
Lin defines a multilevel data model as inference secure if:
(1) it is a Bell-LaPadula Model (secure under MAC), and
(2) it is navigational inference free.
Lin provides further descriptions and theorems to support the definitions
[Lin 92].
4.1.8 An Inference Paradigm
Marks defines database inference as [Marks 94b]:
Inference in a database is said to occur if, by retrieving a set of tuples
{T} having attributes
{A} from the database, it is possible to specify a set of tuples {T'},
having attributes {A'},
where {T'} is not a subset of {T} or {A'} is not equal to {A}.
The definition may be stated as an inference rule: IF ({T}, {A}) THEN
({T'}, {A'}), which may
be denoted as ({T}, {A}) > ({T'}, {A'}). However, it may be possible
to retrieve a set of tuples
{T} and the associated attributes {A} from a database and to reason, using
data and attributes
outside the database, to arrive at a set of tuples {T'} and attributes
{A'} that are again within the
database. That is, it is possible to form a chain of reasoning, ({T},
{A}) > ({T1}, {A1}) > ({T2},
{A2}) >... > ({T'}, {A'}) where some of the tuples and/or attributes
are outside the database.
When some of the data or attributes are outside of the database, the chain
of reasoning cannot be
followed by the database system. Fortunately it is not necessary to actually
follow such a chain of
reasoning in order to control the inference threat. If the database contains
the endpoints of the chain
(({T}, {A}) and ({T'}, {A'})) there will exist what is referred to in
logic systems as a material
implication relating the two sets.
If a material implication exists such that ({T}, {A}) > ({T'}, {A'})
where ({T}, {A}) is classified
LOW and ({T'}, {A'}) is classified HIGH, then the database can offer no
assurance that there does
not exist some chain of inference, using outside knowledge, that can connect
the two, enabling a
LOW user to infer HIGH information. If, however, ({T}, {A}) not > ({T'},
{A'}) within the
database, then it can be guaranteed that no chain of inference, using
outside knowledge or not,
exists which connects the two sets. That is, the absence of a material
implication between two sets
of data is sufficient to guarantee the absence of any chain of reasoning
between the sets of data. It
is not necessary for inference control, however, since material implications
may be coincidental,
and not related to any reasoning process. These arguments may be reduced
to:
Limitations on Database Inference: ({T}, {A}) > ({T'}, {A'}) is an
inference rule capable
of being controlled by the database if and only if all the tuples {T}
and {T'} are in the database
and all the properties {A} and {A'} are attributes in the database.
4.1.9 Imprecise Inference Control
Hale, Threet, and Shenoi introduce a powerful, yet practical, formalism
for modeling and
controlling imprecise functional dependency (FD) based inference in relational
database systems
[Hale 94]. The existence of an imprecise FD implies that if some tuple
components satisfy certain
equivalencies, then other tuple components must exist and their values
must be equivalent.
Imprecise FDs can specify constraints on precise and imprecise data. Examples
of imprecise FDs
are: "engineers have starting salaries about 40K" and "approximately
equal qualifications and
more or less equal experience demand similar salary."
The authors formally define an imprecise FD and, from it, an
imprecise
inference channel. Informally, an imprecise inference
channel is a chain of
imprecise FDs. To control and ultimately eliminate compromising imprecise
inferences, it is
necessary to specify a set of imprecise inference channels considered
to be secure. This
compromise specification set is defined by the database administrator.
Suspect channels in a
database are compared with these secure channels. A potential security
compromise exists when a
suspect channel allows the inference of information which is not coarser
than information inferred
by any secure channel. Potentially compromising imprecise inference channels
are eliminated by
hiding relation attributes or by "clouding" data manifested
by imprecise FDs in the compromising
channels.
4.2 DATABASE DESIGN TECHNIQUES
The problem of deciding how to label multilevel database objects - data,
schemata, and constraints
- should not be a problem for cleared database designers and users [Millen
91]. They ought to know
the classification of whatever they are entering into the database. Unfortunately,
many inferences
are not simple ones, and the number and complexity of potential inferences
can be quite large.
This raises the question of how well a person can anticipate such inferences
in order to classify
the data [Morgenstern 88]. The following subsections examine techniques
for avoiding inference
problems in the first place and for detecting inference channels during
database design.
4.2.1 Inference Prevention Methods
Ideally, if all unauthorized disclosures are to be prevented, the Basic
Security Principle [Meadows
88a, 88b] should be followed:
Basic Security Principle for Multilevel Databases - The security class
of a data item should
dominate the security classes of all data affecting it.
The reason for the Basic Security Principle is clear: if the value of
a data item can be affected by
data at levels not dominated by its own level, information can flow into
the data item from data at
other levels.
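A check of the Basic Security Principle over a known set of dependencies can be sketched as follows. The sketch assumes a totally ordered pair of levels and a hypothetical `affects` mapping that records which data items affect each item; constructing that mapping completely is the hard part in practice.

```python
# Hypothetical two-level total order.
LEVELS = {"LOW": 0, "HIGH": 1}


def bsp_violations(labels, affects):
    """Return (item, source) pairs where the item's class does not
    dominate the class of a data item that affects it."""
    return sorted(
        (item, src)
        for item, sources in affects.items()
        for src in sources
        if LEVELS[labels[item]] < LEVELS[labels[src]]
    )


# Hypothetical data items: total_cost is computed from the other two.
labels = {"unit_cost": "LOW", "quantity": "HIGH", "total_cost": "LOW"}
affects = {"total_cost": {"unit_cost", "quantity"}}

before = bsp_violations(labels, affects)
after = bsp_violations({**labels, "total_cost": "HIGH"}, affects)
```

Here the LOW item `total_cost` is affected by the HIGH item `quantity`, so the principle is violated until `total_cost` is relabeled HIGH.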
The task of predicting or detecting all inference problems appears to
be very difficult. However,
many of these problems can easily be prevented by careful consideration
of the data items on which
a data item is predicated. In practice, unfortunately, two problems with
the Basic Security Principle
have been identified:
1. The number of potential inferences can be quite large.
2. All possible inferences cannot be anticipated.
The combination of these problems makes the task of appropriately classifying
data complex.
However, several techniques that have been developed for dealing with
inferences are discussed
below. If information x permits disclosure of information y, one way to
prevent this disclosure is
to reclassify all or part of information x such that it is no longer possible
to derive y from the
disclosed subset of x. There are two approaches to dealing with violations
of this type: reclassify
either the data or the constraints appropriately. These approaches are
discussed next.
4.2.1.1 Appropriate Labeling of Data
One of the approaches suggested to handle the inference problem is to
design the multilevel
database in such a way that the Basic Security Principle is maintained
[Binns 92a; Burns 92; Hinke
92; Garvey 93; Smith 90b; Thuraisingham 90c]. Security constraints are
processed during
multilevel database design and subsequently the schemas are assigned appropriate
security levels.
Several researchers have proposed that semantic database models be used
to detect (and then
prevent) some inference problems [Buczkowski 90; Hinke 88; Smith 90b,
91]. Conventional data
models (such as hierarchical, network, and relational data models) use
overly simple data
structures (such as trees, graphs, or tables) to model an application
environment. Semantic
database models, on the other hand, attempt to capture more of the meaning
of the data by
providing a richer set of modeling constructs. Since integrity and secrecy
constraints can be
expressed naturally in semantic database models [Smith 90b], they can
be used to detect inference
problems during the database design phase.
Classification (or secrecy) constraints can also be used to eliminate
inference problems [Akl 87].
Classification constraints are rules that are used to assign security
levels to data as they are entered
into the database. Classification constraints are required to be consistent
(meaning that each value
is assigned a unique security level) and complete (meaning that rules
assign each value a security
level). Inconsistent classification constraints indicate potential inference
problems, and incomplete
classification constraints point to incomplete labeling rules. Two different
methods have been
proposed for determining the consistency and completeness of the classification
constraints. One
is based on logic programming [Denning 86b], and the other is based on
computational geometry
[Akl 87].
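The two properties can be illustrated by testing a rule set against sample values. This is neither the logic programming method nor the computational geometry method cited above; it is a brute-force check over a finite sample, and the rules, thresholds, and function name are hypothetical.

```python
def check_constraints(rules, values):
    """Test classification rules against sample values.

    `rules` is a list of (predicate, level) pairs. Returns a pair
    (unlabeled, conflicting): values that no rule labels (incomplete)
    and values that receive more than one level (inconsistent).
    """
    unlabeled, conflicting = [], []
    for value in values:
        levels = {level for predicate, level in rules if predicate(value)}
        if not levels:
            unlabeled.append(value)
        elif len(levels) > 1:
            conflicting.append(value)
    return unlabeled, conflicting


# Hypothetical salary rules with overlapping ranges (inconsistent).
overlapping = [
    (lambda s: s >= 100000, "HIGH"),
    (lambda s: s < 120000, "LOW"),
]
# Hypothetical salary rules with a gap at exactly 100000 (incomplete).
gapped = [
    (lambda s: s > 100000, "HIGH"),
    (lambda s: s < 100000, "LOW"),
]
```

A sample can only expose problems at the values it contains, which is one reason the cited methods reason about the constraints themselves.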
Morgenstern views the overall process of classifying a database as a
constraint satisfaction
problem of the following form [Morgenstern 88]. Each potential inference
which involves one or
more relations or objects (data) from the database serves as a constraint.
It limits the classifications
which can be assigned to an object given the classification(s) of the
other object(s). In some cases,
the constraint may uniquely determine the classification of the remaining
data object. Morgenstern
defines safety for inference: a data object O, and its assigned classification
label, are said to be safe
for inference if there is no upward inference possible from O. That is,
object O cannot be used to
infer information about other data objects at higher or incomparable levels
of the classification
lattice. A classification level is safe for inference if all the data
objects having that classification
are safe.
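The safety definitions can be sketched for a known list of potential inferences. The sketch assumes a total order of two levels, so it does not cover incomparable levels, and it treats an inference as upward when a user cleared for every premise is not cleared for the target. The names and sample data are hypothetical.

```python
# Hypothetical two-level total order.
LEVELS = {"LOW": 0, "HIGH": 1}


def unsafe_objects(labels, inferences):
    """Objects that take part in at least one upward inference.

    `inferences` is a list of (premises, target) pairs.
    """
    unsafe = set()
    for premises, target in inferences:
        needed = max(LEVELS[labels[p]] for p in premises)
        if needed < LEVELS[labels[target]]:
            unsafe.update(premises)
    return unsafe


def level_is_safe(level, labels, inferences):
    """A level is safe if every object labeled at that level is safe."""
    bad = unsafe_objects(labels, inferences)
    return all(obj not in bad for obj, lab in labels.items() if lab == level)


labels = {"flight": "LOW", "cargo_weight": "LOW", "mission": "HIGH"}
inferences = [(("flight", "cargo_weight"), "mission")]
relabeled = {**labels, "cargo_weight": "HIGH"}
```

With the original labels, both LOW objects are unsafe. Raising `cargo_weight` to HIGH satisfies the constraint, since a LOW user can no longer assemble all of the premises.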
The techniques described in this section rely on increasing the classification
of one or more
elements in an offending path following its detection. Note that in practice,
however, it is not
always possible to solve inference problems by raising the classification
level of data that cause
inference violations. Some data may be pub |