Microsoft® Office System XML File Formats
Introduction
This document provides information regarding Microsoft Office System
XML file formats. Unlike HTML, XML is a meta-language enabling companies
to define their own file formats through the creation of custom-defined
XML schemas. The Microsoft Office System supports the creation and
manipulation of an unlimited set of XML file formats as needed by
customers to define the content of their business documents. The
Microsoft Office System also supports two fully documented
application-specific document schemas (Word ML, introduced in Office
2003, and Spreadsheet ML, introduced in Office XP) that primarily
function to store information about document display and about
functionality that is specific to Word and Excel and tied to Microsoft's
own innovations.
Abstract
With many companies and industries having seen great benefit from the
use of XML, primarily for exchanging data between back-end servers, the
introduction of XML-enabled desktop applications with support for
customer-defined schemas carries the potential for more cross-platform
integration for even greater data access and collaboration benefits. The
growing industry trend to support customer-defined schema is enabling
information workers to create and interact with documents that contain
regions of meaning, in the same way that information in a structured or
relational database has meaning. With such support, XML brings the power
of traditional data management to bear on documents, facilitating reuse,
indexing, search, storage, aggregation, and other practices more often
associated with management of relational databases. The Microsoft Office
System's XML support for customer-defined schemas in the 2003 versions
of the Office applications puts XML in the hands of information workers
and enables businesses to reap the full benefit and promise of this
revolutionary technology.
1. The Value of XML
Today, XML is a widely accepted industry standard
that enables exchange of data between disparate systems. The use of XML
and the advent of the XML Web services architecture is set to
revolutionize the way many companies, or in some cases entire
industries, transact business on the Web. However, XML represents more
than a simple data or document exchange protocol; its founders
originally envisioned XML as a way to capture more of the meaning
locked within business documents by defining the structure and context
of the information these documents contain instead of only describing
the way those documents are displayed.
Although there are well-established methods for
storing and managing data types (for example, numerical data in
databases), a significant portion of the information created in the
business environment is not captured in any meaningful way and thus
can't really be accessed or reused. Workers everywhere generate reports,
e-mail messages, documents and spreadsheets that contain vital, valuable
information. But if they need to reuse this information, these same
workers may also spend significant time searching for the appropriate
files and subsequently spend time and effort to re-key, cut, paste, or
otherwise import the relevant information into another document. The way
these documents are created and handled tends to limit the extent or
ease with which the information can be used outside the original
document.
While data capture and validation are well-established
practices in traditional data management, the technology
to similarly gather and manage the information contained in text-based
reports and other common business documents has not been available. This
is the problem that XML was created to solve, by enabling the use of
custom-defined schemas. XML enables businesses to capture all manner of
business information in a way that maximizes its value. By facilitating
reuse, indexing, search, storage, aggregation, and other practices more
often associated with management of relational databases, XML brings the
power of traditional data management to bear on documents.
XML enhances interoperability across heterogeneous
and cross-platform systems at the new and fundamental data level for
documents, allowing data to be restructured, aggregated and presented in
new and different ways rather than simply allowing a second platform to
display the same data in the same order.
2. How does XML enable the creation of custom document file formats
through custom-defined schemas?
XML is a markup language that is
used to identify structure within a document. The XML standard is
published and maintained by W3C, the consortium that maintains many of
the standards for the World Wide Web. Microsoft has designed its Office
applications to adhere to the W3C's published standard.
Like other markup languages, XML uses tags
to define specific elements within a document. XML tags define the
document's structural elements and the meaning of those elements. Unlike
HTML tags, which specify how a document looks, or is formatted, XML can
be used to define the document structure and content—not just the look
and feel. By doing so, XML separates a document's content from its
presentation, thereby enabling developers to access and manage this
content in many meaningful ways.
The tags that can be used for a particular
document type or information type are contained in what are called
XML schemas, which define the set of tags and the rules for
applying them. Schemas thus define the structure and type of data that
each data element in a document can contain. Schemas can be created
by a number of entities, such as a user, a company, or an industry.
With that understanding in mind, XML schemas can
be created to define and qualify content for virtually any application.
For example, the information that can be found in documents about
finance, insurance, health care, automobile manufacturing, government
regulation, insurance claim processing or visa applications can be very
different (or identical) for each type of document. In order to best
describe the inherent differences, different XML schemas are needed to
describe the type of information in each document. While the underlying
structure of a document enriched with schemas is XML, defining the type
of information in documents is in essence creating a file format for
each different type of document: Financial tags such as <Amount> or <ROI>
will be used in a file format for financial documents, and tags such as
<DateOfTreatment> or <PatientRecord> for healthcare documents.
With the use of custom-defined XML schemas, the
information that is needed can be extracted easily and correctly from
any document at any time and interchanged with any organization or other
application using database-related techniques.
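As a minimal sketch of this kind of extraction, the following Python example reads a small financial document and pulls values out by tag name. The <Amount> and <ROI> tags come from the example above; the <FinancialReport>, <Investment> and <Name> elements, the currency attribute, and the figures are assumptions made up for illustration, not part of any Office schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical financial document. Only <Amount> and <ROI> are taken from
# the text; every other element name and all values are assumed.
FINANCIAL_DOC = """\
<FinancialReport>
  <Investment>
    <Name>Plant upgrade</Name>
    <Amount currency="USD">125000.00</Amount>
    <ROI>0.12</ROI>
  </Investment>
  <Investment>
    <Name>Training program</Name>
    <Amount currency="USD">30000.00</Amount>
    <ROI>0.08</ROI>
  </Investment>
</FinancialReport>
"""


def extract_investments(xml_text):
    """Return a list of (name, amount, roi) tuples found by tag name."""
    root = ET.fromstring(xml_text)
    rows = []
    for investment in root.findall("Investment"):
        name = investment.findtext("Name")
        amount = float(investment.findtext("Amount"))
        roi = float(investment.findtext("ROI"))
        rows.append((name, amount, roi))
    return rows


investments = extract_investments(FINANCIAL_DOC)
# Aggregate across the document much as one would across database rows.
total_amount = sum(amount for _, amount, _ in investments)
print(investments)
print(total_amount)
```

Because the values are addressed by name rather than by position on the page, the same function keeps working however the document is formatted for display.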
In spite of this potential, XML has yet to be
fully exploited through desktop applications. With the XML support for
customer-defined schemas now introduced in the 2003 versions of the
Office applications, customers can begin to reap the full benefit and
promise of this revolutionary technology through better capture and
reuse of information, easier connections to data, and
intelligent applications.
Capture and Reuse of Information
XML-based documents enable organizations to
capture more of the intellectual property that is created on an ongoing
basis. Customer-defined schemas are analogous to the columns and tables
in a database; thus, documents of all kinds become a source of
information as rich as any other operational data store. Once captured,
this information becomes a very valuable corporate asset. By defining
their own XML schema, organizations gain the ability to decide exactly
what data to capture and how this data is structured. With documents
functioning at this level of "storage," companies have the capability to
aggregate, parse, search, manage, and reuse documents, document
fragments and domain knowledge in the same way they do their other
business data.
For users, the ability to search for specific
information and to aggregate information from numerous sources
eliminates many of the time-consuming, error-prone tasks associated with
document creation and update; for example, opening and closing files to
find information; cutting and pasting information between documents; and
searching for labels to combine data in like fields.
Connecting Users to Data
XML is widely regarded as a standard for data
exchange and for exposing the information contained in databases or
back-end systems. Built on open, industry standards, today's XML web
services provide a universal way to connect users to data in order to
allow communication between business systems and data sources, or
between systems that are written in different languages on different
platforms. By providing XML-enabled applications on the desktop,
companies can take advantage of a Web services infrastructure to empower
employees and enable them to connect directly to enterprise systems and
data sources.
Using Office technologies, companies can take full
advantage of XML data and XML Web services by accessing the information
directly and then dynamically surfacing the right information, in the
right form, to where it's needed in the spreadsheet
or word processor for analysis, formatting, publishing or other type of
processing. The result is a flexible, richer, more integrated desktop
environment. Throughout, the information retains its meaning from
customer-defined XML tags (healthcare tags such as <PatientRecord>
retain their meaning in the word processor or in the spreadsheet, even
if the <PatientRecord> data is presented in a formatted bold paragraph
in the word processor or in a table grid in the spreadsheet). Without
these, the information would become a random group of bytes that could
not be subsequently reused in an intelligent or automated fashion.
Intelligent Applications
XML offers exceptional potential for automating
virtually any task that involves working with documents. Creating
documents such as reports, spreadsheets, and forms with an attendant XML
schema—even if that schema is hidden to the users—enables developers to
build interoperable solutions that recognize the structure and meaning
of the content within those documents and respond intelligently to the
user. Information from customer-defined schemas can also be used to
validate information or data as it is entered, avoiding errors and
aiding in data cleansing and standardization.
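The validation idea can be sketched in a few lines of Python. The standard library does not include an XSD validator, so hand-written rules stand in for a customer-defined schema here; this is a simplified substitute, not schema validation itself. The <PatientRecord> and <DateOfTreatment> tags follow the healthcare example used earlier, while <PatientName>, <Charge> and the date format are assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def _is_date(text):
    """True if text is a date in the assumed YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def _is_number(text):
    """True if text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Hypothetical rules: each required child element and a check for its text.
# A real solution would express these constraints in an XSD instead.
RULES = {
    "PatientName": lambda text: bool(text.strip()),
    "DateOfTreatment": _is_date,
    "Charge": _is_number,
}


def validate_record(xml_text):
    """Return a list of problems found in one <PatientRecord>."""
    root = ET.fromstring(xml_text)
    errors = []
    for tag, check in RULES.items():
        text = root.findtext(tag)
        if text is None:
            errors.append(f"missing <{tag}>")
        elif not check(text):
            errors.append(f"invalid <{tag}>: {text!r}")
    return errors


good = (
    "<PatientRecord><PatientName>A. Example</PatientName>"
    "<DateOfTreatment>2003-05-14</DateOfTreatment>"
    "<Charge>80.50</Charge></PatientRecord>"
)
bad = (
    "<PatientRecord><PatientName>A. Example</PatientName>"
    "<DateOfTreatment>14/05/2003</DateOfTreatment></PatientRecord>"
)
print(validate_record(good))
print(validate_record(bad))
```

An application could run a check of this kind as the user types and flag the offending field before the document is saved.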
The ability for companies to define their own
schemas allows them to identify the unique regions of meaning within
their documents and to create solutions that correlate these structures
to their own business processes. Because the actual content is separate
from the presentation of the content, these solutions can be tailored to
display the same information in many different ways, as appropriate for
a particular task, user, or process.
Moreover, the ability to identify sections of a
document structurally—or to recognize specific content within a
section—allows developers to create applications that respond
intelligently to user input, offering context-sensitive actions and
guidance, suggesting required content, or providing supporting data or
links to related information.
Because the client software understands the
content in the document, through the custom-defined XML tags,
intelligent applications present endless possibilities for helping users
interact with documents. The advent of such solutions will revolutionize
the way users create and work with documents. Intelligent applications
will guide and facilitate the creation of documents, reducing the time
spent on traditionally manual tasks.
3. Microsoft Word, Excel and Access support for custom-defined XML
schema
XML support in Microsoft Word 2003 enables
authoring and saving of rich content in custom file formats based on
customer-defined XML schemas, enabling the repurposing of document
content across devices, platforms and processes. Support for
customer-defined XML schemas enables users to preserve or extract from
the document selectively the data or structural elements of interest to
a particular application. In either case, users can create documents
containing information marked by XML tags belonging to the
custom-defined schema in a completely intuitive fashion; users need not
learn or understand the concepts behind XML to realize the full benefit.
Organizations, on the other hand, greatly benefit from the
aforementioned advantages of using a custom-defined XML schema to ensure
that the information the user enters is of a high quality, has
contextual meaning for the business process at hand and can be easily
reused.
Customers who use Microsoft Excel 2003 for
importing and analyzing business data will benefit from the enhanced XML
capabilities introduced in this version. Like Word 2003, Excel 2003 can
read custom XML files as defined by customer-defined schemas. This
enables Excel 2003 to act as a smart client for XML Web services and a
host for smart document solutions that require analytical and
calculation capabilities rather than rich text formatting.
Additional Excel XML capabilities include a visual
tool for ease of mapping between the spreadsheet grid and
customer-defined XML schema. This enables developers or power users to
more easily import or export data in Excel to or from enterprise data
stores or Web services.
Microsoft Access 2003 enables Office users to
import and extract XML data from database tables using custom-defined
schemas and XML transforms. Access 2003 also enables the creation of a
custom-defined XML schema derived from the database schema. These
capabilities facilitate the integration of Access data with related
business processes or documents and allow users to control exactly how
the data is represented in XML.
Other Uses of XML in Microsoft Office System applications
New with the Microsoft Office System, InfoPath
2003 uses a forms metaphor to capture information according to a
customer-defined XML schema. InfoPath enables customers to gather and
reuse information with predefined structure (pre-tagging) and as part of
a business process. InfoPath supports only XML file formats based on
customer-defined schemas, enabling users to interoperate with any
Microsoft or non-Microsoft platform that produces or consumes XML files
conforming to the customer's XSD.
FrontPage 2003 lets users quickly build
high-quality, data-driven Web sites that present dynamic views of
information from enterprise systems or local data stores. FrontPage
supports a complete set of tools for creating and editing Web pages that
connect to a variety of data sources, including XML files that follow
customer-defined XML schemas, databases and XML Web services. Users
control how data will be displayed in a Web page by creating XSL-T
transforms using an intuitive, graphical editor. These data views
include industry-standard reporting tools for sorting, grouping,
filtering, and conditionally formatting data. By supporting XML files
that follow customer-defined schemas, FrontPage enables users to
interoperate and construct web sites using data that have been created
on Microsoft or non-Microsoft platforms.
Visio 2003 drawing and diagramming software gives
users the capability to integrate information from a database into a
diagram. Diagrams saved as Visio XML files can incorporate XML data
that follows a customer-defined schema and can later be mined to
retrieve data from within the diagram. This enables
developers to create rich Visio solutions for modeling business
processes, or that associate data from any XML data source with specific
shapes or diagram elements.
Interoperability and Heterogeneous, cross-platform data interchange
Support for custom-defined schema in Office 2003 is the fundamental
enabler for data interoperability. Documents can be created in Office
2003 following the XML format defined by the customer using the W3C XSD
standard. Any Microsoft or non-Microsoft XML environment, client or
server, that supports the W3C XML and XSD standards can then consume those
documents. XML documents created by Microsoft or non-Microsoft systems
that conform to the customer-defined XSD can be read and analyzed by the
Office 2003 system. The wide adoption of XML standards, along with
customer-defined XML schema capabilities across the Microsoft Office
System, opens doors to many new innovative applications that can lead to
better use and reuse of information.
4. Word and Excel support for Application-specific XML Document
Schemas
In addition to Microsoft supporting the W3C XML
standards and integrating innovations such as custom-defined schema and
XML Web services, Microsoft takes XML to another dimension by offering
Spreadsheet ML and Word ML to provide customers with added
functionalities.
Microsoft Excel 2002 introduced Spreadsheet ML, a
display-oriented XML file format that uses XML tags to store display and
presentation characteristics and spreadsheet functionality. For example,
this display-oriented XML schema uses a <cell> tag and a <row> tag. This
file format is useful in scenarios where customers want to dynamically
construct a spreadsheet file on a server without using Excel directly,
which can be done using XML. However, while data can be easily accessed
and retrieved, it is difficult for this spreadsheet display format to be
used in output or storage scenarios since the row, cell and column
information isn't descriptive enough for subsequent business use. Data
is better expressed with XML tags chosen by an organization and
reflecting the content of the data (such as a <price> tag or a
<Monthly-Results> tag). Early indications from the Microsoft Office 2003
beta program lead us to expect that the majority of Excel users will use
the new customer-defined schema capabilities to import and analyze XML
data in Excel 2003. The XML file format for Excel 2003 stores the
customer-defined schema information in the same file with the other
spreadsheet XML tags and standard XML techniques and tools can be used
to easily extract any subset of information for reuse.
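The difference between display-oriented and content-oriented markup can be illustrated with a short Python sketch. Both fragments below are invented for the comparison: the first uses generic row and cell tags in the spirit of a grid format and is not the actual Spreadsheet ML schema, and the second uses the <price> and <Monthly-Results> tags mentioned above with an assumed <result> wrapper and month attribute.

```python
import xml.etree.ElementTree as ET

# Illustrative display-oriented markup (NOT the real Spreadsheet ML schema).
GRID_XML = """\
<table>
  <row><cell>Month</cell><cell>Price</cell></row>
  <row><cell>January</cell><cell>19.99</cell></row>
  <row><cell>February</cell><cell>21.50</cell></row>
</table>
"""

# Illustrative content-oriented markup with organization-chosen tags.
# <result> and its month attribute are assumptions.
CUSTOM_XML = """\
<Monthly-Results>
  <result month="January"><price>19.99</price></result>
  <result month="February"><price>21.50</price></result>
</Monthly-Results>
"""


def prices_from_grid(xml_text):
    """Positional lookup: the caller must already know that row 0 is a
    header and that column 1 holds the prices."""
    rows = ET.fromstring(xml_text).findall("row")
    return [float(row.findall("cell")[1].text) for row in rows[1:]]


def prices_from_custom(xml_text):
    """Named lookup: the tag itself says what the value means."""
    return [float(p.text) for p in ET.fromstring(xml_text).iter("price")]


print(prices_from_grid(GRID_XML))
print(prices_from_custom(CUSTOM_XML))
```

Both functions return the same numbers, but the first breaks if a column is inserted or the header row is removed, whereas the second depends only on the meaning carried by the tag name.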
Microsoft Word 2003 introduces Word ML, a
display-oriented XML file format that preserves the formatting and
presentation of the Word document, including formatting, hyperlinks,
paragraphs, tables and styles. Word ML also provides storage information
for the entire feature set of Word 2003, including the new, advanced
capabilities around smart tags, smart documents and range permissions.
In the same way as with Excel, we expect Word ML to be used in a limited
context to create or preserve the formatting of documents; however, most
customers are expected to use the support for customer-defined XML
schemas along with Word ML. The Word XML file format stores the
customer-defined schema information in the same file as Word ML to allow
customers to easily extract the information they need while being able
to easily manage the storage of one physical file.
For example, one could imagine transforming Word
ML to another display-oriented schema. Other useful scenarios include
server-based processes that add Word ML markup to existing text files,
XML files or data for display purposes, transforming any XML document to
a Word ML document, generating Word ML documents without using Word or
creating a Web service that produces documents in Word ML format. All
these are geared toward displaying data in a rich format in Microsoft
Word 2003.
Microsoft provides full documentation of the
Word ML and Spreadsheet ML schema file formats. We expect Word ML and
Spreadsheet ML to grow with each new version of Word and Excel, and
market adoption and feedback will ensure that the file formats continue
to grow to reflect the new functionality and innovations in each
version of the products.
5. XML and the Interoperability of Rich Authoring Tools
Unlike HTML, XML is a meta-language enabling
customers to define their own data-interchange and document file formats,
allowing them to achieve data exchange interoperability in a
heterogeneous environment. XML also permits software vendors to
differentiate their product offerings through innovation in how data is
presented or displayed even as they support data interchange
interoperability through support of customer-defined schema. This
promotes innovation and competition between product offerings, which is
a benefit to customers.
Contrast this with what would happen if there were
only one schema, which controlled both how data interchange occurs and
information on how the data must be presented. Customers would be unable
to define the business-specific organization and display of information
that they needed, and additionally, innovation in presentation and
display of data by vendors offering software products and services would
be inhibited.
In such a case, some of the tags in a document
refer to presentation and display functionalities (such as table editor
features or a page layout editor) that have to be implemented by every
tool in every detail. If a standards body picks a set of presentation
and display tags that are supported by one vendor, it could disadvantage
other vendors who might have presentation and display functionality that
their customers prefer.
An example of a problem that could be encountered
in the above scenario is the following: Assume that one product supports
a presentation and display feature to enable the automatic positioning
of all referenced images at the end of a document. If a user of another
product, which does not support this feature, tries to open a document
that uses this feature, the images will at best be displayed inline in
the document; at worst they will not be displayed at all. When the user
edits the file, he could be changing important paragraphs that are
linked to images he cannot see. The user may not even be aware the
document has been changed because he edited page 1 and the images were
supposed to be on page 4. To the user who originally created the
document, this will seem like document corruption. The vendor of the
second application or the user might be under the impression that the
vendor of the application with the feature has made it "too difficult"
to share files by virtue of its implementing the new feature. This
situation underscores a fundamental tension between the desire for
display and presentation-oriented uniformity, and the competing desire
that users express for new innovative display and presentation features
that provide greater value. As with many complex issues, a solution
likely rests with an understanding that display-oriented consistency
should be pursued to the extent possible without sacrificing or
undermining the software industry's ability to innovate and provide
greater value for its customers.
Despite the challenges inherent in achieving both
data interoperability and display/presentation consistency between
products, there has been amazing progress over the last several years,
both in the ability to achieve rich data exchange without loss of data
context, as well as the ability to achieve the exchange of documents
while preserving increasing levels of display and presentation
consistency. Today, customers of XML and XML Web services create
custom-defined schemas and then use technologies such as XSLT (XSL
transforms) or products such as Microsoft BizTalk server, IBM WebSphere,
WebMethods or BEA WebLogic to implement easy transformations between
custom-defined schemas or display-oriented schemas.
6. The Tradeoffs of Mandating a Standard XML Document Format
Some have debated the merits of establishing a
"standard" document format that would be enforced by government or
legislative mandates, a discussion driven in part by the legitimate
desire on the part of computer users for improved interoperability
between competing products. But efforts to mandate a single,
vendor-neutral data and display format for all documents would stifle
software innovation by causing all software to allow reuse and display
of data according to a "lowest common denominator." This approach would
likely decrease competition over time. It is notable that many open file
formats exist today, but each has followed a history that balances
interoperability and innovation.
Open formats such as HTML, Rich Text Format (RTF) and ASCII each promoted
data interoperability across competing products, but none were ever
expected to achieve both data interoperability and display/presentation
uniformity, nor were they typically used as the default formats for
saving a document. Instead, these formats were only presented as yet
another option for consumers to choose from, to facilitate
interoperability and file exchange at a data level.
The history of HTML explains in part why mandating
a single, vendor-neutral presentation/display format proves limiting in
practice. Since HTML became a standard a few years ago, there has been
essentially no innovation in HTML. A comparable XML presentation/display
standard would slow innovation around XML and create friction for users
of documents, depriving users of the rich data mining potential that XML
offers. Mandating a single standard XML document format would impact
users by restricting their definitions of business-specific data and
allowing them to only employ a small set of document editing features
for display and presentation that are supported by all rich authoring
tools. Tools vendors would not have an incentive to create innovative
document-related software that could help increase organizational
productivity through new display and presentation mechanisms because
users either couldn't take advantage of these new capabilities, or would
find it difficult to do so.
Imagine, for example, if the standard were set to
support only lowest common denominator features. This would sacrifice
value and richness for greater display and presentation uniformity but
the lack of opportunity for vendors to add their unique value might
diminish competition over time as well. On the other hand, an approach
based on using custom-defined schemas enables vendors to add competitive
features to their display- and feature-oriented XML format and enables
customers to use a mapping between their custom-defined schema and each
vendor's full-featured product. Interoperability happens at the
custom-defined data level.
At the other end of the spectrum, standardizing on
a very rich XML file format that represents document display
characteristics can lead to poor user experiences as well. The process
of reading and writing a given XML file format should be fairly easy for
software vendors to do, allowing most to claim "compatibility" with a
file format standard for data interchange purposes. However, it would be
difficult for all the authoring programs to fully support all the
display and presentation features specified by a rich document file
format that went beyond the goal of data interoperability and exchange
and still provide unique, innovative value to the customer, which could again
reduce competition and customer choice over time. One possible outcome
could be that users will have different and inadequate views of
documents depending upon which authoring tool they are using. For
example, users could be faced with document graphics that display at
different places in a document or in different sizes, reviewers'
comments or other types of document notes that don't appear, or compound
document information that doesn't get assembled correctly when pulling
together sections from separate document files. Moreover, in the
authoring process, users are clever about finding a program feature that
enables them to achieve the desired structure or effect they need for
their document; however, they cannot be expected to know which features
of an authoring program might map back to approved file format
characteristics.
As originally envisioned, XML schemas were
not intended as the basis for a single, all-encompassing standard for
file formats. Indeed, XML's greatest promise is in enabling every
organization to define custom schemas that best represent the data
that makes sense for its business. While this assumes a proliferation
of schemas, the fact that each schema is based on XML allows this to
occur without sacrificing interoperability.
In speaking with customers, Microsoft has learned
that for most organizations, capturing the meaning of their data using
custom-defined schemas is more important than having a common
display-oriented document format. With this in mind, most technology
companies today are focused on providing data interoperability.
However, this focus does not mean that software such as Microsoft Office
System 2003 will not provide effective mechanisms for enabling greater
display and presentation consistency as well. For example, a
standard XSLT could be created to always format the data from one agency
to have a common look for published materials, and in fact a different
XSLT could be used to format the data differently depending on the
display medium (paper, computer application screen, Web browser, PDA).
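The idea of one body of data with a different presentation per medium can be sketched in Python. XSLT processing is not part of the Python standard library, so ordinary functions stand in for the style sheets here; this is a substitute chosen for illustration, not XSLT itself. The <Notice>, <Title> and <Body> tags and the medium names are assumptions.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical agency notice; tag names and content are assumed.
NOTICE_XML = (
    "<Notice><Title>Office hours</Title>"
    "<Body>Open 9 to 5 &amp; closed Sundays.</Body></Notice>"
)


def render_web(root):
    """Stand-in for a style sheet that targets a Web browser."""
    title = escape(root.findtext("Title"))
    body = escape(root.findtext("Body"))
    return "<h1>" + title + "</h1><p>" + body + "</p>"


def render_paper(root):
    """Stand-in for a style sheet that targets plain printed text."""
    title = root.findtext("Title")
    body = root.findtext("Body")
    return "\n".join([title.upper(), "=" * len(title), body])


# One renderer per display medium; the data itself never changes.
RENDERERS = {"web": render_web, "paper": render_paper}


def render(xml_text, medium):
    """Format the same XML data for the requested medium."""
    return RENDERERS[medium](ET.fromstring(xml_text))


print(render(NOTICE_XML, "web"))
print(render(NOTICE_XML, "paper"))
```

Adding a further medium means adding one more renderer, or in a real deployment one more XSLT style sheet, while the stored document stays untouched.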
In the public sector, Microsoft and other
industry partners encourage government agencies and parliaments to
specify how their documents should appear by defining as many XSLT style
sheets as are needed, using any display characteristics available on the
market, and more importantly maintaining them, changing them and making
them evolve at will. This approach of enabling public sector agencies to
define exactly how public documents will be displayed is far better than
requiring all customers to generate documents in a static
display-oriented schema that cannot evolve except by the agreement of a
standards committee.
7. The Value of Promoting Multiple Technical Approaches for XML File
Formats
Another example of how a single format for all
documents ("one size fits all") does not help customers in establishing
display-oriented consistency comes when we analyze different approaches
in designing a display-oriented XML schema. The design of the Word ML
schema enables users to store in a single XML file the entire document,
including content, images, table of contents, etc. This enables any
person or program, for example, to send this single XML file to the wide
variety of XML-enabled backend systems, such as content management tools
and business workflow engines.
Sun Microsystems' StarOffice software product and
Word ML use different approaches to define an application-specific,
display-oriented document schema. The format in StarOffice uses many
different XML files (content, styles, settings, etc.) to collectively
define the document. It may prove difficult for individuals to manage a
collection of multiple files such as this, as files can easily get lost
or changed to render the initial document unrecoverable. It is also
difficult for XML developers to understand the rationale for why certain
document characteristics are placed in one XML file and other
characteristics elsewhere. There is no clearly accepted segmentation of
functional document features versus document formatting features.
All the files in a StarOffice document are
traditionally bundled together in a ZIP file. This makes it difficult
for the document information to be reused in the way that XML envisions
since initially a program needs to understand how to unpack the files
before it can access the XML. For example, it should be possible to
transform an XML document to HTML to enable users that do not have the
XML authoring software to seamlessly view the document. However,
standard XML browsers cannot use simple XSLT to transform such a zipped
document to HTML because they must unzip the files first (which is not a
standard XML technique). Also standard Internet-based tools like
scripting engines often cannot run the executable code needed to unzip
files (e.g. on a secure server or locked desktop), but they can safely
parse a text file. XSLT transforms can easily be done with Word ML,
however, because Microsoft's approach toward XML (and the Office file
formats) is to maximize the flexibility and capability that
organizations have for accessing, reusing and sharing XML document
content.
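The additional unpacking step described above can be made concrete with a short Python sketch that compares reading text from a zipped, multi-file package with reading it from a single XML file. The part names (content.xml, styles.xml) and the <document> and <para> tags are assumptions chosen for illustration; they are not taken from either vendor's format documentation.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical single-file document; tag names are assumed.
SINGLE_FILE_XML = "<document><para>Quarterly summary</para></document>"


def build_package():
    """Create an in-memory ZIP holding several XML parts.
    The part names are assumptions for this sketch."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", SINGLE_FILE_XML)
        archive.writestr("styles.xml", "<styles><style name='Body'/></styles>")
    return buffer.getvalue()


def read_packaged_text(package_bytes):
    """Open the archive and select the right part before any XML
    processing can begin."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        xml_bytes = archive.read("content.xml")
    return ET.fromstring(xml_bytes).findtext("para")


def read_single_file_text(xml_text):
    """A single XML file can be handed straight to the parser."""
    return ET.fromstring(xml_text).findtext("para")


print(read_packaged_text(build_package()))
print(read_single_file_text(SINGLE_FILE_XML))
```

Both routes yield the same text. The difference is that the packaged route needs a tool that can open archives and that knows which part to read, while the single-file route needs only an XML parser.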
Despite these apparent drawbacks, there are
undoubtedly reasons that Sun's development team pursued this design path
and believes it offers differentiation its customers are interested in.
Perhaps the very ability to separate elements of the document can speed
transmission by allowing only a necessary portion of a file to be sent
across a network. Or perhaps Sun's customers have prized the ability to
store files in a way that uses less memory. Regardless of the reasons,
the fact that Microsoft and Sun have pursued different paths has
increased customer choice without sacrificing data interoperability
between Sun and Microsoft products. While each approach involves pluses
and minuses, the ability to pursue different designs spurs innovation
and allows customers to choose products that best meet their needs. It
is this fact that best captures why a single mandated standard for XML
file formats would likely prove detrimental in the long run: if such a
standard were in place, customers would not have the benefit of this
innovation and choice.
8. Conclusion
As organizations around the world begin to embrace the promise of
XML, there will be a significant need to engage in dialogue between the
technology industry, governments, parliaments, and the many
organizations that hope to deploy this technology. While different
entities will ultimately choose different paths, the collective interest
in interoperability and innovation will require significant
collaboration. For their part, governments and parliaments have an
opportunity to create custom schemas as one means to advance data
exchange interoperability. In addition, popular XML techniques based on
transformations (e.g. the W3C XSLT standard) enable richer document
display and data exchange interoperability, where necessary, between
public sector documents, authoring tools, back-ends and databases.