关于OFFICE 2010 文件验证的相关知识

lrlxxqxa · 发表于 2010-11-18 14:34

   Office 安全团队负责测试 Office 文件验证（代码名称：网关守卫）。人们对 Microsoft Office 2010 的新文件验证功能存在一些误解，在此我希望消除这些误解，同时说明原因及具体操作。

一、什么是Office 2010 文件验证？

   Office文件验证是一种防范功能。主要是检查应用程序即解析文件的具体位数，如果确定是有效文件则会按正常方式打开，如果发现文件无效则会发送到（默认情况）受保护的视图。

二、什么是网关守卫？

   Office 文件验证是一种功能，该功能最初是在 Publisher 2007 中引入的，用来验证 Publisher 的PUB 文件。它可验证特定二进制文件是否符合应用程序的要求。在 Office 2010 中有效扩展了该功能，以包含 Word、Excel 和 PowerPoint 的二进制格式。请注意，该功能仅适用于二进制格式（即PUB、DOC、XLS、PPT 等），而不能验证基于 XML 的文档（即 DOCX、XLSX、PPT等），也不能验证宏或其他自定义项目。它可验证的对象是文件结构。

三、为什么要进行验证？

   一直以来，Office 二进制格式的标准不断演化，复杂性不断增加。对于这些格式复杂的原因，在其他地方已详细讨论过（请参阅2、3楼的 Joel Spolsky 的文章），因此在此不作探讨，但此处完好记录了这些二进制格式。恶意攻击者将二进制文件用作攻击手段来感染目标用户，因此我们希望找到一种阻止此问题发生的方法。网关守卫采取的一项措施是：当 Microsoft 收到一个新的 Office 文件格式攻击的报告时，就会通过验证进行检查，以了解这项措施的效果。到目前为止，效果相当不错！

四、如何控制该验证？

1、通过策略
   利用组策略来控制文件验证失败时的默认操作。这些策略位于组策略模板中应用程序的“选项”\“安全性”\“信任中心”\“受保护的视图”下，它视应用程序设置而定。

2、通过注册表
   有以下几个用来控制 Office 文件验证的各个方面的注册表项。

常用注册表项
HKCU\Software\Microsoft\Office\14.0\Common\Security\FileValidation \ReportingInterval
- 这是一个 DWORD，它可控制显示将文件发送到 Windows 错误报告的对话框的间隔天数。
HKCU\Software\Microsoft\Office\14.0\Common\Security\FileValidation\DisableReporting
- 这是一个 DWORD，如果设置为 1，则禁止显示对话框（因此禁止向 Windows 错误报告发送文件）。

应用程序特定注册表项
适用于我将使用“Excel”的这些示例，但对“PowerPoint”和“Word”同样有效
HKCU\Software\Microsoft\Office\14.0\Excel\Security\FileValidation\EnableOnLoad
– 这是一个 DWORD，如果设置为 0，则 Office 不会验证文件。
HKCU\Software\Microsoft\Office\14.0\Excel\Security\FileValidation\DisableEditFromPV
– 这是一个 DWORD，如果设置为 1，则禁止编辑验证失败的文件。

Excel 特定注册表项
HKCU\Software\Microsoft\Office\14.0\Excel\Security\FileValidation\PivotOptions
– 这是一个 DWORD，可控制验证文件的透视表缓存（由于性能原因）的特定选项。

0 = 从不验证任何透视表缓存

1 = 在以下情况下验证透视表缓存：
(1) 文件是从 Internet 打开的，并且该平台在本地将文件标记为来自 Internet。
(2) 文件是 Microsoft Outlook 电子邮件附件。
(3) 用户专门在受保护的视图中打开文件。
(4) 从缓存 Internet 内容的本地已知“非安全位置”以及任何特殊用户定义的不受信任位置打开文件，除非通过（其他）注册表项禁用受保护的视图的非安全位置。
(5) 在加载时打开文件并解析透视表缓存。

2 = 始终验证所有透视表缓存

五、Office 2010 文件验证的时间：

   有了这项防范功能，打开一个文件需要更长的时间，但我们通常只需多花1-100毫秒便可完成验证。当然，如果文件很大非常复杂，那么确实需要超过一秒的时间才能完成验证。

对于验证时间很长的任何文件（如果文件通过验证、未通过验证或跳过验证），系统会显示与有问题的文件相同的 Windows 错误报告提示；向用户提供选项，以便将文件发送给微软进行进一步分析。

lrlxxqxa · 发表于 2010-11-18 14:41

引自Joel on Software
http://www.joelonsoftware.com/items/2008/02/19.html

Why are the Microsoft Office file formats so complicated? (And some workarounds)
by Joel Spolsky

Last week, Microsoft published the binary file formats for Office. These formats appear to be almost completely insane. The Excel 97-2003 file format is a 349 page PDF file. But wait, that’s not all there is to it! This document includes the following interesting comment:

Each Excel workbook is stored in a compound file.
You see, Excel 97-2003 files are OLE compound documents, which are, essentially, file systems inside a single file. These are sufficiently complicated that you have to read another 9 page spec to figure that out. And these “specs” look more like C data structures than what we traditionally think of as a spec. It's a whole hierarchical file system.

If you started reading these documents with the hope of spending a weekend writing some spiffy code that imports Word documents into your blog system, or creates Excel-formatted spreadsheets with your personal finance data, the complexity and length of the spec probably cured you of that desire pretty darn quickly. A normal programmer would conclude that Office’s binary file formats:

are deliberately obfuscated
are the product of a demented Borg mind
were created by insanely bad programmers
and are impossible to read or create correctly.
You’d be wrong on all four counts. With a little bit of digging, I’ll show you how those file formats got so unbelievably complicated, why it doesn’t reflect bad programming on Microsoft’s part, and what you can do to work around it.

The first thing to understand is that the binary file formats were designed with very different design goals than, say, HTML.

They were designed to be fast on very old computers. For the early versions of Excel for Windows, 1 MB of RAM was a reasonable amount of memory, and an 80386 at 20 MHz had to be able to run Excel comfortably. There are a lot of optimizations in the file formats that are intended to make opening and saving files much faster:

These are binary formats, so loading a record is usually a matter of just copying (blitting) a range of bytes from disk to memory, where you end up with a C data structure you can use. There’s no lexing or parsing involved in loading a file. Lexing and parsing are orders of magnitude slower than blitting.
The file format is contorted, where necessary, to make common operations fast. For example, Excel 95 and 97 have something called “** Save” which they use sometimes as a faster variation on the OLE compound document format, which just wasn’t fast enough for mainstream use. Word had something called Fast Save. To save a long document quickly, 14 out of 15 times, only the changes are appended to the end of the file, instead of rewriting the whole document from scratch. On the hard drives of the day, this meant saving a long document took one second instead of thirty. (It also meant that deleted data in a document was still in the file. This turned out to be not what people wanted.)
They were designed to use libraries. If you wanted to write a from-scratch binary importer, you’d have to support things like the Windows Metafile Format (for drawing things) and OLE Compound Storage. If you’re running on Windows, there’s library support for these that makes it trivial... using these features was a shortcut for the Microsoft team. But if you’re writing everything on your own from scratch, you have to do all that work yourself.

Office has extensive support for compound documents, for example, you can embed a spreadsheet in a Word document. A perfect Word file format parser would also have to be able to do something intelligent with the embedded spreadsheet.

They were not designed with interoperability in mind. The assumption, and a fairly reasonable one at the time, was that the Word file format only had to be read and written by Word. That means that whenever a programmer on the Word team had to make a decision about how to change the file format, the only thing they cared about was (a) what was fast and (b) what took the fewest lines of code in the Word code base. The idea of things like SGML and HTML—interchangeable, standardized file formats—didn’t really take hold until the Internet made it practical to interchange documents in the first place; this was a decade later than the Office binary formats were first invented. There was always an assumption that you could use importers and exporters to exchange documents. In fact Word does have a format designed for easy interchange, called RTF, which has been there almost since the beginning. It’s still 100% supported.

They have to reflect all the complexity of the applications. Every checkbox, every formatting option, and every feature in Microsoft Office has to be represented in file formats somewhere. That checkbox in Word’s paragraph menu called “Keep With Next” that causes a paragraph to be moved to the next page if necessary so that it’s on the same page as the paragraph after it? That has to be in the file format. And that means if you want to implement a perfect Word clone than can correctly read Word documents, you have to implement that feature. If you’re creating a competitive word processor that has to load Word documents, it may only take you a minute to write the code to load that bit from the file format, but it might take you weeks to change your page layout algorithm to accommodate it. If you don’t, customers will open their Word files in your clone and all the pages will be messed up.

They have to reflect the history of the applications. A lot of the complexities in these file formats reflect features that are old, complicated, unloved, and rarely used. They’re still in the file format for backwards compatibility, and because it doesn’t cost anything for Microsoft to leave the code around. But if you really want to do a thorough and complete job of parsing and writing these file formats, you have to redo all that work that some intern did at Microsoft 15 years ago. The bottom line is that there are thousands of developer years of work that went into the current versions of Word and Excel, and if you really want to clone those applications completely, you’re going to have to do thousands of years of work. A file format is just a concise summary of all the features an application supports.

Just for kicks, let’s look at one tiny example in depth. An Excel worksheet is a bunch of BIFF records of different types. I want to look at the very first BIFF record in the spec. It’s a record called 1904.

The Excel file format specification is remarkably obscure about this. It just says that the 1904 record indicates “if the 1904 date system is used.” Ah. A classic piece of useless specification. If you were a developer working with the Excel file format, and you found this in the file format specification, you might be justified in concluding that Microsoft is hiding something. This piece of information does not give you enough information. You also need some outside knowledge, which I’ll fill you in on now. There are two kinds of Excel worksheets: those where the epoch for dates is 1/1/1900 (with a leap-year bug deliberately created for 1-2-3 compatibility that is too boring to describe here), and those where the epoch for dates is 1/1/1904. Excel supports both because the first version of Excel, for the Mac, just used that operating system’s epoch because that was easy, but Excel for Windows had to be able to import 1-2-3 files, which used 1/1/1900 for the epoch. It’s enough to bring you to tears. At no point in history did a programmer ever not do the right thing, but there you have it.

Both 1900 and 1904 file types are commonly found in the wild, usually depending on whether the file originated on Windows or Mac. Converting from one to another silently can cause data integrity errors, so Excel won’t change the file type for you. To parse Excel files you have to handle both. That’s not just a matter of loading this bit from the file. It means you have to rewrite all of your date display and parsing code to handle both epochs. That would take several days to implement, I think.

Indeed, as you work on your Excel clone, you'll discover all kinds of subtle details about date handling. When does Excel convert numbers to dates? How does the formatting work? Why is 1/31 interpreted as January 31 of this year, while 1/50 is interpreted as January 1st, 1950? All of these subtle bits of behavior cannot be fully documented without writing a document that has the same amount of information as the Excel source code.

And this is only the first of hundreds of BIFF records you have to handle, and one of the **st. Most of them are complicated enough to reduce a grown programmer to tears.

The only possible conclusion is this. It's very helpful of Microsoft to release the file formats for Microsoft and Office, but it's not really going to make it any easier to import or save to the Office file formats. These are insanely complex and rich applications, and you can’t just implement the most popular 20% and expect 80% of the people to be happy. The binary file specification is, at most, going to save you a few minutes reverse engineering a remarkably complex system.

OK, I promised some workarounds. The good news is that for almost all common applications, trying to read or write the Office binary file formats is the wrong decision. There are two major alternatives you should seriously consider: letting Office do the work, or using file formats that are easier to write.

Let Office do the heavy work for you. Word and Excel have extremely complete object models, available via COM Automation, which allow you to programmatically do anything. In many situations, you are better off reusing the code inside Office rather than trying to reimplement it. Here are a few examples.

You have a web-based application that’s needs to output existing Word files in PDF format. Here’s how I would implement that: a few lines of Word VBA code loads a file and saves it as a PDF using the built in PDF exporter in Word 2007. You can call this code directly, even from ASP or ASP.NET code running under IIS. It’ll work. The first time you launch Word it’ll take a few seconds. The second time, Word will be kept in memory by the COM subsystem for a few minutes in case you need it again. It’s fast enough for a reasonable web-based application.
Same as above, but your web hosting environment is Linux. Buy one Windows 2003 server, install a fully licensed copy of Word on it, and build a little web service that does the work. Half a day of work with C# and ASP.NET.
Same as above, but you need to scale. Throw a load balancer in front of any number of boxes that you built in step 2. No code required.
This kind of approach would work for all kinds of common Office types of applications you might perform on your server. For example:

Opening an Excel workbook, storing some data in input cells, recalculating, and pulling some results out of output cells
Using Excel to generate charts in GIF format
Pulling just about any kind of information out of any kind of Excel worksheet without spending a minute thinking about file formats
Converting Excel file formats to CSV tabular data (another approach is to use Excel ODBC drivers to ** data out using SQL queries).
Editing Word documents
Filling out Word forms
Converting files between any of the many file formats supported by Office (there are importers for dozens of word processor and spreadsheet formats)
In all of these cases, there are ways to tell the Office objects that they’re not running interactively, so they shouldn’t bother updating the screen and they shouldn’t prompt for user input. By the way, if you go this route, there are a few gotchas, and it's not officially supported by Microsoft, so read their knowledge base article before you get started.

Use a **r format for writing files. If you merely have to produce Office documents programmatically, there’s almost always a better format than the Office binary formats that you can use which Word and Excel will open happily, without missing a beat.

If you simply have to produce tabular data for use in Excel, consider CSV.
If you really need worksheet calculation features that CSV doesn’t support, the WK1 format (Lotus 1-2-3) is a heck of a lot **r than Excel, and Excel will open it fine.
If you really, really have to generate native Excel files, find an extremely old version of Excel… Excel 3.0 is a good choice, before all the compound document stuff, and save a minimum file containing only the exact features you want to use. Use this file to see the exact minimum BIFF records that you have to output and just focus on that part of the spec.
For Word documents, consider writing HTML. Word will open those fine, too.
If you really want to generate fancy formatted Word documents, your best bet is to create an RTF document. Everything that Word can do can be expressed in RTF, but it’s a text format, not binary, so you can change things in the RTF document and it’ll still work. You can create a nicely formatted document with placeholders in Word, save as RTF, and then using ** text substitution, replace the placeholders on the fly. Now you have an RTF document that every version of Word will open happily.
Anyway, unless you’re literally trying to create a competitor to Office that can read and write all Office files perfectly, in which case, you’ve got thousands of years of work cut out for you, chances are that reading or writing the Office binary formats is the most labor intensive way to solve whatever problem it is that you’re trying to solve.

lrlxxqxa · 发表于 2010-11-18 14:42

引自MSDN Library
http://msdn.microsoft.com/en-us/library/cc313118.aspx

The Microsoft Office file formats documentation provides detailed technical specifications for Microsoft Office file formats, including the binary file formats as well as the standards-based file formats, ODF and Open XML (ECMA 376 and IS 29500). The binary file format documentation is designed to describe each such file format in detail as it is used by these applications to create documents. The ODF and Open XML documentation is designed to describe how Microsoft has implemented those standards in Microsoft Word, Microsoft Excel, and Microsoft PowerPoint. It is intended for use in conjunction with the publicly available standard specifications, computer language design and implementation art. It assumes that the reader either is familiar with this material or has access to it. The ODF and Open XML documentation provides the following levels of audience support:

Kevin · 发表于 2010-12-6 10:56

综合评分：3分

dawin · 发表于 2011-1-11 13:03

小锐你太强大了。。E文也那么好啊。连周总都过来捧场。
不过我看不懂这个文章。

kiyukawa · 发表于 2011-2-8 21:14

实在是太强大，英文版实在是复杂！！

liuyang1985116 · 发表于 2011-7-29 20:40

学习一下，虽然不太懂

lingzi11 · 发表于 2011-10-10 23:49

真够牛的哪

金华秀儿 · 发表于 2011-10-23 18:17

果然很强大啊。

		自动登录	找回密码
密码			免费注册