1. Introduction

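DevSecOps has become an important model for building enterprise-grade secure development. Static analysis tools integrated into the DevSecOps development process play an important role in raising a product's overall security level. To maximize the coverage of their security checks, development teams usually bring in several security scanners. This, however, creates more problems for developers and platforms. To reduce the cost and complexity of aggregating the results of various analysis tools into a common workflow, the industry has begun adopting the Static Analysis Results Interchange Format (SARIF). This article is the advanced part of a two-part series (introductory and advanced) on applying SARIF, and covers how SARIF meets deeper requirements in practice. For a basic introduction to SARIF, see "The Bridge Between DevSecOps Tools and Platforms: Getting Started with SARIF" (《DevSecOps工具与平台间交互的桥梁–SARIF入门》).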

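2. Advanced SARIF

The previous article covered some basic uses of SARIF. Here we look at how SARIF is applied in more complex scenarios, which is what it takes to give a static analysis tool a complete reporting solution.

Coverity, a well-known static analysis tool, added support in its 2021.03 release for displaying Coverity scan results in SARIF format in GitHub repositories. Coverity, too, has thus adapted to the SARIF format.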

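2.1. Using metadata

To keep a scan report from growing too large, information that is used repeatedly should be extracted as metadata: for example rules, rule messages, and the scanned content.

In the example below, the rule and its message are defined in tool.driver.rules. Each entry in results refers to the rule by ruleId (and ruleIndex) and to the message text by message.id. This avoids repeating the same text for every alert the rule produces and effectively shrinks the report.

The SARIF log for this example: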

{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "rules": [
            {
              "id": "CS0001",
              "messageStrings": {
                "default": {
                  "text": "This is the message text. It might be very long."
                }
              }
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CS0001",
          "ruleIndex": 0,
          "message": {
            "id": "default"
          }
        }
      ]
    }
  ]
}
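How a consumer might resolve such a result back to its message text can be sketched in Python. This is a minimal sketch rather than part of any SARIF SDK: the function name resolve_message is hypothetical, and it assumes the rule lives in tool.driver.rules as in the example above.

```python
def resolve_message(run, result):
    """Return the message text for `result`, looking it up in the rule
    metadata when the result only carries a message id (hypothetical helper)."""
    message = result["message"]
    if "text" in message:
        # The result carries its own text, so no lookup is needed.
        return message["text"]

    rules = run["tool"]["driver"].get("rules", [])
    index = result.get("ruleIndex", -1)
    if 0 <= index < len(rules):
        rule = rules[index]  # fast path: direct index into the rules array
    else:
        # Fall back to searching the rules by ruleId.
        rule = next((r for r in rules if r["id"] == result.get("ruleId")), None)
    if rule is None:
        raise LookupError("rule metadata not found for result")
    return rule["messageStrings"][message["id"]]["text"]


# The run from the example above, reduced to the parts used here.
run = {
    "tool": {"driver": {"name": "CodeScanner", "rules": [
        {"id": "CS0001", "messageStrings": {
            "default": {"text": "This is the message text. It might be very long."}}}
    ]}},
    "results": [
        {"ruleId": "CS0001", "ruleIndex": 0, "message": {"id": "default"}}
    ],
}

print(resolve_message(run, run["results"][0]))
```

However many results refer to CS0001, the long text is stored only once in the log.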

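2.2. Using message arguments

An alert often needs to name the specific variable or function involved in the code problem so that the user can understand it. Message arguments make this possible by letting the defect message vary from result to result.

In the example below, the rule's message is a template containing the placeholder "{0}", and each entry in results supplies the corresponding values through the arguments array: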

{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "rules": [
            {
              "id": "CS0001",
              "messageStrings": {
                "default": {
                  "text": "Variable '{0}' was used without being initialized."
                }
              }
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CS0001",
          "ruleIndex": 0,
          "message": {
            "id": "default",
            "arguments": [
              "x"
            ]
          }
        }
      ]
    }
  ]
}
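The substitution a viewer performs can be sketched as follows. The helper name format_message is hypothetical. The sketch assumes placeholders of the form {n} that index into arguments, and it treats doubled braces as escaped literal braces, which is how we read the SARIF specification.

```python
import re

# Matches an escaped brace pair or a numeric placeholder such as {0}.
_TOKEN = re.compile(r"\{\{|\}\}|\{(\d+)\}")


def format_message(template, arguments):
    """Fill {n} placeholders in `template` from the `arguments` list
    (hypothetical helper, not an SDK function)."""
    def substitute(match):
        token = match.group(0)
        if token == "{{":
            return "{"
        if token == "}}":
            return "}"
        # A missing argument raises IndexError, which flags a broken log.
        return arguments[int(match.group(1))]

    return _TOKEN.sub(substitute, template)


template = "Variable '{0}' was used without being initialized."
print(format_message(template, ["x"]))
```

A regular expression is used instead of str.format so that any other text in the template is passed through untouched.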

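2.3. Using related locations in messages

Sometimes, to explain why an alert was raised, the user needs additional reference information, such as where a variable is defined, where tainted data enters the system, or other supporting details.

In the example below, relatedLocations, which accompanies the location of the problem itself (locations), gives the point where the tainted data enters the system. In a viewer such as the SARIF Viewer extension for VS Code, clicking "here" in the message jumps to the location where the variable expr was introduced.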

{
  "ruleId": "PY2335",
  "message": {
    "text": "Use of tainted variable 'expr' (which entered the system [here](1)) in the insecure function 'eval'."
  },
  "locations": [
    {
      "physicalLocation": {
        "artifactLocation": {
          "uri": "3-Beyond-basics/bad-eval.py"
        },
        "region": {
          "startLine": 4
        }
      }
    }
  ],
  "relatedLocations": [
    {
      "id": 1,
      "message": {
        "text": "The tainted data entered the system here."
      },
      "physicalLocation": {
        "artifactLocation": {
          "uri": "3-Beyond-basics/bad-eval.py"
        },
        "region": {
          "startLine": 3
        }
      }
    }
  ]
}
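To make the link mechanism concrete, the sketch below finds each embedded link of the form [text](id) in the message and maps it to the relatedLocations entry with the same id. The function name resolve_embedded_links is hypothetical, and the sketch only handles numeric link targets like the one in the example above.

```python
import re

# Matches an embedded link such as [here](1): link text, then a numeric id.
_LINK = re.compile(r"\[([^\]]+)\]\((\d+)\)")


def resolve_embedded_links(result):
    """Return (link text, uri, start line) for each embedded link whose id
    matches an entry in relatedLocations (hypothetical helper)."""
    by_id = {loc["id"]: loc
             for loc in result.get("relatedLocations", []) if "id" in loc}
    links = []
    for match in _LINK.finditer(result["message"]["text"]):
        target = by_id.get(int(match.group(2)))
        if target is None:
            continue  # dangling link: nothing to jump to
        phys = target["physicalLocation"]
        links.append((match.group(1),
                      phys["artifactLocation"]["uri"],
                      phys["region"]["startLine"]))
    return links


# The result from the example above, reduced to the parts used here.
result = {
    "message": {"text": "Use of tainted variable 'expr' (which entered the "
                        "system [here](1)) in the insecure function 'eval'."},
    "relatedLocations": [{
        "id": 1,
        "physicalLocation": {
            "artifactLocation": {"uri": "3-Beyond-basics/bad-eval.py"},
            "region": {"startLine": 3},
        },
    }],
}

print(resolve_embedded_links(result))
```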

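2.4. Using defect classification

Classifying defects matters both for the tool and for analyzing scan results. A tool can organize its rules by defect category so that users can pick the rules they need; users reading an analysis report can likewise filter results quickly by category. A tool can follow an industry standard, such as the widely used Common Weakness Enumeration (CWE), or define its own categories. SARIF supports both.

An example of defect classification: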

{
  "version": "2.1.0",
  "runs": [
    {
      "taxonomies": [
        {
          "name": "CWE",
          "version": "3.2",
          "releaseDateUtc": "2019-01-03",
          "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82",
          "informationUri": "https://cwe.mitre.org/data/published/cwe_v3.2.pdf/",
          "downloadUri": "https://cwe.mitre.org/data/xml/cwec_v3.2.xml.zip",
          "organization": "MITRE",
          "shortDescription": {
            "text": "The MITRE Common Weakness Enumeration"
          },
          "taxa": [
            {
              "id": "401",
              "guid": "10F28368-3A92-4396-A318-75B9743282F6",
              "name": "Memory Leak",
              "shortDescription": {
                "text": "Missing Release of Memory After Effective Lifetime"
              },
              "defaultConfiguration": {
                "level": "warning"
              }
            }
          ],
          "isComprehensive": false
        }
      ],
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "supportedTaxonomies": [
            {
              "name": "CWE",
              "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
            }
          ],
          "rules": [
            {
              "id": "CA2101",
              "shortDescription": {
                "text": "Failed to release dynamic memory."
              },
              "relationships": [
                {
                  "target": {
                    "id": "401",
                    "guid": "10F28368-3A92-4396-A318-75B9743282F6",
                    "toolComponent": {
                      "name": "CWE",
                      "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
                    }
                  },
                  "kinds": [
                    "superset"
                  ]
                }
              ]
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CA2101",
          "message": {
            "text": "Memory allocated in variable 'p' was not released."
          },
          "taxa": [
            {
              "id": "401",
              "guid": "10F28368-3A92-4396-A318-75B9743282F6",
              "toolComponent": {
                "name": "CWE",
                "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
              }
            }
          ]
        }
      ]
    }
  ]
}

2.4.1. Bringing in an industry taxonomy (runs.taxonomies)

The schema definition of taxonomies:

  "taxonomies": {
    "description": "An array of toolComponent objects relevant to a taxonomy in which results are categorized.",
    "type": "array",
    "minItems": 0,
    "uniqueItems": true,
    "default": [],
    "items": {
      "$ref": "#/definitions/toolComponent"
    }
  },

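The taxonomies node is an array, so more than one taxonomy can be declared. Its items reference the toolComponent definition under definitions, the same definition used for the analysis engine (tool.driver) and tool extensions (tool.extensions) described earlier. The reason for this design is that the engine and its results are closely related, and sharing one definition keeps their properties consistent.

Defining a standard taxonomy
The example declares the industry taxonomy CWE through the runs.taxonomies node. The properties of the taxonomies entry describe the standard. Only a sample is listed here; see the SARIF specification for details:
- name: the name of the standard;
- version: its version;
- releaseDateUtc: its release date;
- guid: a unique identifier, so that other parts of the log can reference this standard;
- informationUri: where the standard's documentation can be found;
- downloadUri: where it can be downloaded;
- organization: the organization that publishes it;
- shortDescription: a short description of the standard.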

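2.4.2. Bringing in custom categories (runs.taxonomies.taxa)

taxa is an array. To keep the report small, there is no need to list every category of the taxonomy under taxa; listing only the categories relevant to this scan is enough. That is also why isComprehensive, which states whether the list is complete, defaults to false.

The example uses taxa to bring in the one category the tool needs, CWE-401 (memory leak), and identifies it uniquely by guid and id so that rules or results can reference it later.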

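2.4.3. Associating the tool with a taxonomy (tool.driver.supportedTaxonomies)

The tool object is linked to the declared taxonomies through tool.driver.supportedTaxonomies. Its array elements are toolComponentReference objects, because a taxonomy is itself a toolComponent object. The toolComponentReference.guid property matches the guid of a taxonomy object defined in run.taxonomies[].

In the example, the supportedTaxonomies entry has the name "CWE", which states that the tool supports the CWE taxonomy, and it references the guid of taxonomies[0], A9282C88-F1FE-4A01-8137-E8D2A037AB82, which ties it to the industry taxonomy CWE.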
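Because the link is just a GUID match, a log consumer can check it mechanically. The sketch below lists every supportedTaxonomies entry whose guid has no counterpart in run.taxonomies. The function name unresolved_supported_taxonomies is hypothetical, and the sketch matches on guid only.

```python
def unresolved_supported_taxonomies(run):
    """Return the names of supportedTaxonomies entries whose guid does not
    match any taxonomy declared in run.taxonomies (hypothetical helper)."""
    declared = {t["guid"].lower()
                for t in run.get("taxonomies", []) if "guid" in t}
    missing = []
    for ref in run["tool"]["driver"].get("supportedTaxonomies", []):
        guid = ref.get("guid")
        # GUIDs are compared case-insensitively.
        if guid is None or guid.lower() not in declared:
            missing.append(ref.get("name", "<unnamed>"))
    return missing


# The run from the example above, reduced to the parts used here.
run = {
    "taxonomies": [
        {"name": "CWE", "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"}
    ],
    "tool": {"driver": {
        "name": "CodeScanner",
        "supportedTaxonomies": [
            {"name": "CWE", "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"}
        ],
    }},
}

print(unresolved_supported_taxonomies(run))
```

An empty list means every taxonomy the tool claims to support is actually declared in the run.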

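2.5. Associating rules with defect categories (rule.relationships)

Rules are defined under tool.driver.rules, an array whose elements are reportingDescriptor objects;
Each rule (reportingDescriptor) has a relationships array. Each element is a reportingDescriptorRelationship object, which establishes a relationship from this rule to another reportingDescriptor object. The target of the relationship can be a taxon in a taxonomy (as in this example) or another rule in another tool component;
The target property of a relationship (reportingDescriptorRelationship) identifies the target of the relationship. Its value is a reportingDescriptorReference object, which refers to a reportingDescriptor inside a toolComponent;
The toolComponent property of the reportingDescriptorReference object is a toolComponentReference object that points to the taxonomy listed in the tool's supportedTaxonomies.

The following figure shows how the rule and the defect category in the example are linked: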

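2.5.1. Categories in scan results (result.taxa)

Each result in run.results has a taxa property. taxa is an array, and each element is a reportingDescriptorReference object that specifies a category for the defect, in the same way a rule points to its categories. It follows that taxa can be omitted from a result, with the defect's category derived through its rule instead.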
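The lookup chain described in 2.5 and 2.5.1 can be sketched as follows: use result.taxa when present, otherwise fall back to the relationships of the result's rule, then resolve each reference against run.taxonomies. The function names find_taxon and categories_for are hypothetical. The sketch matches the taxonomy by toolComponent.guid and the taxon by id only, and does not handle references that identify their target by index or by guid alone.

```python
def find_taxon(run, ref):
    """Resolve a reportingDescriptorReference to (taxonomy, taxon), or None."""
    wanted = ref["toolComponent"]["guid"].lower()
    for taxonomy in run.get("taxonomies", []):
        if taxonomy.get("guid", "").lower() != wanted:
            continue
        for taxon in taxonomy.get("taxa", []):
            if taxon["id"] == ref["id"]:
                return taxonomy, taxon
    return None


def categories_for(run, result):
    """Return labels such as 'CWE-401: Memory Leak' for a result."""
    refs = result.get("taxa")
    if not refs:
        # No taxa on the result: derive the categories through its rule.
        rules = run["tool"]["driver"].get("rules", [])
        rule = next((r for r in rules if r["id"] == result.get("ruleId")), None)
        refs = [rel["target"] for rel in rule.get("relationships", [])] if rule else []
    labels = []
    for ref in refs:
        found = find_taxon(run, ref)
        if found:
            taxonomy, taxon = found
            labels.append(f'{taxonomy["name"]}-{taxon["id"]}: {taxon["name"]}')
    return labels


TAXONOMY_GUID = "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
cwe_401 = {"id": "401", "toolComponent": {"name": "CWE", "guid": TAXONOMY_GUID}}
run = {
    "taxonomies": [{"name": "CWE", "guid": TAXONOMY_GUID,
                    "taxa": [{"id": "401", "name": "Memory Leak"}]}],
    "tool": {"driver": {"name": "CodeScanner", "rules": [
        {"id": "CA2101",
         "relationships": [{"target": cwe_401, "kinds": ["superset"]}]}
    ]}},
    # This result omits taxa, so the category comes from rule CA2101.
    "results": [{"ruleId": "CA2101", "message": {"text": "..."}}],
}

print(categories_for(run, run["results"][0]))
```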

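2.6. Code Flow

Some tools detect problems by simulating program execution, sometimes across multiple threads of execution. SARIF represents the simulated execution as a sequence of locations, called a code flow. A SARIF code flow contains one or more thread flows, each of which describes a chronologically ordered sequence of code locations on a single thread of execution.

2.6.1. The code flows of a result (result.codeFlows)

Because a defect may have more than one code flow, the optional result.codeFlows property is an array of codeFlow objects.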

   "result": {
      "description": "A result produced by an analysis tool.",
      "additionalProperties": false,
      "type": "object",
      "properties": {

        ... ...
        "codeFlows": {
          "description": "An array of 'codeFlow' objects relevant to the result.",
          "type": "array",
          "minItems": 0,
          "uniqueItems": false,
          "default": [],
          "items": {
            "$ref": "#/definitions/codeFlow"
          }
        },
      }
   }

2.6.2. The thread flows of a code flow (codeFlow.threadFlows)

As the codeFlow definition shows, each code flow consists of an array of thread flows (threadFlows), and threadFlows is required.

    "codeFlow": {
      "description": "A set of threadFlows which together describe a pattern of code execution relevant to detecting a result.",
      "additionalProperties": false,
      "type": "object",
      "properties": {

        "message": {
          "description": "A message relevant to the code flow.",
          "$ref": "#/definitions/message"
        },

        "threadFlows": {
          "description": "An array of one or more unique threadFlow objects, each of which describes the progress of a program through a thread of execution.",
          "type": "array",
          "minItems": 1,
          "uniqueItems": false,
          "items": {
            "$ref": "#/definitions/threadFlow"
          }
        },
      },

      "required": [ "threadFlows" ]
    },

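2.6.3. Thread flow (threadFlow) and thread flow location (threadFlowLocation)

In each thread flow (threadFlow), an array of locations (locations) describes how the tool walked through the code during analysis.

Definition of threadFlow: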

    "threadFlow": {
      "description": "Describes a sequence of code locations that specify a path through a single thread of execution such as an operating system or fiber.",
      "type": "object",
      "additionalProperties": false,
      "properties": {

        "id": {
        ...

        "message": {
        ...  

        "initialState": {
        ...

        "immutableState": {
        ...

        "locations": {
          "description": "A temporally ordered array of 'threadFlowLocation' objects, each of which describes a location visited by the tool while producing the result.",
          "type": "array",
          "minItems": 1,
          "uniqueItems": false,
          "items": {
            "$ref": "#/definitions/threadFlowLocation"
          }
        },

        "properties": {
        ...
      },

      "required": [ "locations" ]
    },

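Definition of threadFlowLocation:
Each element of locations is a threadFlowLocation, which represents one code location the tool visited. The location itself is given by its location property, which is of type location. A location can carry both physical and logical location information, so a codeFlow can also represent the analysis flow of a binary.

threadFlowLocation also has a state property, which can store the values of variables and expressions or symbol table information, or be used to describe a state machine.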

    "threadFlowLocation": {
      "description": "A location visited by an analysis tool while simulating or monitoring the execution of a program.",
      "additionalProperties": false,
      "type": "object",
      "properties": {

        "index": {
          "description": "The index within the run threadFlowLocations array.",
        ...
        
        "location": {
          "description": "The code location.",
          "$ref": "#/definitions/location"
        },

        "state": {
          "description": "A dictionary, each of whose keys specifies a variable or expression, the associated value of which represents the variable or expression value. For an annotation of kind 'continuation', for example, this dictionary might hold the current assumed values of a set of global variables.",
          "type": "object",
          "additionalProperties": {
            "$ref": "#/definitions/multiformatMessageString"
          }
        },
        ...
      }
    },

2.6.4. Code flow example

Sample code

1. # 3-Beyond-basics/bad-eval-with-code-flow.py
2.
3. print("Hello, world!")
4. expr = input("Expression> ")
5. use_input(expr)
6. 
7. def use_input(raw_input):
8.    print(eval(raw_input))

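The code above is an example of code injection in Python.

On line 4, the input is assigned to the variable expr;
On line 5, expr is passed as the first argument into the function use_input;
On line 8, print outputs the result, but the argument is first evaluated by eval(). Because the input reaches eval without any validation, this can introduce a code injection vulnerability.

The scan result below captures this analysis, which helps the user understand how the problem arises.

Scan result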

{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "PythonScanner"
        }
      },
      "results": [
        {
          "ruleId": "PY2335",
          "message": {
            "text": "Use of tainted variable 'raw_input' in the insecure function 'eval'."
          },
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {
                  "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                },
                "region": {
                  "startLine": 8
                }
              }
            }
          ],
          "codeFlows": [
            {
              "message": {
                "text": "Tracing the path from user input to insecure usage."
              },
              "threadFlows": [
                {
                  "locations": [
                    {
                      "message": {
                        "text": "The tainted data enters the system here."
                      },
                      "location": {
                        "physicalLocation": {
                          "artifactLocation": {
                            "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                          },
                          "region": {
                            "startLine": 4
                          }
                        }
                      },
                      "state": {
                        "expr": {
                          "text": "42"
                        }
                      },
                      "nestingLevel": 0
                    },
                    {
                      "message": {
                        "text": "The tainted data is used insecurely here."
                      },
                      "location": {
                        "physicalLocation": {
                          "artifactLocation": {
                            "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                          },
                          "region": {
                            "startLine": 8
                          }
                        }
                      },
                      "state": {
                        "raw_input": {
                          "text": "42"
                        }
                      },
                      "nestingLevel": 1
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

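This is only a simple example. SARIF's codeFlow can accommodate much more complex analyses, helping users understand a problem better and decide on and make a fix quickly.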
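How a viewer might turn the code flow above into a readable trace can be sketched as follows. The names render_code_flows and make_step are hypothetical. The message is read from the step itself, as in the example above, or from the nested location object, which also has a message property in the SARIF schema.

```python
def render_code_flows(result):
    """Return one text line per threadFlowLocation, indented by nestingLevel."""
    lines = []
    for code_flow in result.get("codeFlows", []):
        for thread_flow in code_flow["threadFlows"]:
            for number, tfl in enumerate(thread_flow["locations"], start=1):
                location = tfl["location"]
                phys = location["physicalLocation"]
                uri = phys["artifactLocation"]["uri"]
                line = phys["region"]["startLine"]
                message = location.get("message") or tfl.get("message") or {}
                # state maps variable or expression names to message strings.
                state = ", ".join(f"{name}={value['text']}"
                                  for name, value in tfl.get("state", {}).items())
                indent = "  " * tfl.get("nestingLevel", 0)
                entry = f"{indent}{number}. {uri}:{line} {message.get('text', '')}"
                if state:
                    entry += f" [{state}]"
                lines.append(entry)
    return lines


def make_step(line, text, state, nesting):
    """Build one threadFlowLocation shaped like those in the example above."""
    return {
        "message": {"text": text},
        "location": {"physicalLocation": {
            "artifactLocation": {"uri": "3-Beyond-basics/bad-eval-with-code-flow.py"},
            "region": {"startLine": line}}},
        "state": state,
        "nestingLevel": nesting,
    }


result = {"codeFlows": [{"threadFlows": [{"locations": [
    make_step(4, "The tainted data enters the system here.",
              {"expr": {"text": "42"}}, 0),
    make_step(8, "The tainted data is used insecurely here.",
              {"raw_input": {"text": "42"}}, 1),
]}]}]}

for entry in render_code_flows(result):
    print(entry)
```

The indentation by nestingLevel makes the call into use_input visible as a deeper step in the trace.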

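2.7. Defect fingerprints (fingerprint)

In a large software project, an analysis tool can produce thousands of results in a single run. To manage that many results, we need to record the existing defects, establish a scan baseline, and then work through the existing problems. Later scans must be compared against the baseline to tell whether new problems have been introduced. To determine whether a result from a later run is logically the same as a result in the baseline, an algorithm must build a stable identifier from information specific to the result. We call this identifier a fingerprint. Because it captures what distinguishes one defect from the others, we also call it the defect fingerprint.

A defect fingerprint should be built from defect information that stays relatively stable:

The name of the tool that produced the result;
The rule ID;
The file system path of the analysis target. This should be the path relative to the project itself and should not include the location where the project is stored, because that can differ from machine to machine;
The defect's characteristic values (partialFingerprints).

Each SARIF result provides a set of properties for storing defect fingerprints, so that a defect management system can use them to identify each defect uniquely.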

    "result": {
      "description": "A result produced by an analysis tool.",
      "additionalProperties": false,
      "type": "object",
      "properties": {
        ... ...
        "guid": {
          "description": "A stable, unique identifier for the result in the form of a GUID.",
          "type": "string",
          "pattern": "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
        },

        "correlationGuid": {
          "description": "A stable, unique identifier for the equivalence class of logically identical results to which this result belongs, in the form of a GUID.",
          "type": "string",
          "pattern": "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
        },

        "occurrenceCount": {
          "description": "A positive integer specifying the number of times this logically unique result was observed in this run.",
          "type": "integer",
          "minimum": 1
        },

        "partialFingerprints": {
          "description": "A set of strings that contribute to the stable, unique identity of the result.",
          "type": "object",
          "additionalProperties": {
            "type": "string"
          }
        },

        "fingerprints": {
          "description": "A set of strings each of which individually defines a stable, unique identity for the result.",
          "type": "object",
          "additionalProperties": {
            "type": "string"
          }
        },
        ... ...
      }
    }

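In some cases the inherent information of a defect is not enough to identify a result uniquely. We then need to add attribute values that are strongly tied to the defect as extra input to the fingerprint computation, so that the final fingerprint is unique. This resembles the salt used with cryptographic hashes, except that the value must be reproducible: the next scan has to yield the same input for the same defect, and therefore the same fingerprint as before. For example, if a tool checks documents for sensitive words and reports "xxx should not be used in the document.", the word itself can serve as a characteristic value of the defect.

SARIF provides the partialFingerprints property to hold such values so that analysis tools and other components in the SARIF ecosystem can use them. A defect management system can fold them into the fingerprint it constructs for each result. In the example above, the tool could set the value of a property in the partialFingerprints object to the forbidden word, and the defect management system should include the information in partialFingerprints in its fingerprint computation.

Only attributes strongly tied to the nature of the defect, and whose values are relatively stable, should be added to partialFingerprints. The line number where the defect occurs, for example, is a poor input for the fingerprint: line numbers change often, and by the next scan a developer may well have added or removed lines above the problem line, so the same problem is reported at a different line, the computed fingerprint changes, and the comparison reports a difference.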
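One way a defect management system might combine these stable inputs is sketched below. SARIF does not prescribe a fingerprint algorithm, so the choice of SHA-256 over a canonical JSON string is an assumption of this sketch, as are the function name compute_fingerprint, the tool name DocScanner, the rule ID DOC0001, and the partialFingerprints key forbiddenWord.

```python
import hashlib
import json


def compute_fingerprint(tool_name, result):
    """Hash the stable parts of a result: tool name, rule ID, relative path,
    and partialFingerprints. The line number is deliberately left out."""
    uri = result["locations"][0]["physicalLocation"]["artifactLocation"]["uri"]
    stable_parts = {
        "tool": tool_name,
        "rule": result["ruleId"],
        "path": uri,  # assumed to be relative to the project root
        "partial": result.get("partialFingerprints", {}),
    }
    # sort_keys gives the same string, and so the same hash, on every run.
    canonical = json.dumps(stable_parts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_result(word, line):
    """Build a hypothetical result for the sensitive-word example."""
    return {
        "ruleId": "DOC0001",
        "partialFingerprints": {"forbiddenWord": word},
        "locations": [{"physicalLocation": {
            "artifactLocation": {"uri": "docs/guide.md"},
            "region": {"startLine": line}}}],
    }


first_scan = compute_fingerprint("DocScanner", make_result("xxx", 10))
after_edit = compute_fingerprint("DocScanner", make_result("xxx", 14))
print(first_scan == after_edit)
```

Because the line number is not hashed, the same finding keeps its fingerprint after lines are added above it, while a different forbidden word yields a different fingerprint.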

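Even though we try to find unique identifying features for each defect and add further distinguishing attributes, it remains hard to design an algorithm that yields a truly stable fingerprint. In the example above, if the same sensitive word occurs several times in one file, we still cannot give each alert a unique identifier. The function name can be added as another input, since function names are relatively stable within a program and help narrow down where in a file the problem occurs, but several identical defects can still occur inside one function. So however hard we try to tell alerts apart, identical fingerprints will still occur in real scans.

Fortunately, for practical purposes a fingerprint does not have to be absolutely stable. It only needs to be stable enough that the number of results wrongly reported as "new" drops low enough for the development team to manage them without undue effort.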
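Since identical fingerprints can occur, a baseline comparison can count occurrences instead of treating fingerprints as a set. The sketch below reports, per fingerprint, how many results in the current scan exceed what the baseline already contained. The function name count_new_results and the short fingerprint strings are hypothetical placeholders.

```python
from collections import Counter


def count_new_results(baseline_fingerprints, current_fingerprints):
    """Return {fingerprint: number of occurrences beyond the baseline}."""
    baseline = Counter(baseline_fingerprints)
    current = Counter(current_fingerprints)
    # Counter subtraction keeps only the positive differences.
    return dict(current - baseline)


baseline = ["fp-a", "fp-b", "fp-b"]               # fp-b occurred twice
current = ["fp-a", "fp-b", "fp-b", "fp-b", "fp-c"]

print(count_new_results(baseline, current))
```

Here one extra occurrence of fp-b and the previously unseen fp-c are reported as new, while the results already in the baseline are not.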

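3. Summary

SARIF defines a common, standard output format for static analysis tools and can meet the various requirements of their report output;
For integrating static analysis tools into a DevSecOps platform, SARIF reduces the cost and complexity of aggregating scan results into a common workflow;
SARIF also makes it possible for IDEs to integrate results from different scanners through a unified defect handling module that displays defects, supports fixes, and so on. Tool vendors can then focus on finding problems and spend less effort adapting to each IDE;
SARIF has become an OASIS standard and is supported in tools from major static analysis vendors such as Microsoft and GrammaTech. U.S. DHS and U.S. NIST have also required SARIF as the report format in some static analysis tool evaluations and competitions;
Although SARIF was designed mainly for static analysis results, its general design has allowed some dynamic analysis tool vendors to apply it successfully as well.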

4. References

Industry leaders collaborate to define SARIF interoperability standard for detecting software defects and vulnerabilities
OASIS Awards 2018 Open Standards Cup to KMIP for Key Management Security and SARIF for Static Analysis Tools
OASIS Static Analysis Results Interchange Format (SARIF) Technical Committee
SARIF Specification
SARIF Tutorials
Vscode Extension: Sarif Viewer
SARIF-SDK
Fortify FPR to SARIF
GrammaTech SARIF integration for GitHub
Static Analysis Results: A Format and a Protocol: SARIF & SASP
A brief look at language server & LSIF & SARIF & Babelfish & Semantic & Tree-sitter & Kythe & Glean, etc.


The Bridge Between DevSecOps Tools and Platforms -- Advanced SARIF