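CodeQL Study Notes | Drunkbaby's Blog

CodeQL study notes

Most of this content is based on the blog of 淚笑:
https://l3yx.github.io/2022/03/05/CodeQL%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0

Looking back after finishing it, I don't think this material should be read first: its difficulty sits somewhere between beginner and intermediate. For a true introduction, start with the article "CodeQL从入门到放弃" (CodeQL: From Getting Started to Giving Up).

I also plan to write some introductory CodeQL articles later.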

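0x01 Environment / Installation

Download the CodeQL CLI

https://github.com/github/codeql-cli-binaries/releases

Here I renamed /codeql/codeql to codeql/codeql-cli and added it to the system PATH environment variable.

Then run the codeql command in a terminal to verify that the setup works.

Download the workspace that contains the standard libraries

cd /xxx/CodeQL/
git clone https://github.com/github/vscode-codeql-starter.git
cd vscode-codeql-starter
git submodule update --init --remote
git submodule update --remote	# run periodically to update the submodules

Install the VSCode CodeQL extension

Search for CodeQL in the VSCode marketplace and install it, then set the CodeQL executable path in the extension settings to /xxx/CodeQL/codeql-cli/codeql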

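0x02 Running CodeQL

Before running CodeQL you need to build a database. A database is essentially a packaged snapshot of the target source code, extracted while the project is built. The corresponding command:

codeql database create E:\Coding\CodeQL\CodeQLearning\javaCodeQLTest --language="java" --source-root=E:\Coding\JavaSec\Eliauk --command="mvn clean package -Dmaven.test.skip=true"

The first argument (here E:\Coding\CodeQL\CodeQLearning\javaCodeQLTest) is where the database is created
--source-root is the path to the project source
--command is the build command. It is not needed for interpreted languages such as Python and JavaScript, and can also be omitted for Maven, Ant and similar projects

If --command is not specified, CodeQL calls ./java/tools/autobuild.cmd or ./java/tools/autobuild.sh, depending on the platform, to analyze the project. If the project is built with Gradle, Maven or Ant and the matching configuration file can be found, the script enters the corresponding flow and runs the relevant build commands. CodeQL collects the information produced during the build and generates the database from it. If the project uses none of Gradle, Maven or Ant, it exits with an error.

Then open this folder in the QL view of VSCode

Then run CodeQL queries against the database, as follows

Query results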

0x03 Some examples

Java taint tracking

How Java taint tracking works in CodeQL

A global taint tracking analysis extends the class TaintTracking::Configuration and overrides the isSource and isSink predicates

class VulConfig extends TaintTracking::Configuration {
  VulConfig() { this = "myConfig" }

  override predicate isSource(DataFlow::Node source) {

  }

  override predicate isSink(DataFlow::Node sink) {

  }
}

from VulConfig config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "source are"

Taking GetenvSource to URLSink as an example

/**  
 * @kind path-problem  
 */  
  
import java  
import semmle.code.java.dataflow.TaintTracking  
import DataFlow::PathGraph  
  
class GetenvSource extends DataFlow::ExprNode {  
  GetenvSource() {  
    exists(Method m | m = this.asExpr().(MethodAccess).getMethod() |  
      m.hasName("getenv") and  
      m.getDeclaringType() instanceof TypeSystem  
    )  
  }  
}  
  
class URLSink extends DataFlow::ExprNode {  
  URLSink() {  
    exists(Call call |  
      this.asExpr() = call.getArgument(0) and  
      call.getCallee().(Constructor).getDeclaringType().hasQualifiedName("java.net", "URL")  
    )  
  }  
}  
  
class GetenvToURLTaintTrackingConfig extends TaintTracking::Configuration {  
  GetenvToURLTaintTrackingConfig() { this = "GetenvToURLTaintTrackingConfig" }  
  
  override predicate isSource(DataFlow::Node source) { source instanceof GetenvSource }  
  
  override predicate isSink(DataFlow::Node sink) { sink instanceof URLSink }  
}  
  
from GetenvToURLTaintTrackingConfig cfg, DataFlow::PathNode source, DataFlow::PathNode sink  
where cfg.hasFlowPath(source, sink)  
select sink, source, sink, "-"

Python taint tracking

Taking RemoteFlowSource to FileSystemAccess sink as an example

/**  
 * @kind path-problem  
 */  
  
import python  
import semmle.python.dataflow.new.DataFlow  
import semmle.python.dataflow.new.TaintTracking  
import semmle.python.dataflow.new.RemoteFlowSources  
import semmle.python.Concepts  
import DataFlow::PathGraph  
  
class RemoteToFileConfiguration extends TaintTracking::Configuration {  
  RemoteToFileConfiguration() { this = "RemoteToFileConfiguration" }  
  
  override predicate isSource(DataFlow::Node source) { source instanceof RemoteFlowSource }  
  
  override predicate isSink(DataFlow::Node sink) {  
    sink = any(FileSystemAccess fa).getAPathArgument()  
  }  
}  
  
from RemoteToFileConfiguration cfg, DataFlow::PathNode source, DataFlow::PathNode sink  
where cfg.hasFlowPath(source, sink)  
select sink, source, sink, "-"

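0x04 CodeQL For Java

CodeQL metadata

Query metadata is included at the top of each query file as the content of a QLDoc comment. This metadata tells LGTM and the CodeQL extension for VSCode how to handle the query and display its results correctly.

For example: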

/**  
* @name Empty block  
* @kind problem  
* @problem.severity warning  
* @id java/example/empty-block  
*/  
  
import java  
  
from BlockStmt b  
where b.getNumStmt() = 0  
select b, "This is an empty block."

/**  
 * @kind path-problem  
 */

Basic query for Java code

The following query finds redundant if statements, that is, those whose then branch is empty, such as if (...) { }

import java  
  
from IfStmt ifstmt, BlockStmt blockstmt  
where ifstmt.getThen() = blockstmt and  
blockstmt.getNumStmt() = 0  
select ifstmt, "This 'if' statement is redundant."

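Let's break the syntax down step by step, starting with import java

import java
Imports the standard CodeQL library for Java. Every query begins with one or more import statements

Next come the variables that the query ranges over

from IfStmt ifstmt, BlockStmt blockstmt
Defines the variables of the query. Declarations have the form <type> <variable name>
IfStmt: an if statement
BlockStmt: a block of statements

where ifstmt.getThen() = blockstmt and blockstmt.getNumStmt() = 0

Defines conditions on the variables. ifstmt.getThen() = blockstmt relates the two variables: blockstmt must be the then branch of the if statement.
blockstmt.getNumStmt() = 0 states that the block must be empty (that is, contain no statements)
IfStmt::getThen: Stmt getThen(), member predicate, gets the then branch of this if statement
BlockStmt::getNumStmt: int getNumStmt(), member predicate, gets the number of immediate child statements in this block

select ifstmt, "This 'if' statement is redundant."

Defines what to report for each match. A select clause in a query that finds instances of poor coding practice always has the form: select <program element>, "<alert message>"

Browsing the results, you may find if statements with an else branch in which the empty then branch does serve a purpose. For example: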

if (...) {  
  ...  
} else if ("-verbose".equals(option)) {  
  // nothing to do - handled earlier  
} else {  
  error("unrecognized option");  
}

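The result of running the CodeQL query is shown in the figure

In this case, reporting the if statement with the empty then branch as redundant is a mistake. One solution is to ignore empty then branches when the if statement has an else branch: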

import java  
  
from IfStmt ifstmt, BlockStmt blockstmt  
where ifstmt.getThen() = blockstmt and  
blockstmt.getNumStmt() = 0 and  
not exists(ifstmt.getElse())  
select ifstmt, "This 'if' statement is redundant."

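IfStmt::getElse: Stmt getElse(), member predicate, gets the else branch of this if statement

The CodeQL library for Java

The most important classes in the standard Java library fall into the following five categories

1. Classes representing program elements (such as Java classes and methods)
2. Classes representing AST nodes (such as statements and expressions)
3. Classes representing metadata (such as comments and annotations)
4. Classes for computing metrics (such as cyclomatic complexity and coupling)
5. Classes for navigating the program's call graph

Program elements

These include packages (Package), compilation units (CompilationUnit), types (Type), methods (Method), constructors (Constructor) and variables (Variable)
Their common superclass is Element, which provides general member predicates for determining the name of a program element and for checking whether one element is nested inside another

Callable is the common superclass of Method and Constructor. It is often convenient to refer to an element that may be either a method or a constructor through Callable

Types

The class Type has a number of subclasses for representing different kinds of types:

PrimitiveType represents a primitive type, that is, one of boolean, byte, char, double, float, int, long, short. QL also classifies void and <nulltype> as primitive types
RefType represents a reference type and has the following subclasses:
- Class, a Java class
- Interface, a Java interface
- EnumType, a Java enum type
- Array, a Java array type

For example, the following query finds all variables of type int in the program: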

import java  
  
from Variable v, PrimitiveType pt  
where pt = v.getType() and   
    pt.hasName("int")  
select v

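Variable::getType: Type getType(), gets the type of the variable
Element::hasName: predicate hasName(string name), holds if the element has the specified name

Reference types are also categorized by the scope in which they are declared:

TopLevelType represents a reference type declared at the top level of a compilation unit
NestedType is a type declared inside another type

For example, this query finds all top-level types whose name differs from the name of their compilation unit: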

import java  
  
from TopLevelType tl  
where tl.getName() != tl.getCompilationUnit().getName()  
select tl

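Element::getName: string getName(), gets the name of the element
RefType::getCompilationUnit: CompilationUnit getCompilationUnit(), gets the compilation unit in which this type is declared
CompilationUnit::getName: string getName(), gets the name of the compilation unit (without its extension)

There are also several more specialized classes:

TopLevelClass represents a class declared at the top level of a compilation unit
NestedClass represents a class declared inside another type, such as
- LocalClass, a class declared inside a method or constructor
- AnonymousClass, an anonymous class

Finally, the library has a number of singleton classes that wrap frequently used classes of the Java standard library:

TypeObject, TypeCloneable, TypeRuntime, TypeSerializable, TypeString, TypeSystem and TypeClass

For example, we can write a query that finds all nested classes that directly extend Object: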

import java  
  
from NestedClass nc  
where nc.getASupertype() instanceof TypeObject  
select nc

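RefType::getASupertype: RefType getASupertype(), gets a direct supertype of this type

Generics

Type also has several subclasses for dealing with generic types
GenericType is either a GenericInterface or a GenericClass. It represents a generic type declaration, such as the interface java.util.Map

package java.util;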
  
public interface Map<K, V> {  
    int size();  
  
    // ...  
}

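Type parameters, such as K and V in this example, are represented by the class TypeVariable

A parameterized instance of a generic type supplies concrete types to instantiate the type parameters, as in Map<String, File>. Such a type is represented by ParameterizedType, which is distinct from the GenericType it was instantiated from. To go from a ParameterizedType to its corresponding GenericType, use the predicate getSourceDeclaration.

For example, we can use the following query to find all parameterized instances of java.util.Map: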

import java  
  
from GenericInterface map, ParameterizedType pt  
where map.hasQualifiedName("java.util", "Map")   
    and pt.getSourceDeclaration() = map  
select pt

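Variables

The class Variable represents a variable in the Java sense: either a member field of a class (static or not), a local variable, or a parameter. Accordingly, there are three subclasses for these special cases:

Field represents a Java field
LocalVariableDecl represents a local variable
Parameter represents a parameter of a method or constructor

Abstract syntax tree

Classes in this category represent abstract syntax tree (AST) nodes, that is, statements (class Stmt) and expressions (class Expr). For a full list of the expression and statement types available in the standard QL library, see "Abstract syntax tree classes for working with Java programs"

Both Expr and Stmt provide member predicates for exploring the abstract syntax tree of a program:

Expr.getAChildExpr returns a sub-expression of the given expression
Stmt.getAChild returns a statement or expression nested directly inside the given statement
Expr.getParent and Stmt.getParent return the parent node of an AST node

For example, the following query finds all expressions whose parent is a return statement: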

import java  
  
from Expr e  
where e.getParent() instanceof ReturnStmt  
select e

The following query finds statements whose parent is an if statement (it finds the then and else branches of every if statement in the program):

import java  
  
from Stmt s  
where s.getParent() instanceof IfStmt  
select s

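Finally, here is a query that finds method bodies:

import java

from Stmt s
where s.getParent() instanceof Method
select s

As these examples show, the parent node of an expression is not always an expression: it may also be a statement, such as an IfStmt. Similarly, the parent node of a statement is not always a statement: it may also be a method or a constructor. To capture this, the QL Java library provides two abstract classes, ExprParent and StmtParent. The former represents any node that may be the parent of an expression, the latter any node that may be the parent of a statement

For more on working with AST classes, see "Overflow-prone comparisons in Java"

Metadata

Besides the program code itself, Java programs carry several kinds of metadata, in particular annotations and Javadoc comments. Since this metadata is interesting both for enhancing code analysis and as a subject of analysis in its own right, the QL library defines classes for accessing it

For annotations, the class Annotatable is a superclass of all program elements that can be annotated. This includes packages, reference types, fields, methods, constructors and local variable declarations. For every such element, its predicate getAnAnnotation retrieves any annotation the element may have. For example, the following query finds all annotations on constructors: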

import java  
  
from Constructor c  
select c.getAnAnnotation()

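These annotations are represented by the class Annotation. An annotation is simply an expression whose type is an AnnotationType. For example, the query can be modified so that it reports only the Deprecated annotations on constructors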

import java  
  
from Constructor c, Annotation ann, AnnotationType anntp  
where ann = c.getAnAnnotation() and  
    anntp = ann.getType() and  
    anntp.hasQualifiedName("java.lang", "Deprecated")  
select ann

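For more on working with annotations, see "Annotations in Java"

For Javadoc, the class Element has a member predicate getDoc that returns a delegate Documentation object, which can then be queried for its attached Javadoc comment. For example, the following query finds Javadoc comments on private fields: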

import java

from Field f, Javadoc jdoc
where f.isPrivate() and
    jdoc = f.getDoc().getJavadoc()
select jdoc

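The class Javadoc represents an entire Javadoc comment as a tree of JavadocElement nodes, which can be traversed with the member predicates getAChild and getParent. For example, you can edit the query so that it finds all @author tags in the Javadoc comments of private fields: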

import java  
  
from Field f, Javadoc jdoc, AuthorTag at  
where f.isPrivate() and  
    jdoc = f.getDoc().getJavadoc() and  
    at.getParent+() = jdoc  
select at

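On the + (transitive closure) operator used in getParent+(), see "Recursion" in the CodeQL documentation

For more on working with Javadoc, see "Javadoc"

Metrics

The standard QL Java library provides extensive support for computing metrics on Java program elements. To avoid overburdening the classes that represent those elements with too many metric-related member predicates, these predicates are placed on delegate classes

There are six such classes in total: MetricElement, MetricPackage, MetricRefType, MetricField, MetricCallable and MetricStmt. Each corresponding element class provides a member predicate getMetrics that returns an instance of the delegate class, on which the metric can then be computed. For example, the following query finds methods with a cyclomatic complexity greater than 5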

import java

from Method m, MetricCallable mc
where mc = m.getMetrics() and
    mc.getCyclomaticComplexity() > 5
select m

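Call graph

A CodeQL database generated from Java code contains precomputed information about the program's call graph, that is, which methods or constructors a given call may dispatch to at run time.

The class Callable introduced earlier covers both methods and constructors. Call expressions are abstracted by the class Call, which covers method calls, new expressions, and explicit constructor calls using this or super

We can use the predicate Call.getCallee to find the method or constructor that a particular call expression refers to. For example, the following query finds all calls to methods named println: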

import java  
  
from Call c, Method m  
where m = c.getCallee() and  
    m.hasName("println")  
select c

Conversely, Callable.getAReference returns a Call that refers to it. So we can use this query to find methods or constructors that are never called:

import java

from Callable c
where not exists(c.getAReference())
select c

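For more on callables and calls, see "Navigating the call graph"

0x05 Data flow analysis in Java

Data flow analysis computes the possible values that a variable can hold at various points in a program, determining how those values propagate through the program and where they are used

Local data flow

Local data flow is data flow within a single method or callable. It is usually easier, faster and more precise than global data flow, and is sufficient for many queries

Using local data flow

The local data flow library lives in the module DataFlow, which defines the class Node to denote any element through which data can flow. Nodes are divided into expression nodes (ExprNode) and parameter nodes (ParameterNode). The member predicates asExpr and asParameter map between data flow nodes and expressions/parameters: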

class Node {  
  /** Gets the expression corresponding to this node, if any. */  
  Expr asExpr() { ... }  
  
  /** Gets the parameter corresponding to this node, if any. */  
  Parameter asParameter() { ... }  
  
  ...  
}

Alternatively, use the predicates exprNode and parameterNode

/**  
 * Gets the node corresponding to expression `e`.  
 */  
ExprNode exprNode(Expr e) { ... }  
  
/**  
 * Gets the node corresponding to the value of parameter `p` at function entry.  
 */  
ParameterNode parameterNode(Parameter p) { ... }

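The predicate localFlowStep(Node nodeFrom, Node nodeTo) holds if there is an immediate data flow edge from node nodeFrom to node nodeTo. It can be applied recursively with the + and * operators, or through the predefined recursive predicate localFlow (equivalent to localFlowStep*)

For example, flow from a parameter source to an expression sink in zero or more local steps can be found with:

DataFlow::localFlow(DataFlow::parameterNode(source), DataFlow::exprNode(sink))

Using local taint tracking

Local taint tracking extends local data flow with flow steps that do not preserve values. For example:

String temp = x;
String y = temp + ", " + temp;

If x is a tainted string, then y is tainted too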
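To make the difference between a value-preserving step and a taint step concrete, here is a small Java sketch. The class and method names are my own, purely for illustration: copy only moves the same value around, so plain local data flow connects x to the result, while derive builds a new string that merely contains x, which is the kind of step only taint tracking follows.

```java
// Hypothetical demo class; all names here are illustrative assumptions.
final class TaintStepDemo {
    // Value-preserving: the result is the very object that was passed in.
    static String copy(String x) {
        String temp = x;
        return temp;
    }

    // Not value-preserving: concatenation creates a new string derived from x.
    static String derive(String x) {
        String temp = x;
        String y = temp + ", " + temp;
        return y;
    }

    public static void main(String[] args) {
        String x = "tainted";
        System.out.println(copy(x) == x); // identity is preserved by copy
        System.out.println(derive(x));    // a different value that still carries x
    }
}
```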

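The local taint tracking library lives in the module TaintTracking. As with local data flow, the predicate localTaintStep(DataFlow::Node nodeFrom, DataFlow::Node nodeTo) holds if there is an immediate taint propagation edge from node nodeFrom to node nodeTo. The predicate can be applied recursively with the + and * operators, or through the predefined recursive predicate localTaint (equivalent to localTaintStep*)

For example, taint propagation from a parameter source to an expression sink in zero or more local steps can be found with:

TaintTracking::localTaint(DataFlow::parameterNode(source), DataFlow::exprNode(sink))

Examples

This query finds the file name passed to new FileReader(..)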

import java  
  
from Constructor fileReader, Call call  
where  
  fileReader.getDeclaringType().hasQualifiedName("java.io", "FileReader") and  
  call.getCallee() = fileReader  
select call.getArgument(0)

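Member::getDeclaringType: RefType getDeclaringType(), gets the type in which this member is declared

But this only gives the expression in the argument position, not the values that could be passed to it. So we use local data flow to find all expressions that flow into the argument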

import java
import semmle.code.java.dataflow.DataFlow

from Constructor fileReader, Call call, Expr src
where
  fileReader.getDeclaringType().hasQualifiedName("java.io", "FileReader") and
  call.getCallee() = fileReader and
  DataFlow::localFlow(DataFlow::exprNode(src), DataFlow::exprNode(call.getArgument(0)))
select src

We can then make the source more specific, for example an access to a public parameter. This query finds where a public parameter is passed to new FileReader(..):

import java
import semmle.code.java.dataflow.DataFlow

from Constructor fileReader, Call call, Parameter p
where
  fileReader.getDeclaringType().hasQualifiedName("java.io", "FileReader") and
  call.getCallee() = fileReader and
  DataFlow::localFlow(DataFlow::parameterNode(p), DataFlow::exprNode(call.getArgument(0)))
select p

This query finds calls to formatting functions where the format string is not hard-coded

import java  
import semmle.code.java.dataflow.DataFlow  
import semmle.code.java.StringFormat  
  
from StringFormatMethod format, MethodAccess call, Expr formatString  
where  
call.getMethod() = format and  
call.getArgument(format.getFormatStringIndex()) = formatString and  
not exists(DataFlow::Node source, DataFlow::Node sink |  
DataFlow::localFlow(source, sink) and  
source.asExpr() instanceof StringLiteral and  
sink.asExpr() = formatString  
)  
select call, "Argument to String format method isn't hard-coded."
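As a rough illustration of what this query distinguishes, consider the two hypothetical Java methods below (the class and method names are mine, not from any real project). In hardCoded a string literal flows locally into the format-string argument, so the call should not be reported. In fromParameter the format string arrives through a parameter and no literal flows to it inside the method, so the call should be reported.

```java
// Hypothetical demo class; all names here are illustrative assumptions.
final class FormatStringDemo {
    // A string literal reaches String.format through a local variable.
    static String hardCoded(String name) {
        String fmt = "Hello, %s!";
        return String.format(fmt, name);
    }

    // The format string is supplied by the caller and may be anything.
    static String fromParameter(String fmt, String name) {
        return String.format(fmt, name);
    }
}
```

A caller-controlled format string is risky even without an attacker: passing "%s-%s" to fromParameter with a single argument makes String.format throw an IllegalFormatException.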

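exists: exists(<variable declarations> | <formula>). It can also be written exists(<variable declarations> | <formula 1> | <formula 2>), which is equivalent to exists(<variable declarations> | <formula 1> and <formula 2>). This quantifier introduces some new variables and holds if there is at least one set of values for them that makes the formula in the body true. For example, exists(int i | i instanceof OneTwoThree) introduces a temporary variable i of type int and holds if any value of i is of type OneTwoThree

StringLiteral: class StringLiteral, a string literal or text block (a Java 15 feature)

Exercises

The corresponding answers are given at the end of this section

Exercise 1: Using local data flow, write a query that finds all hard-coded strings used to create a java.net.URL

Global data flow

Global data flow tracks data flow throughout the entire program and is therefore more powerful than local data flow. However, it is less precise than local data flow, and the analysis typically needs considerably more time and memory.

Using global data flow

The global data flow library is used by extending the class DataFlow::Configuration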

import semmle.code.java.dataflow.DataFlow

class MyDataFlowConfiguration extends DataFlow::Configuration {
  MyDataFlowConfiguration() { this = "MyDataFlowConfiguration" }

  override predicate isSource(DataFlow::Node source) {
    ...
  }

  override predicate isSink(DataFlow::Node sink) {
    ...
  }
}

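These predicates are defined in the configuration:

isSource: defines where data may flow from
isSink: defines where data may flow to
isBarrier: optional, restricts the data flow
isAdditionalFlowStep: optional, adds extra flow steps

The characteristic predicate MyDataFlowConfiguration() defines the name of the configuration, so "MyDataFlowConfiguration" should be a unique name, for example the name of your class

The data flow analysis is performed with the predicate hasFlow(DataFlow::Node source, DataFlow::Node sink):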

from MyDataFlowConfiguration dataflow, DataFlow::Node source, DataFlow::Node sink  
where dataflow.hasFlow(source, sink)  
select source, "Data flow to $@.", sink, sink.toString()

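Using global taint tracking

Global taint tracking is to global data flow what local taint tracking is to local data flow. That is, global taint tracking extends global data flow with additional steps that do not preserve values.

See also: Difference between DataFlow::Configuration and TaintTracking::Configuration

The global taint tracking library is used by extending the class TaintTracking::Configuration: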

import semmle.code.java.dataflow.TaintTracking  
  
class MyTaintTrackingConfiguration extends TaintTracking::Configuration {  
  MyTaintTrackingConfiguration() { this = "MyTaintTrackingConfiguration" }  
  
  override predicate isSource(DataFlow::Node source) {  
    ...  
  }  
  
  override predicate isSink(DataFlow::Node sink) {  
    ...  
  }  
}

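These predicates are defined in the configuration:

isSource: defines where taint may come from
isSink: defines where taint may flow to
isSanitizer: optional, restricts the flow of taint
isAdditionalTaintStep: optional, adds further taint steps

As with global data flow, the characteristic predicate MyTaintTrackingConfiguration() defines the unique name of the configuration

The taint tracking analysis is performed with the predicate hasFlow(DataFlow::Node source, DataFlow::Node sink)

Flow sources

The data flow library contains some predefined flow sources. The class RemoteFlowSource (defined in semmle.code.java.dataflow.FlowSources) represents data flow sources that may be controlled by a remote user, which is useful for finding security problems

Example

This query shows a taint tracking configuration that uses remote user input as the data source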

import java  
import semmle.code.java.dataflow.FlowSources  
  
class MyTaintTrackingConfiguration extends TaintTracking::Configuration {  
  MyTaintTrackingConfiguration() {  
    this = "..."  
  }  
  
  override predicate isSource(DataFlow::Node source) {  
    source instanceof RemoteFlowSource  
  }  
  
  ...  
}

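Exercises

Exercise 2: Write a query that uses global data flow to find all hard-coded strings used to create a java.net.URL

Exercise 3: Write a class that represents flow sources from java.lang.System.getenv(..)

Exercise 4: Using the answers to 2 and 3, write a query that finds all global data flows from getenv to java.net.URL

Exercise answers

Exercise 1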

import semmle.code.java.dataflow.DataFlow  
  
from Constructor url, Call call, StringLiteral src  
where  
  url.getDeclaringType().hasQualifiedName("java.net", "URL") and  
  call.getCallee() = url and  
  DataFlow::localFlow(DataFlow::exprNode(src), DataFlow::exprNode(call.getArgument(0)))  
select src

Exercise 2

import semmle.code.java.dataflow.DataFlow  
  
class Configuration extends DataFlow::Configuration {  
  Configuration() {  
    this = "LiteralToURL Configuration"  
  }  
  
  override predicate isSource(DataFlow::Node source) {  
    source.asExpr() instanceof StringLiteral  
  }  
  
  override predicate isSink(DataFlow::Node sink) {  
    exists(Call call |  
      sink.asExpr() = call.getArgument(0) and  
      call.getCallee().(Constructor).getDeclaringType().hasQualifiedName("java.net", "URL")  
    )  
  }  
}  
  
from DataFlow::Node src, DataFlow::Node sink, Configuration config  
where config.hasFlow(src, sink)  
select src, "This string constructs a URL $@.", sink, "here"

Exercise 3

import java  
  
class GetenvSource extends MethodAccess {  
  GetenvSource() {  
    exists(Method m | m = this.getMethod() |  
      m.hasName("getenv") and  
      m.getDeclaringType() instanceof TypeSystem  
    )  
  }  
}

Exercise 4

import semmle.code.java.dataflow.DataFlow  
  
class GetenvSource extends DataFlow::ExprNode {  
  GetenvSource() {  
    exists(Method m | m = this.asExpr().(MethodAccess).getMethod() |  
      m.hasName("getenv") and  
      m.getDeclaringType() instanceof TypeSystem  
    )  
  }  
}  
  
class GetenvToURLConfiguration extends DataFlow::Configuration {  
  GetenvToURLConfiguration() {  
    this = "GetenvToURLConfiguration"  
  }  
  
  override predicate isSource(DataFlow::Node source) {  
    source instanceof GetenvSource  
  }  
  
  override predicate isSink(DataFlow::Node sink) {  
    exists(Call call |  
      sink.asExpr() = call.getArgument(0) and  
      call.getCallee().(Constructor).getDeclaringType().hasQualifiedName("java.net", "URL")  
    )  
  }  
}  
  
from DataFlow::Node src, DataFlow::Node sink, GetenvToURLConfiguration config  
where config.hasFlow(src, sink)  
select src, "This environment variable constructs a URL $@.", sink, "here"
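For a feel of the kind of Java code this query is after, here is a small hypothetical target (the class, method and variable names are my own assumptions). The value returned by System.getenv crosses a method boundary before it reaches the first argument of new URL(..), so local data flow alone would miss it, while the global configuration above should connect source and sink.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class; all names here are illustrative assumptions.
final class GetenvUrlDemo {
    // Sink: the first argument of the java.net.URL constructor.
    static URL toUrl(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not a valid URL: " + spec, e);
        }
    }

    // Source: System.getenv(..). The value flows into toUrl, a different method.
    static URL endpointFromEnv(String variable) {
        String value = System.getenv(variable);
        return value == null ? null : toUrl(value);
    }
}
```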

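Types in Java

The standard CodeQL library represents Java types through the class Type and its various subclasses

The class PrimitiveType represents the primitive types built into the Java language (such as boolean and int), while RefType and its subclasses represent reference types, that is, classes, interfaces, array types and so on. This includes both types from the Java standard library (such as java.lang.Object) and types defined by non-library code

The class RefType also models the class hierarchy: the member predicates getASupertype and getASubtype find the direct supertypes and subtypes of a reference type. For example, for the following Java program: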

class A {}  
  
interface I {}  
  
class B extends A implements I {}

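Class A has one direct supertype (java.lang.Object) and one direct subtype (B); the same holds for interface I. Class B has two direct supertypes (A and I) and no direct subtypes

To determine ancestor types (the direct supertypes, their supertypes, and so on), we can use transitive closure. For example, to find all ancestors of B in the example above, we can use the following query: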

import java  
  
from Class B  
where B.hasName("B")  
select B.getASupertype+()

Run on the example snippet above, this query returns A, I and java.lang.Object
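The same idea can be reproduced with plain Java reflection, which may help in seeing what the transitive closure does. This is only an analogy under my own naming: directSupertypes plays the role of getASupertype and allSupertypes the role of getASupertype+(). One difference to keep in mind is that reflection does not report java.lang.Object as a supertype of an interface, whereas the CodeQL model described above does.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class; all names here are illustrative assumptions.
final class SupertypeDemo {
    static class A {}
    interface I {}
    static class B extends A implements I {}

    // Direct supertypes as seen by reflection: the superclass, if any,
    // plus the directly implemented or extended interfaces.
    static Set<Class<?>> directSupertypes(Class<?> type) {
        Set<Class<?>> result = new LinkedHashSet<>();
        if (type.getSuperclass() != null) {
            result.add(type.getSuperclass());
        }
        for (Class<?> iface : type.getInterfaces()) {
            result.add(iface);
        }
        return result;
    }

    // Transitive closure of directSupertypes (one or more steps).
    static Set<Class<?>> allSupertypes(Class<?> type) {
        Set<Class<?>> result = new LinkedHashSet<>();
        for (Class<?> direct : directSupertypes(type)) {
            if (result.add(direct)) {
                result.addAll(allSupertypes(direct));
            }
        }
        return result;
    }
}
```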

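Beyond modeling the class hierarchy, RefType also provides the member predicate getAMember for accessing the members declared in a type (that is, fields, constructors and methods), and the predicate inherits(Method m) for checking whether the type declares or inherits the method m

Example: finding problematic array casts

As an example of using the class hierarchy API, we can write a query that finds downcasts on arrays, that is, cases where an expression e of some type A[] is converted to type B[], where B is a (not necessarily direct) subtype of A

This kind of cast is problematic because downcasting an array can cause a runtime exception even if every individual array element could be downcast. For example, the following code throws a ClassCastException:

Object[] o = new Object[] { "Hello", "world" };
String[] s = (String[])o;

If, on the other hand, the expression e happens to evaluate to a B[] array, the cast succeeds:

Object[] o = new String[] { "Hello", "world" };
String[] s = (String[])o;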
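The two cases can be checked side by side with a small helper. This is a sketch with names of my own choosing: downcastFails reports whether casting a given Object[] reference to String[] throws, which depends on the runtime type of the array and not on its elements.

```java
// Hypothetical demo class; all names here are illustrative assumptions.
final class ArrayDowncastDemo {
    // Returns true if casting the array to String[] throws ClassCastException.
    static boolean downcastFails(Object[] array) {
        try {
            String[] s = (String[]) array;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Runtime type Object[]: the cast fails although every element is a String.
        System.out.println(downcastFails(new Object[] { "Hello", "world" }));
        // Runtime type String[]: the cast succeeds.
        System.out.println(downcastFails(new String[] { "Hello", "world" }));
    }
}
```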

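In this article we do not try to distinguish these two cases. Our query simply looks for cast expressions ce that convert from a type source to a type target such that:

source and target are both array types
the element type of source is a transitive supertype of the element type of target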

import java  
  
from CastExpr ce, Array source, Array target  
where source = ce.getExpr().getType() and  
    target = ce.getType() and  
    target.getElementType().(RefType).getASupertype+() = source.getElementType()  
select ce, "Potentially problematic array downcast."

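Note that by casting target.getElementType() to RefType we exclude all cases where the element type is a primitive type, that is, where target is an array of primitives: the problem we are looking for cannot arise there. Unlike in Java, a cast in QL never fails: if an expression cannot be cast to the desired type, it is simply excluded from the query results, which is exactly what we want

Improvements

Running this query on old Java code from before version 5 often returns many false positives arising from uses of the method Collection.toArray(T[]), which converts a collection into an array of type T[]

In code that does not use generics, this method is typically used as follows:

List l = new ArrayList();  
// add some elements of type A to l  
A[] as = (A[])l.toArray(new A[0]);

In this code l has the raw type List, so l.toArray has return type Object[], independent of the type of its argument array. The cast from Object[] to A[] is therefore flagged as problematic by our query, even though at run time it can never go wrong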
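A runnable version of this pattern, with String standing in for A and a hypothetical class name of my own, shows why the cast is safe: the static return type of toArray on a raw List is Object[], but the array actually returned has the runtime type of the argument array.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; all names here are illustrative assumptions.
final class RawToArrayDemo {
    @SuppressWarnings({"rawtypes", "unchecked"})
    static String[] copyToArray() {
        List l = new ArrayList(); // raw type, as in pre-generics code
        l.add("Hello");
        l.add("world");
        // Statically Object[], but at run time the result is a String[].
        return (String[]) l.toArray(new String[0]);
    }
}
```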

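To identify these cases, we can create two CodeQL classes that represent, respectively, the method Collection.toArray and calls to this method or to any method that overrides it: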

/** class representing java.util.Collection.toArray(T[]) */  
class CollectionToArray extends Method {  
    CollectionToArray() {  
        this.getDeclaringType().hasQualifiedName("java.util", "Collection") and  
        this.hasName("toArray") and  
        this.getNumberOfParameters() = 1  
    }  
}  
  
/** class representing calls to java.util.Collection.toArray(T[]) */  
class CollectionToArrayCall extends MethodAccess {  
    CollectionToArrayCall() {  
        exists(CollectionToArray m |  
            this.getMethod().getSourceDeclaration().overridesOrInstantiates*(m)  
        )  
    }  
  
    /** the call's actual return type, as determined from its argument */  
    Array getActualReturnType() {  
        result = this.getArgument(0).getType()  
    }  
}

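Note the use of getSourceDeclaration and overridesOrInstantiates in the characteristic predicate of CollectionToArrayCall: we want to find calls to the method Collection.toArray and to any method that overrides it, as well as any parameterized instances of these methods. In the example above, for instance, l.toArray resolves to the method toArray in the raw class ArrayList. Its source declaration is toArray in the generic class ArrayList<T>, which overrides AbstractCollection<T>.toArray, which in turn overrides Collection<T>.toArray, which is an instantiation of Collection.toArray (because the type parameter T in the overridden method belongs to ArrayList and is an instantiation of the type parameter belonging to Collection)

Using these new classes, we can extend the query to exclude the false positives: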

import java  
  
// Insert the class definitions from above  
  
from CastExpr ce, Array source, Array target  
where source = ce.getExpr().getType() and  
    target = ce.getType() and  
    target.getElementType().(RefType).getASupertype+() = source.getElementType() and  
    not ce.getExpr().(CollectionToArrayCall).getActualReturnType() = target  
select ce, "Potentially problematic array downcast."

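Example: finding mismatched contains checks

We will now write a query that finds uses of Collection.contains where the type of the queried element is unrelated to the element type of the collection

For example, Apache Zookeeper used to have a snippet similar to the following in its class QuorumPeerConfig: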

Map<Object, Object> zkProp;  
  
// ...  
  
if (zkProp.entrySet().contains("dynamicConfigFile")){  
    // ...  
}

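Since zkProp is a map from Object to Object, zkProp.entrySet returns a collection of type Set<Entry<Object, Object>>. Such a set cannot possibly contain an element of type String (the code has since been fixed to use zkProp.containsKey)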
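The effect is easy to reproduce. The sketch below uses a HashMap and a made-up property value, and the class and method names are my own: the buggy check is false even though the key is present, while the containsKey version behaves as intended.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; names and the sample value are illustrative assumptions.
final class ContainsMismatchDemo {
    static Map<Object, Object> sampleProperties() {
        Map<Object, Object> zkProp = new HashMap<>();
        zkProp.put("dynamicConfigFile", "zoo.cfg.dynamic");
        return zkProp;
    }

    // Compares a String against Map.Entry elements, so it can never be true.
    static boolean buggyCheck(Map<Object, Object> zkProp) {
        return zkProp.entrySet().contains("dynamicConfigFile");
    }

    // Looks the key up directly.
    static boolean fixedCheck(Map<Object, Object> zkProp) {
        return zkProp.containsKey("dynamicConfigFile");
    }
}
```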

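In general, we want to find calls to Collection.contains (or to any method overriding it in any parameterized instance of Collection) where the type E of the collection elements and the type A of the argument to contains are unrelated, that is, they have no common subtype

We start by creating a class that describes java.util.Collection: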

class JavaUtilCollection extends GenericInterface {  
    JavaUtilCollection() {  
        this.hasQualifiedName("java.util", "Collection")  
    }  
}

To make sure there are no mistakes, we can run a simple test query:

from JavaUtilCollection juc
select juc

This query should return exactly one result

Next we create a class that describes java.util.Collection.contains:

class JavaUtilCollectionContains extends Method {  
    JavaUtilCollectionContains() {  
        this.getDeclaringType() instanceof JavaUtilCollection and  
        this.hasStringSignature("contains(Object)")  
    }  
}

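Here hasStringSignature is used to check that:

the method is named contains
it has exactly one parameter
the type of that parameter is Object

Alternatively, these three checks can be implemented separately with hasName, getNumberOfParameters, and getParameter(0).getType() instanceof TypeObject

Now we want to identify all calls to Collection.contains, including any method that overrides it, and taking into account all parameterized instances of Collection and its subtypes. We write: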

class JavaUtilCollectionContainsCall extends MethodAccess {
    JavaUtilCollectionContainsCall() {
        exists(JavaUtilCollectionContains jucc |
            this.getMethod().getSourceDeclaration().overrides*(jucc)
        )
    }
}

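For each call to contains, we are interested in the type of the argument and in the element type of the collection it is called on. So we need to add two member predicates, getArgumentType and getCollectionElementType, to the class JavaUtilCollectionContainsCall

The former is easy: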

Type getArgumentType() {  
    result = this.getArgument(0).getType()  
}

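For the latter, we proceed as follows:

find the declaring type D of the contains method being called
find a supertype S of D (or D itself) that is a parameterized instance of java.util.Collection
return the type argument of S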

Type getCollectionElementType() {  
    exists(RefType D, ParameterizedInterface S |  
        D = this.getMethod().getDeclaringType() and  
        D.hasSupertype*(S) and S.getSourceDeclaration() instanceof JavaUtilCollection and  
        result = S.getTypeArgument(0)  
    )  
}

After adding these two member predicates to JavaUtilCollectionContainsCall, we also need to write a predicate that checks whether two given reference types have a common subtype:

predicate haveCommonDescendant(RefType tp1, RefType tp2) {  
    exists(RefType commondesc | commondesc.hasSupertype*(tp1) and commondesc.hasSupertype*(tp2))  
}

Now we can write a first version of the query

import java

// Insert the class definitions from above

from JavaUtilCollectionContainsCall juccc, Type collEltType, Type argType
where collEltType = juccc.getCollectionElementType() and argType = juccc.getArgumentType() and
    not haveCommonDescendant(collEltType, argType)
select juccc, "Element type " + collEltType + " is incompatible with argument type " + argType

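Improvements

For many programs this query produces a large number of false positives because of type variables and wildcards: if the collection element type is some type variable E and the argument type is String, for example, CodeQL will consider the two to have no common subtype, and our query will flag the call. A simple way to exclude such false positives is to require that neither collEltType nor argType is an instance of TypeVariable

Another source of false positives is autoboxing of primitive types: if, for example, the element type of the collection is Integer and the argument has type int, the predicate haveCommonDescendant will fail, because int is not a RefType. To account for this, our query should check that collEltType is not the boxed type of argType

Finally, null is special because its type (called <nulltype> in the CodeQL library) is compatible with every reference type, so we should exclude it from consideration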
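Two of these false positive sources are easy to show in plain Java. In the hypothetical snippet below (names are mine), both contains calls are perfectly legitimate, yet the first has an argument of static type int and the second passes the null literal, which is why the query needs the extra exclusions.

```java
import java.util.List;

// Hypothetical demo class; all names here are illustrative assumptions.
final class ContainsBoxingDemo {
    // The argument has primitive type int and is boxed to Integer at the call.
    static boolean containsInt(List<Integer> values, int candidate) {
        return values.contains(candidate);
    }

    // The argument is the null literal, which is compatible with any reference type.
    static boolean containsNull(List<String> values) {
        return values.contains(null);
    }
}
```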

With these three improvements, our final query is:

import java  
  
// Insert the class definitions from above  
  
from JavaUtilCollectionContainsCall juccc, Type collEltType, Type argType  
where collEltType = juccc.getCollectionElementType() and argType = juccc.getArgumentType() and  
    not haveCommonDescendant(collEltType, argType) and  
    not collEltType instanceof TypeVariable and not argType instanceof TypeVariable and  
    not collEltType = argType.(PrimitiveType).getBoxedType() and  
    not argType.hasName("<nulltype>")  
select juccc, "Element type " + collEltType + " is incompatible with argument type " + argType