Writing YARA Rules Efficiently with yarGen - 天翼云 (Tianyi Cloud) Developer Community

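0x00 Introduction

YARA is a tool, originally open-sourced by VirusTotal, that identifies malicious samples by their signatures. It consists of the YARA scan engine and YARA rules. A YARA rule is a text file made up of a set of strings and a boolean expression that determines whether the rule matches. A rule generally contains three parts:

Metadata (meta): describes basic information about the rule, such as its name, description, and author.
Strings (strings): defines the text strings or hexadecimal strings to match. Each one is declared with an identifier that begins with "$".
Condition (condition): the boolean expression that determines the rule's logic. It refers to the identifiers declared in the strings section.
For a detailed description of each part, see the official documentation, Writing YARA rules.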
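
To make the three-part layout concrete, here is a minimal Python sketch that renders a rule as text from its meta, strings, and condition sections. The helper name build_rule, the rule name, and the example strings are all hypothetical; the sketch assumes plain text strings without quotes or backslashes and does not validate YARA syntax.

```python
def build_rule(name, meta, strings, condition):
    """Render a minimal YARA rule as text from its three sections.

    Assumes every value is a plain text string with no quotes or
    backslashes; this is an illustration, not a YARA syntax checker.
    """
    lines = [f"rule {name} {{", "    meta:"]
    for key, value in meta.items():
        lines.append(f'        {key} = "{value}"')
    lines.append("    strings:")
    for identifier, value in strings.items():
        # String identifiers are declared with a leading "$".
        if not identifier.startswith("$"):
            raise ValueError(f"identifier must start with '$': {identifier}")
        lines.append(f'        {identifier} = "{value}"')
    lines.append("    condition:")
    lines.append(f"        {condition}")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical example values, chosen only to show the layout.
rule_text = build_rule(
    name="Example_Rule",
    meta={"description": "Illustrative rule", "author": "example"},
    strings={"$s1": "example string one", "$s2": "example string two"},
    condition="any of them",
)
print(rule_text)
```

The condition "any of them" refers to the identifiers declared in the strings section, which is the relationship the list above describes.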

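yarGen is an open-source tool developed by Florian Roth that analyzes malware samples and generates YARA rules automatically, reducing the workload of security researchers. The tool was created in 2015 and is still under active development. Similar tools include yaraGenerator and yabin. The paper Evaluating Automatically Generated YARA Rules and Enhancing Their Effectiveness compared the three tools, and yarGen came out clearly ahead. Its author, Florian Roth, is head of research and development at Nextron Systems and has also created well-known projects such as Loki and Sigma.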

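0x01 YARA Rule Evaluation Metrics

Evaluating a YARA rule thoroughly is fairly involved. Besides the usual binary-classification metrics such as precision and recall, it also involves efficiency, rule complexity, readability, and more. Specifically:

Precision: the rule matches the target malware samples accurately, without false positives on benign samples. When writing a rule, choose signatures that are as specific to the sample as possible and avoid signatures that also appear in benign files. yarGen uses a blacklist approach to avoid strings and opcodes that occur in benign samples.
Recall: the rule covers as many variants as possible and matches as many target malware samples as possible. A rule that detects only one specific sample is no better than a hash lookup, so a rule also needs some generality: it should match unknown variants as well as known samples, should not rely too heavily on particular strings, and can use opcodes that are hard to modify as signatures.
Rule quality: the rule follows the writing conventions, uses sensible variable names and conditions, and has no syntax errors.
Rule complexity: the condition should be neither too simple nor too complex. Simple rules are prone to false positives; complex rules hurt efficiency.
Rule update frequency: rules must be updated promptly to keep up with changes in malware.
Number of rules: the rule set should be moderate in size. Too few rules cannot meet detection requirements; too many hurt efficiency.
Detection time: the rule runs efficiently and produces a result quickly.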
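
The first two metrics can be computed directly from scan counts. The following is a minimal sketch using the standard definitions of precision and recall; the function name and the example counts are hypothetical and do not come from any real test run.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from raw scan counts.

    true_positives:  target malware samples the rule matched
    false_positives: benign samples the rule matched by mistake
    false_negatives: target malware samples the rule missed
    """
    matched = true_positives + false_positives
    targets = true_positives + false_negatives
    precision = true_positives / matched if matched else 0.0
    recall = true_positives / targets if targets else 0.0
    return precision, recall


# Hypothetical counts: the rule matched 90 of 100 family samples
# and also matched 10 benign files.
precision, recall = precision_recall(
    true_positives=90, false_positives=10, false_negatives=10
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```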

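Precision and recall affect each other. Ideally both would be high, but in practice they constrain one another: pushing precision up tends to lower recall, and pushing recall up usually costs precision. Signatures therefore have to be chosen according to the characteristics of the malware family. YARA rules are well suited to detecting malicious samples that contain known signatures and can cover most types of malware, especially samples that are not open source or that contain specific IOCs, such as hacking tools and samples that carry exploit code for a particular vulnerability.
For a YARA rule to work at its best, the scan engine has to cooperate as well: besides scanning efficiently, it should keep the rules confidential, so that they do not leak and the bar for evasion stays high.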

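0x02 Using yarGen

Usage

yarGen is very simple to use. Generating your first YARA rule takes only four steps:

Clone the project: git clone https://github.com/Neo23x0/yarGen
Install the dependencies: pip3 install -r requirements.txt
Download and update the built-in databases: python yarGen.py --update
Generate the YARA rules: python yarGen.py -m PATH_TO_MALWARE_DIRECTORY (by default the rules are written to a file named yargen_rules.yar)
This is the simplest way to use the tool, with every parameter left at its default. The parameters can also be customized: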

usage: yarGen.py [-h] [-m M] [-y min-size] [-z min-score] [-x high-scoring] [-w superrule-overlap] [-s max-size]
                 [-rc maxstrings] [--excludegood] [-o output_rule_file] [-e output_dir_strings] [-a author] [-r ref]
                 [-l lic] [-p prefix] [-b identifier] [--score] [--strings] [--nosimple] [--nomagic] [--nofilesize]
                 [-fm FM] [--globalrule] [--nosuper] [--update] [-g G] [-u] [-c] [-i I] [--dropzone] [--nr] [--oe]
                 [-fs size-in-MB] [--noextras] [--ai] [--debug] [--trace] [--opcodes] [-n opcode-num]

yarGen

optional arguments:
  -h, --help            show this help message and exit

Rule Creation:
  -m M                  Path to scan for malware
  -y min-size           Minimum string length to consider (default=8)
  -z min-score          Minimum score to consider (default=0)
  -x high-scoring       Score required to set string as 'highly specific string' (default: 30)
  -w superrule-overlap  Minimum number of strings that overlap to create a super rule (default: 5)
  -s max-size           Maximum length to consider (default=128)
  -rc maxstrings        Maximum number of strings per rule (default=20, intelligent filtering will be applied)
  --excludegood         Force the exclude all goodware strings

Rule Output:
  -o output_rule_file   Output rule file
  -e output_dir_strings
                        Output directory for string exports
  -a author             Author Name
  -r ref                Reference (can be string or text file)
  -l lic                License
  -p prefix             Prefix for the rule description
  -b identifier         Text file from which the identifier is read (default: last folder name in the full path, e.g.
                        "myRAT" if -m points to /mnt/mal/myRAT)
  --score               Show the string scores as comments in the rules
  --strings             Show the string scores as comments in the rules
  --nosimple            Skip simple rule creation for files included in super rules
  --nomagic             Don't include the magic header condition statement
  --nofilesize          Don't include the filesize condition statement
  -fm FM                Multiplier for the maximum 'filesize' condition value (default: 3)
  --globalrule          Create global rules (improved rule set speed)
  --nosuper             Don't try to create super rules that match against various files

Database Operations:
  --update              Update the local strings and opcodes dbs from the online repository
  -g G                  Path to scan for goodware (dont use the database shipped with yaraGen)
  -u                    Update local standard goodware database with a new analysis result (used with -g)
  -c                    Create new local goodware database (use with -g and optionally -i "identifier")
  -i I                  Specify an identifier for the newly created databases (good-strings-identifier.db,
                        good-opcodes-identifier.db)

General Options:
  --dropzone            Dropzone mode - monitors a directory [-m] for new samples to process WARNING: Processed files
                        will be deleted!
  --nr                  Do not recursively scan directories
  --oe                  Only scan executable extensions EXE, DLL, ASP, JSP, PHP, BIN, INFECTED
  -fs size-in-MB        Max file size in MB to analyze (default=10)
  --noextras            Don't use extras like Imphash or PE header specifics
  --ai                  Create output to be used as ChatGPT4 input
  --debug               Debug output
  --trace               Trace output

Other Features:
  --opcodes             Do use the OpCode feature (use this if not enough high scoring strings can be found)
  -n opcode-num         Number of opcodes to add if not enough high scoring string could be found (default=3)

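- -m specifies the directory that holds the malware samples. Usually you collect several samples from the same malware family.
- --excludegood: by default yarGen does not completely exclude strings that also appear in goodware; it relies on the condition statement to avoid false positives instead. Specify --excludegood to force all goodware strings to be excluded.
- --nosimple: yarGen usually generates several rules. A rule that detects only a single sample is a simple rule, and a rule that matches several samples at once is a super rule. With --nosimple, no simple rule is created for files that are already covered by a super rule.
- --score: yarGen calculates a score for every candidate string. The higher the score, the more likely it is that a file containing the string is malware. With --score, the scores are shown as comments in the generated rules.
- --nomagic omits the file-format magic header check from the generated condition. This is typically used when generating YARA signatures for memory scanning.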
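
When the same options are used repeatedly, it can help to assemble the command line in a small script. The sketch below only builds the argument list and prints it; it uses flags that appear in the usage text (-m, -o, --excludegood, --nosimple, --score, --nomagic). The helper name build_yargen_command, its keyword arguments, and the output file name are hypothetical, and the sketch assumes it would be run from the yarGen checkout directory.

```python
import shlex


def build_yargen_command(malware_dir, output_file=None, exclude_good=False,
                         no_simple=False, show_scores=False, no_magic=False):
    """Assemble a yarGen command line as a list of arguments.

    Only flags listed in yarGen's usage text are emitted. Nothing is
    executed here; pass the result to subprocess.run() if desired.
    """
    command = ["python", "yarGen.py", "-m", malware_dir]
    if output_file:
        command += ["-o", output_file]
    if exclude_good:
        command.append("--excludegood")
    if no_simple:
        command.append("--nosimple")
    if show_scores:
        command.append("--score")
    if no_magic:
        command.append("--nomagic")
    return command


# Hypothetical invocation; the sample directory mirrors the path used
# as an example in the help text.
command = build_yargen_command(
    "/mnt/mal/myRAT", output_file="myRAT.yar", no_simple=True, show_scores=True
)
print(shlex.join(command))
```

Building a list instead of a single string avoids quoting problems when the sample directory contains spaces.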

### Understanding the generated rules

Below is one of the YARA rules that yarGen generated for samples of the mimikatz family:

rule _mimikatz_full_x86_mimikatz_max_x86_mimikatz_min_x86_3 {
    meta:
        description = " - from files mimikatz-full.x86.dll, mimikatz-max.x86.dll, mimikatz-min.x86.dll"
        author = "yarGen Rule Generator"
        reference = ""
        date = "2023-11-17"
        hash1 = "9ba86ae2808fe8df76a52001ef765b5ad3216447d0c0148dc719c6b9527c0e2d"
        hash2 = "4cee15302a5e78ca9221c1fa2206e7bf97322fdf40580dc2df506901c8ba5c61"
        hash3 = "8ebe20638b2a474870cc0a3a3286ebe6a4b5062e24600ff0ea9de6af16548ee5"
    strings:
        $x1 = "ERROR kuhl_m_lsadump_dcsync ; kull_m_rpc_drsr_ProcessGetNCChangesReply" fullword wide /* score: '37.00'*/
        $x2 = "ERROR kull_m_rpc_drsr_ProcessGetNCChangesReply_decrypt ; Checksums don't match (C:0x%08x - R:0x%08x)" fullword wide /* score: '33.00'*/
        $s3 = "ERROR kull_m_rpc_drsr_ProcessGetNCChangesReply_decrypt ; No Session Key" fullword wide /* score: '28.00'*/
        $s4 = "ERROR kuhl_m_sekurlsa_acquireLSA ; Minidump pInfos->ProcessorArchitecture (%u) != PROCESSOR_ARCHITECTURE_INTEL (%u)" fullword wide /* score: '28.00'*/
        $s5 = "ERROR kuhl_m_lsadump_dcsync ; DRSGetNCChanges, invalid dwOutVersion (%u) and/or cNumObjects (%u)" fullword wide /* score: '26.00'*/
        $s6 = "ERROR kuhl_m_lsadump_dcsync ; GetNCChanges: 0x%08x (%u)" fullword wide /* score: '26.00'*/
        $s7 = "ERROR kull_m_rpc_drsr_ProcessGetNCChangesReply_decrypt ; RtlDecryptData2" fullword wide /* score: '25.00'*/
        $s8 = "ERROR kull_m_rpc_drsr_ProcessGetNCChangesReply_decrypt ; No valid data" fullword wide /* score: '25.00'*/
        $s9 = "ERROR kuhl_m_lsadump_dcsync ; Missing user or guid argument" fullword wide /* score: '24.00'*/
        $s10 = "ERROR kull_m_crypto_NCryptGetProperty ; NCryptGetProperty(%s) - simple DWORD: 0x%08x" fullword wide /* score: '23.00'*/
        $s11 = "ERROR kull_m_crypto_NCryptGetProperty ; NCryptGetProperty(%s) - simple NCRYPT_HANDLE: 0x%08x" fullword wide /* score: '23.00'*/
        $s12 = "ERROR kull_m_crypto_NCryptGetProperty ; NCryptGetProperty(%s) - data: 0x%08x" fullword wide /* score: '23.00'*/
        $s13 = "ERROR kull_m_crypto_NCryptGetProperty ; NCryptGetProperty(%s) - init: 0x%08x" fullword wide /* score: '23.00'*/
        $s14 = "ERROR kull_m_rpc_drsr_ProcessGetNCChangesReply ; Unable to MakeAttid for %S" fullword wide /* score: '23.00'*/
        $s15 = "ERROR kuhl_m_lsadump_dcsync ; Domain Controller not present" fullword wide /* score: '23.00'*/
        $s16 = "ERROR kuhl_m_lsadump_dcsync_decrypt ; RtlDecryptNtOwfPwdWithIndex/RtlDecryptLmOwfPwdWithIndex" fullword wide /* score: '23.00'*/
        $s17 = "ERROR kuhl_m_lsadump_dcsync_descrObject_csv ; RtlDecryptNtOwfPwdWithIndex" fullword wide /* score: '23.00'*/
        $s18 = "lsadump" fullword wide /* score: '22.00'*/
        $s19 = "ERROR kuhl_m_sekurlsa_acquireLSA ; LSASS process not found (?)" fullword wide /* score: '22.00'*/
        $s20 = "ERROR kull_m_rpc_drsr_ProcessGetNCChangesReply_decrypt ; Unable to calculate CRC32" fullword wide /* score: '21.00'*/
    condition:
        ( uint16(0) == 0x5a4d and filesize < 3000KB and ( 1 of ($x*) and 4 of them )
        ) or ( all of them )
}
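
To make the condition at the end of the rule easier to read, here is a minimal Python sketch of its logic. It is not how the YARA engine evaluates rules: the function name rule_condition and the idea of passing in a precomputed set of matched string identifiers are assumptions made for illustration.

```python
import struct

# The 20 identifiers declared in the rule: $x1, $x2, $s3 ... $s20.
ALL_IDENTIFIERS = {"$x1", "$x2"} | {f"$s{i}" for i in range(3, 21)}


def rule_condition(data, matched):
    """Mirror the rule's condition for a file's bytes and its matched strings.

    data:    the raw file content
    matched: set of string identifiers that were found in the file
    """
    # uint16(0) == 0x5a4d: the first two bytes, read little-endian, are "MZ".
    has_mz_header = len(data) >= 2 and struct.unpack_from("<H", data, 0)[0] == 0x5A4D
    # filesize < 3000KB (YARA's KB suffix multiplies by 1024).
    small_enough = len(data) < 3000 * 1024
    # 1 of ($x*): at least one highly specific string matched.
    one_of_x = any(name.startswith("$x") for name in matched)
    # 4 of them: at least four strings of any kind matched.
    four_of_them = len(matched & ALL_IDENTIFIERS) >= 4
    first_branch = has_mz_header and small_enough and one_of_x and four_of_them
    # all of them: every declared string matched, regardless of header or size.
    second_branch = matched >= ALL_IDENTIFIERS
    return first_branch or second_branch


fake_pe = b"MZ" + b"\x00" * 100
print(rule_condition(fake_pe, {"$x1", "$s3", "$s4", "$s5"}))
print(rule_condition(fake_pe, {"$s3", "$s4", "$s5", "$s6"}))
```

The first call satisfies the first branch (MZ header, small file, one $x string, four strings in total). The second call has four matches but no $x string, so neither branch holds.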

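Strings in the strings section are grouped by identifier prefix:

$x: highly specific strings. These are very specific strings that do not appear in legitimate software. They may include malicious server addresses, names of hacking tools and malware, hacking tool output, and misspellings of common strings. For example, malware that tries to pass itself off as legitimate software sometimes contains misspelled words such as "Micorsoft" or "Monnitor".
$s: strings with an ordinary score.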
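
The split between the two prefixes can be sketched from the scores. The snippet below assumes that a string becomes a $x string when its score exceeds the -x threshold (default 30 in the usage text) and that identifiers are numbered in descending score order, which is consistent with the rule shown above. The function name and the exact comparison are assumptions, not yarGen's actual implementation.

```python
HIGH_SCORING_THRESHOLD = 30  # default of yarGen's -x option per its help text


def assign_identifiers(scored_strings, threshold=HIGH_SCORING_THRESHOLD):
    """Name strings $x<n> or $s<n> by score, highest score first.

    scored_strings: iterable of (text, score) pairs.
    Assumption: a score strictly above the threshold marks a highly
    specific string; yarGen's real comparison may differ.
    """
    ranked = sorted(scored_strings, key=lambda item: item[1], reverse=True)
    named = []
    for index, (text, score) in enumerate(ranked, start=1):
        prefix = "$x" if score > threshold else "$s"
        named.append((f"{prefix}{index}", text, score))
    return named


# Three strings and scores taken from the rule above.
named = assign_identifiers([
    ("lsadump", 22.0),
    ("ERROR kuhl_m_lsadump_dcsync ; kull_m_rpc_drsr_ProcessGetNCChangesReply", 37.0),
    ("ERROR kull_m_rpc_drsr_ProcessGetNCChangesReply_decrypt ; No Session Key", 28.0),
])
for identifier, text, score in named:
    print(identifier, score, text)
```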

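Testing the results

I have not yet found a good testing method that covers all of the metrics above, but some of them can be tested with the following tools.

YARA-CI

YARA-CI is a service developed by VirusTotal. I have not used it in practice; according to its documentation, the YARA rules have to be uploaded to GitHub. It mainly tests for false positives, and its test corpus includes more than one million known-good samples published by the NSRL.

yarAnalyzer

yarAnalyzer was developed by the same author as yarGen. It can be used to measure the detection rate of YARA rules and makes it easy to tally the results, but you have to supply your own test sample set. Sample sets can be downloaded from VirusTotal; if you do not have VirusTotal access, you can try downloading samples from MalwareBazaar.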
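
The detection rate itself is simple to tally once scan results are available as (sample, rule) pairs. The sketch below is not yarAnalyzer's output format or API; the function name, the input shape, and the file and rule names are hypothetical.

```python
from collections import defaultdict


def detection_summary(sample_names, matches):
    """Summarize scan results for a test set.

    sample_names: names of all samples in the test set
    matches:      iterable of (sample_name, rule_name) pairs, one per hit
    Returns (overall detection rate, {rule_name: number of samples hit}).
    """
    samples = set(sample_names)
    samples_by_rule = defaultdict(set)
    detected = set()
    for sample, rule in matches:
        if sample in samples:  # ignore hits on files outside the test set
            samples_by_rule[rule].add(sample)
            detected.add(sample)
    rate = len(detected) / len(samples) if samples else 0.0
    per_rule = {rule: len(hit) for rule, hit in samples_by_rule.items()}
    return rate, per_rule


# Hypothetical test set of four samples and three rule hits.
rate, per_rule = detection_summary(
    ["a.dll", "b.dll", "c.dll", "d.dll"],
    [("a.dll", "rule_1"), ("b.dll", "rule_1"), ("b.dll", "rule_2")],
)
print(f"detection rate: {rate:.0%}", per_rule)
```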

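0x03 Summary

Using yarGen can greatly improve the efficiency with which malware analysts write YARA rules. The tool itself and the blacklist databases it uses are updated continually, and it extracts signature strings very effectively. Its opcode-based signature extraction, however, seems mediocre in my experience, probably because Binarly went offline and no good opcode database is available.

0x04 References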

awesome-yara
Evaluating Automatically Generated YARA Rules and Enhancing Their Effectiveness
YARA-Style-Guide
yarAnalyzer
