Semgrep has an experimental and (IMO) more readable rule syntax. I am converting my own reference into a tutorial.
Disclaimer: Semgrep (binary, playground, cloud, etc.) already supports the experimental syntax, but it has not been officially released. If you're from the future and things have changed, please let me know, e.g., open an issue or create a pull request in the blog's source at parsiya/parsiya.net.
Use these tables:
| Old | Experimental |
|---|---|
| patterns (top-level) | match and all |
| patterns (other) | all |
| pattern | [can be removed] |
| pattern-not | - not |
| pattern-either | any |
| pattern-inside | inside |
| pattern-not-inside | inside under not |
These items go inside a where clause:
| Old | Experimental |
|---|---|
| metavariable-pattern | metavariable and pattern |
| metavariable-regex | metavariable and regex |
| metavariable-comparison | metavariable and comparison |
| metavariable-analysis | metavariable and analyzer |
| focus-metavariable | focus |
Taint mode changes
| Old | Experimental |
|---|---|
| mode:taint | removed |
| match (taint mode) | taint |
| pattern-sources | sources |
| pattern-sinks | sinks |
| pattern-propagators | propagators |
| pattern-sanitizers | sanitizers |
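To make the taint table concrete, here is a minimal sketch of what a converted taint rule might look like. The rule id and the function names (`get_user_input`, `run_query`, `escape`) are hypothetical, and I have not run this exact rule; treat it as an illustration of the key renames rather than a tested rule.

```yaml
rules:
  - id: hypothetical-taint-rule-new-syntax
    languages: [python]
    severity: WARNING
    message: _removed_
    # `mode: taint` is gone; `taint` takes the place of `match`.
    taint:
      # pattern-sources
      sources:
        - get_user_input(...)
      # pattern-sinks
      sinks:
        - run_query(...)
      # pattern-sanitizers
      sanitizers:
        - escape(...)
```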
I've only been able to find two references so far:
Modified version of the first example in the Advanced Rule Tutorials, practice playground link.
rules:
  - id: blog-2023-10-use-decimalfield-for-money-old
    patterns:
      # I know this `patterns` can be replaced by one `pattern`
      # but it's modified for the tutorial.
      - patterns:
          - pattern: $F = django.db.models.FloatField(...)
          - pattern: $F = django.db.models.FloatField(...)
      - pattern-inside: |
          class $M(...):
            ...
      - metavariable-regex:
          metavariable: '$F'
          regex: '.*(price|fee|salary).*'
    message: _removed_
    languages: [python]
    severity: ERROR
The top-level pattern or patterns becomes match. It's almost always
followed by all or any.
rules:
  - id: use-decimalfield-for-money-new-syntax
    # top-level patterns replaced by match and all.
    match:
      # the rest of the patterns
      # # I know this `patterns` can be replaced by one `pattern`
      # # but it's modified for the tutorial.
      # - patterns:
      #     - pattern: $F = django.db.models.FloatField(...)
      #     - pattern: $F = django.db.models.FloatField(...)
      # - pattern-inside: |
      #     class $M(...):
      #       ...
      # - metavariable-regex:
      #     metavariable: '$F'
      #     regex: '.*(price|fee|salary).*'
    message: _removed_
    languages: [python]
    severity: ERROR
Other patterns keys nested under the top-level one are replaced by
all. Our example has a redundant patterns with two identical children to
show how it will be modified.
Note that if we had a pattern-either here, we would use any.
rules:
  - id: use-decimalfield-for-money-new-syntax
    # top-level patterns replaced by match and all.
    match:
      all:
        # rest of the patterns
        - pattern: $F = django.db.models.FloatField(...)
        - pattern: $F = django.db.models.FloatField(...)
        # - pattern-inside: |
        #     class $M(...):
        #       ...
        # - metavariable-regex:
        #     metavariable: '$F'
        #     regex: '.*(price|fee|salary).*'
    message: _removed_
    languages: [python]
    severity: ERROR
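For the pattern-either case mentioned above, a minimal sketch at the same stage of the conversion could look like this. The rule id and the second pattern are made up for illustration and are not part of the original rule.

```yaml
rules:
  - id: hypothetical-either-new-syntax
    # a top-level pattern-either becomes match followed by any.
    match:
      any:
        - pattern: $F = django.db.models.FloatField(...)
        - pattern: $F = django.db.models.DecimalField(...)
    message: _removed_
    languages: [python]
    severity: ERROR
```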
The pattern keyword can be omitted. Replace pattern: [something] with just
- [something].
- pattern: [something]    --->    - [something]

- pattern: |              --->    - |
    [something]                       [something]
    [more lines]                      [more lines]
More changes:
rules:
  - id: use-decimalfield-for-money-new-syntax
    # top-level patterns replaced by match and all.
    match:
      all:
        # the rest of the patterns
        # I know this `patterns` can be replaced by one `pattern`
        # but it's modified for the tutorial.
        - $F = django.db.models.FloatField(...)
        - |
          $F = django.db.models.FloatField(...)
        # - pattern-inside: |
        #     class $M(...):
        #       ...
        # - metavariable-regex:
        #     metavariable: '$F'
        #     regex: '.*(price|fee|salary).*'
    message: _removed_
    languages: [python]
    severity: ERROR
There's one catch: if your pattern contains a colon (:), YAML might parse it
as a key/value separator and break the rule. Either use a bar (|) to move the
pattern to the next line or enclose it in quotes. See the explanation at 1:26
in the reference video.
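As a rough illustration of the colon problem, here is a fragment with a made-up Python pattern that contains `": "`. I use single quotes in the first fix only because the pattern itself contains double quotes.

```yaml
match:
  all:
    # BROKEN: YAML treats the ": " as a key/value separator, so this
    # item becomes a map instead of a pattern string.
    # - $D = {"password": $V}

    # Fix 1: enclose the whole pattern in quotes.
    - '$D = {"password": $V}'
    # Fix 2: use a bar and put the pattern on the next line.
    - |
      $D = {"password": $V}
```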
We don't have pattern-not in our current example, but converting it is similar to pattern.
- pattern-not: [something]    --->    - not: [something]

- pattern-not: |              --->    - not: |
    [something]                           [something]
    [more lines]                          [more lines]
pattern-inside becomes inside. Easy, peasy.
rules:
  - id: use-decimalfield-for-money-new-syntax
    # top-level patterns replaced by match and all.
    match:
      all:
        # the rest of the patterns
        # I know this `patterns` can be replaced by one `pattern`
        # but it's modified for the tutorial.
        - $F = django.db.models.FloatField(...)
        - |
          $F = django.db.models.FloatField(...)
        # pattern-inside
        - inside: |
            class $M(...):
              ...
        # - metavariable-regex:
        #     metavariable: '$F'
        #     regex: '.*(price|fee|salary).*'
    message: _removed_
    languages: [python]
    severity: ERROR
The where clause acts as a container for elements that add conditions to
metavariables. We will use metavariable-regex as an example:

- Add a where clause at the same level as all.
- metavariable-regex is also replaced with metavariable and regex.

rules:
  - id: use-decimalfield-for-money-new-syntax
    # top-level patterns replaced by match and all.
    match:
      all:
        # I know this `patterns` can be replaced by one `pattern`
        # but it's modified for the tutorial.
        - $F = django.db.models.FloatField(...)
        - |
          $F = django.db.models.FloatField(...)
        # pattern-inside
        - inside: |
            class $M(...):
              ...
      where:
        # metavariable-regex
        - metavariable: $F
          regex: '.*(price|fee|salary).*'
    message: _removed_
    languages: [python]
    severity: ERROR
See the final rule in the playground.
Other elements that appear under where have also been modified:

- metavariable-pattern
- metavariable-analysis
- metavariable-comparison
- focus-metavariable

We can use them like this:
rules:
  - id: sample-rule
    match:
      all:
        # removed
      where:
        # metavariable-regex
        - metavariable: $F
          regex: '.*(price|fee|salary).*'
        # metavariable-analysis
        - metavariable: $F
          analyzer: redos
        # focus-metavariable becomes `focus`
        - focus: $F
    message: _removed_
    languages: [python]
    severity: ERROR
metavariable-pattern is tricky because it can contain multiple patterns, but
it's similar to the patterns we've seen before.
where:
  # metavariable-pattern
  - metavariable: $F
    pattern: "some pattern"
  # if it had multiple patterns
  - metavariable: $F
    all:
      - "pattern1"
      - "pattern2"
This one is a C++ Hotspot rule that tracks when arrays are passed to functions. The complete rule is on GitHub and has a handy triage guide.
I will be using a partial version of the rule, playground link.
rules:
  - id: arrays-passed-to-functions-partial
    patterns:
      # a lot of ways to create an array
      - pattern-either:
          - pattern-inside: |
              $TYPE $BUF[$SIZE] = $EXPR;
              ...
          - pattern-inside: |
              $TYPE $BUF[$SIZE];
              ...
      # we don't want to flag these usages again
      - pattern-not-inside: free($BUF);
      - pattern-not-inside: delete($BUF);
      # exclude uppercase variables, these are usually constants
      - metavariable-regex:
          metavariable: $BUF
          regex: (?![A-Z0-9_]+\b)
      # flag if it's passed to a function
      - pattern: $FUNC(..., $BUF, ...);
    message: _removed_
    languages:
      - cpp
    severity: WARNING
The only new item here is pattern-not-inside.
rules:
  - id: arrays-passed-to-functions-partial
    match:
      all:
        # removed everything else
        - pattern-not-inside: free($BUF);
        - pattern-not-inside: delete($BUF);
First, we create a not and then add an inside under it. Also note that
inside goes on its own indented line under not, unlike - not: [pattern]
(from pattern-not), where the pattern sits on the same line.
rules:
  - id: arrays-passed-to-functions-partial
    match:
      all:
        # removed everything else
        - not:
            inside: free($BUF);
        - not:
            inside: delete($BUF);
I thought I could merge the two nots. You cannot. It's a map and if you add
two inside, you will get an error that keys must be unique.
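To spell out the duplicate-key problem, here is a small fragment. The commented-out alternative at the end (a single not wrapping an any) is an untested assumption on my part: it should be valid if not accepts any nested operator, but I have not verified it in the playground.

```yaml
match:
  all:
    # INVALID: `not` is a map, so the second `inside` is a duplicate key.
    # - not:
    #     inside: free($BUF);
    #     inside: delete($BUF);

    # Works: one `not` per excluded context.
    - not:
        inside: free($BUF);
    - not:
        inside: delete($BUF);

    # Untested assumption: one `not` around an `any` might also work.
    # - not:
    #     any:
    #       - inside: free($BUF);
    #       - inside: delete($BUF);
```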
any will act as OR.
- pattern-either:
    - pattern-inside: |
        $TYPE $BUF[$SIZE] = $EXPR;
        ...
    - pattern-inside: |
        $TYPE $BUF[$SIZE];
        ...
becomes:
- any:
    - inside: |
        $TYPE $BUF[$SIZE] = $EXPR;
        ...
    - inside: |
        $TYPE $BUF[$SIZE];
        ...
The rest is routine:

- patterns -> match and all.
- pattern-either -> any.
- pattern-not-inside -> not and inside.
- metavariable-regex -> metavariable and regex under where.
- pattern (the word) is just removed.

rules:
  - id: arrays-passed-to-functions-partial
    match:
      all:
        # a lot of ways to create an array
        - any:
            - inside: |
                $TYPE $BUF[$SIZE] = $EXPR;
                ...
            - inside: |
                $TYPE $BUF[$SIZE];
                ...
        # we don't want to flag these usages again
        - not:
            inside: free($BUF);
        - not:
            inside: delete($BUF);
        # flag if it's passed to a function
        - $FUNC(..., $BUF, ...);
      where:
        # exclude uppercase variables, these are usually constants
        - metavariable: $BUF
          regex: (?![A-Z0-9_]+\b)
    message: _removed_
    languages:
      - cpp
    severity: WARNING
The original rule and the one we created do not have the same matches. The original rule has three matches, playground link.
The modified rule only returns one match, playground link.
The reason is that constant propagation is on by default in the experimental syntax (at least for now). Credit: Cooper Pierce, Semgrep.
Adding an options key that disables constant propagation gives us the same matches as the old rule, playground link.
rules:
  - id: blah-blah
    options:
      constant_propagation: false
    match:
      all:
        # the rest of the rule
In the last example we will look at a complex metavariable-pattern rule from
Semgrep examples, playground link for practice.
rules:
  - id: blog-2023-10-open-redirect-old
    languages:
      - python
    message: Match found
    severity: WARNING
    patterns:
      - pattern-inside: |
          def $FUNC(...):
            ...
            return django.http.HttpResponseRedirect(..., $DATA, ...)
      - metavariable-pattern:
          metavariable: $DATA
          patterns:
            # patterns
Converting the outside patterns is easy.
rules:
  - id: blog-2023-10-open-redirect-new
    languages:
      - python
    message: Match found
    severity: WARNING
    match:
      all:
        - inside: |
            def $FUNC(...):
              ...
              return django.http.HttpResponseRedirect(..., $DATA, ...)
      where:
        - metavariable: $DATA
          patterns:
            # patterns
Now we do the same process for the inner patterns and replace patterns with
all (we don't need a match). Things can get complicated quickly. We have
three nested where clauses. One for the top metavariable-pattern, another
for the 2nd one, and the last one is for metavariable-regex.
The result is in this playground link.
rules:
  - id: blog-2023-10-open-redirect-new
    languages:
      - python
    message: Match found
    severity: WARNING
    match:
      all:
        - inside: |
            def $FUNC(...):
              ...
              return django.http.HttpResponseRedirect(..., $DATA, ...)
      where:
        - metavariable: $DATA
          all:
            - any:
                - $REQUEST
                - $STR.format(..., $REQUEST, ...)
                - $STR % $REQUEST
                - $STR + $REQUEST
                - f"...{$REQUEST}..."
          where:
            - metavariable: $REQUEST
              all:
                - any:
                    - request.$W
                    - request.$W.get(...)
                    - request.$W(...)
                    - request.$W[...]
              where:
                - metavariable: $W
                  regex: (?!get_full_path)
We learned to convert rules from the old Semgrep syntax to the experimental one. IMO, the experimental syntax is more readable. There are some inconsistencies, like the constant propagation behavior we saw above (and probably more), but they are not a big issue.