前几天看了浏览器解码看XSS,没有看得很明白,又找了这篇深入理解浏览器解析机制和XSS向量编码,翻译的文章,有些地方翻译的怪怪的,需要看下原文,啃了2天终于搞明白了
原文里给出了几个XSS Payload,也给出了答案和演示地址,有答案但没解析,下面一个个分析
有点像上学的时候,看书看不懂,做题不会做,看答案解析做题就懂了
<a href="%6a%61%76%61%73%63%72%69%70%74:%61%6c%65%72%74%28%31%29"></a>
URL encoded "javascript:alert(1)"
Answer: The javascript will NOT execute.
里面没有HTML编码内容,不考虑,其中href内部是URL,于是直接丢给URL模块处理,但是协议无法识别(即被编码的javascript:
),解码失败,不会被执行
URL规定协议,用户名,密码都必须是ASCII,编码当然就无效了
A URL’s scheme is an ASCII string that identifies the type of URL and can be used to dispatch a URL for further processing after parsing. It is initially the empty string.
A URL’s username is an ASCII string identifying a username. It is initially the empty string.
A URL’s password is an ASCII string identifying a password. It is initially the empty string.
<a href="javascript:%61%6c%65%72%74%28%32%29">
Character entity encoded "javascript" and URL encoded "alert(2)"
Answer: The javascript will execute.
先HTML解码,得到
<a href="javascript:%61%6c%65%72%74%28%32%29">
href中为URL,URL模块可识别为javascript
协议,进行URL解码,得到
<a href="javascript:alert(2)">
由于是javascript协议,解码完给JS模块处理,于是被执行
<a href="javascript%3aalert(3)"></a>
URL encoded ":"
Answer: The javascript will NOT execute.
同1,不解释
<div><img src=x onerror=alert(4)></div>
Character entity encoded < and >
Answer: The javascript will NOT execute.
这里包含了HTML编码内容,反过来以开发者的角度思考,HTML编码就是为了显示这些特殊字符,而不干扰正常的DOM解析,所以这里面的内容不会变成一个img元素,也不会被执行
从HTML解析机制看,在读取<div>
之后进入数据状态,<
会被HTML解码,但不会进入标签开始状态,当然也就不会创建img
元素,也就不会执行
<textarea><script>alert(5)</script></textarea>
Character entity encoded < and >
Answer: The javascript will NOT execute AND the character entities will NOT
be decoded either
<textarea>
是RCDATA
元素(RCDATA elements),可以容纳文本和字符引用,注意不能容纳其他元素,HTML解码得到
<textarea><script>alert(5)</script></textarea>
于是直接显示
RCDATA
元素(RCDATA elements)包括textarea
和title
<textarea><script>alert(6)</script></textarea>
Answer: The javascript will NOT execute.
同5,不解释
<button onclick="confirm('7');">Button</button>
Character entity encoded '
Answer: The javascript will execute.
这里onclick
中为标签的属性值(类比2中的href
),会被HTML解码,得到
<button onclick="confirm('7');">Button</button>
然后被执行
<button onclick="confirm('8\u0027);">Button</button>
Unicode escape sequence encoded '
Answer: The javascript will NOT execute.
onclick
中的值会交给JS处理,在JS中只有字符串和标识符能用Unicode表示,'
显然不行,JS执行失败
In string literals, regular expression literals, template literals and identifiers, any Unicode code point may also be expressed using Unicode escape sequences that explicitly express a code point's numeric value.
from https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-source-code (这个链接很卡)
标识符(identifiers)
代码中用来标识变量、函数、或属性的字符序列。
在JavaScript中,标识符只能包含字母或数字或下划线(“_”)或美元符号(“$”),且不能以数字开头。标识符与字符串不同之处在于字符串是数据,而标识符是代码的一部分。在 JavaScript 中,无法将标识符转换为字符串,但有时可以将字符串解析为标识符。from https://developer.mozilla.org/zh-CN/docs/Glossary/Identifier
<script>alert(9);</script>
Character entity encoded alert(9);
Answer: The javascript will NOT execute.
script
属于原始文本元素(Raw text elements),只可以容纳文本,注意没有字符引用,于是直接由JS处理,JS也认不出来,执行失败
原始文本元素(Raw text elements)有<script>
和<style>
<script>\u0061\u006c\u0065\u0072\u0074(10);</script>
Unicode Escape sequence encoded alert
Answer: The javascript will execute.
同8,函数名alert
属于标识符,直接被JS执行
<script>\u0061\u006c\u0065\u0072\u0074\u0028\u0031\u0031\u0029</script>
Unicode Escape sequence encoded alert(11)
Answer: The javascript will NOT execute.
同8,不解释
<script>\u0061\u006c\u0065\u0072\u0074(\u0031\u0032)</script>
Unicode Escape sequence encoded alert and 12
Answer: The javascript will NOT execute.
这里看似将没毛病,但是这里\u0031\u0032
在解码的时候会被解码为字符串12
,注意是字符串,不是数字,文字显然是需要引号的,JS执行失败
<script>alert('13\u0027)</script>
Unicode escape sequence encoded '
Answer: The javascript will NOT execute.
同8
<script>alert('14\u000a')</script>
Unicode escape sequence encoded line feed.
Answer: The javascript will execute.
\u000a
在JavaScript里是换行,就是\n
,直接执行
Java菜鸡才知道在Java里\u000a
是换行,相当于在源码里直接按一下回车键,后面的代码都换行了
ECMAScript differs from the Java programming language in the behaviour of Unicode escape sequences. In a Java program, if the Unicode escape sequence \u000A, for example, occurs within a single-line comment, it is interpreted as a line terminator (Unicode code point U+000A is LINE FEED (LF)) and therefore the next code point is not part of the comment. Similarly, if the Unicode escape sequence \u000A occurs within a string literal in a Java program, it is likewise interpreted as a line terminator, which is not allowed within a string literal—one must write \n instead of \u000A to cause a LINE FEED (LF) to be part of the String value of a string literal. In an ECMAScript program, a Unicode escape sequence occurring within a comment is never interpreted and therefore cannot contribute to termination of the comment. Similarly, a Unicode escape sequence occurring within a string literal in an ECMAScript program always contributes to the literal and is never interpreted as a line terminator or as a code point that might terminate the string literal.
from https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-source-code
<a href="javascript:%5c%75%30%30%36%31%5c%75%30%30%36%63%5c%75%30%30%36%35%5c%75%30%30%37%32%5c%75%30%30%37%34(15)"></a>
Answer: The javascript will execute.
先HTML解码,得到
<a href="javascript:%5c%75%30%30%36%31%5c%75%30%30%36%63%5c%75%30%30%36%35%5c%75%30%30%37%32%5c%75%30%30%37%34(15)"></a>
在href中由URL模块处理,解码得到
javascript:\u0061\u006c\u0065\u0072\u0074(15)
识别JS协议,然后由JS模块处理,解码得到
javascript:alert(15)
最后被执行
<script>
和<style>
数据只能有文本,不会有HTML解码和URL解码操作
<textarea>
和<title>
里会有HTML解码操作,但不会有子元素
其他元素数据(如div
)和元素属性数据(如href
)中会有HTML解码操作
部分属性(如href
)会有URL解码操作,但URL中的协议需为ASCII
JavaScript会对字符串和标识符Unicode解码
根据浏览器的自动解码,反向构造 XSS Payload 即可