Explicitly add encoding tables in slides.md

cool-mist 2023-11-23 00:30:14 +05:30
parent 6b03053759
commit b504fc57e0

@@ -22,7 +22,7 @@ The terms character set is used interchangeably with character encoding and code
## Latin ## ## Latin ##
- Also called ISO-8859-1 character set. - Also called ISO-8859-1 character set.
- This is an extension of ASCII and covers the Latin alphabet - think english looking alphabets with diacritics. eg: Àä - This is an extension of ASCII and covers the Latin alphabet - À,ä...
- Number mappings up to 255. - Number mappings up to 255.
## Windows 1252 ## ## Windows 1252 ##
@@ -30,7 +30,6 @@ The terms character set is used interchangeably with character encoding and code
- Superset of Latin character set. - Superset of Latin character set.
- Introduced by Microsoft. - Introduced by Microsoft.
## Unicode ## ## Unicode ##
- Capable of defining a mapping for 1.1 million characters. - Capable of defining a mapping for 1.1 million characters.
@@ -40,7 +39,6 @@ The terms character set is used interchangeably with character encoding and code
- Emojis 😮, 🤔 - Emojis 😮, 🤔
- Math ∫x.dx - Math ∫x.dx
--- ---
# Common encoding schemes # # Common encoding schemes #
@@ -56,9 +54,14 @@ The terms character set is used interchangeably with character encoding and code
- Today, it only accounts for 1.4% of the internet traffic. - Today, it only accounts for 1.4% of the internet traffic.
``` ```
~~~enc-check -8 abcd ┌───────┬───────┬───────────┬──────┬─────┬─────┬──────────┐
│ U+dec │ U+hex │ character │ byte │ hex │ dec │ bin │
~~~ ├───────┼───────┼───────────┼──────┼─────┼─────┼──────────┤
│ 97 │ 61 │ a │ 0 │ 61 │ 97 │ 01100001 │
│ 98 │ 62 │ b │ 1 │ 62 │ 98 │ 01100010 │
│ 99 │ 63 │ c │ 2 │ 63 │ 99 │ 01100011 │
│ 100 │ 64 │ d │ 3 │ 64 │ 100 │ 01100100 │
└───────┴───────┴───────────┴──────┴─────┴─────┴──────────┘
``` ```
--- ---
@@ -87,9 +90,22 @@ The terms character set is used interchangeably with character encoding and code
``` ```
~~~enc-check -8 abஐह🤔 ┌────────┬───────┬───────────┬──────┬─────┬─────┬──────────┐
│ U+dec │ U+hex │ character │ byte │ hex │ dec │ bin │
~~~ ├────────┼───────┼───────────┼──────┼─────┼─────┼──────────┤
│ 97 │ 61 │ a │ 0 │ 61 │ 97 │ 01100001 │
│ 98 │ 62 │ b │ 1 │ 62 │ 98 │ 01100010 │
│ 2960 │ b90 │ ஐ │ 2 │ e0 │ 224 │ 11100000 │
│ │ │ │ 3 │ ae │ 174 │ 10101110 │
│ │ │ │ 4 │ 90 │ 144 │ 10010000 │
│ 2361 │ 939 │ ह │ 5 │ e0 │ 224 │ 11100000 │
│ │ │ │ 6 │ a4 │ 164 │ 10100100 │
│ │ │ │ 7 │ b9 │ 185 │ 10111001 │
│ 129300 │ 1f914 │ 🤔 │ 8 │ f0 │ 240 │ 11110000 │
│ │ │ │ 9 │ 9f │ 159 │ 10011111 │
│ │ │ │ 10 │ a4 │ 164 │ 10100100 │
│ │ │ │ 11 │ 94 │ 148 │ 10010100 │
└────────┴───────┴───────────┴──────┴─────┴─────┴──────────┘
``` ```
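The table above can be reproduced with a few lines of Python. This is a minimal sketch, not the `enc-check` implementation; the helper name `utf8_rows` is hypothetical and the column layout only approximates the tool's output.

```python
def utf8_rows(text):
    """Yield (code point, character, byte offset, byte value) per UTF-8 byte."""
    offset = 0
    for ch in text:
        # Each code point encodes to between 1 and 4 bytes in UTF-8.
        for byte in ch.encode("utf-8"):
            yield ord(ch), ch, offset, byte
            offset += 1


# Same input as the slide: a, b, U+0B90, U+0939, U+1F914.
sample = "ab\u0b90\u0939\U0001F914"

for code_point, ch, offset, byte in utf8_rows(sample):
    # Columns: U+dec, U+hex, character, byte offset, hex, dec, bin
    print(f"{code_point:>6} {code_point:>5x} {ch} {offset:>2} "
          f"{byte:02x} {byte:>3} {byte:08b}")
```

Unlike the table, this sketch repeats the code point on every row instead of leaving continuation rows blank.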
--- ---
@@ -107,10 +123,22 @@ The terms character set is used interchangeably with character encoding and code
- 2 or 4 bytes to represent a unicode code point. - 2 or 4 bytes to represent a unicode code point.
``` ```
~~~enc-check -6 abஐह🤔 ┌────────┬───────┬───────────┬──────┬─────┬─────┬──────────┐
│ U+dec │ U+hex │ character │ byte │ hex │ dec │ bin │
~~~ ├────────┼───────┼───────────┼──────┼─────┼─────┼──────────┤
│ 97 │ 61 │ a │ 0 │ 00 │ 0 │ 00000000 │
│ │ │ │ 1 │ 61 │ 97 │ 01100001 │
│ 98 │ 62 │ b │ 2 │ 00 │ 0 │ 00000000 │
│ │ │ │ 3 │ 62 │ 98 │ 01100010 │
│ 2960 │ b90 │ ஐ │ 4 │ 0b │ 11 │ 00001011 │
│ │ │ │ 5 │ 90 │ 144 │ 10010000 │
│ 2361 │ 939 │ ह │ 6 │ 09 │ 9 │ 00001001 │
│ │ │ │ 7 │ 39 │ 57 │ 00111001 │
│ 129300 │ 1f914 │ 🤔 │ 8 │ d8 │ 216 │ 11011000 │
│ │ │ │ 9 │ 3e │ 62 │ 00111110 │
│ │ │ │ 10 │ dd │ 221 │ 11011101 │
│ │ │ │ 11 │ 14 │ 20 │ 00010100 │
└────────┴───────┴───────────┴──────┴─────┴─────┴──────────┘
``` ```
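The 4-byte row for 🤔 in the table above is a surrogate pair. A minimal sketch of how the two 16-bit units are derived is shown below, assuming big-endian byte order as in the table; the helper name `utf16_units` is hypothetical, and the result is cross-checked against Python's built-in `utf-16-be` codec.

```python
def utf16_units(ch):
    """Return the UTF-16 code units (as integers) for a single code point."""
    code_point = ord(ch)
    if code_point < 0x10000:
        # Basic Multilingual Plane: one 16-bit unit (2 bytes).
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    return [high, low]


for ch in "ab\u0b90\u0939\U0001F914":
    units = utf16_units(ch)
    as_bytes = b"".join(u.to_bytes(2, "big") for u in units)
    # The hand-rolled split should agree with the standard codec.
    assert as_bytes == ch.encode("utf-16-be")
    print(f"U+{ord(ch):04x}", as_bytes.hex(" "))
```

For U+1F914 this yields the units `d83e` and `dd14`, matching bytes 8 to 11 of the table.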
--- ---
@@ -127,20 +155,26 @@ The terms character set is used interchangeably with character encoding and code
- Encode the string in one of the encoding schemes. - Encode the string in one of the encoding schemes.
- If a particular character cannot appear in the url string, or is not ASCII, print the hex representation of the string, prefixed with a `%`. - If a particular character cannot appear in the url string, or is not ASCII, print the hex representation of the string, prefixed with a `%`.
``` ```
~~~enc-check -8 &? ┌───────┬───────┬───────────┬──────┬─────┬─────┬──────────┐
│ U+dec │ U+hex │ character │ byte │ hex │ dec │ bin │
~~~ ├───────┼───────┼───────────┼──────┼─────┼─────┼──────────┤
│ 38 │ 26 │ & │ 0 │ 26 │ 38 │ 00100110 │
│ 63 │ 3f │ ? │ 1 │ 3f │ 63 │ 00111111 │
└───────┴───────┴───────────┴──────┴─────┴─────┴──────────┘
``` ```
- For example, if the url string `p1&/pw?` were to be url-encoded under utf-8 encoding, then it would be `p1%26/pw%3f` - For example, if the url string `p1&/pw?` were to be url-encoded under utf-8 encoding, then it would be `p1%26/pw%3f`
``` ```
~~~enc-check -6 &? ┌───────┬───────┬───────────┬──────┬─────┬─────┬──────────┐
│ U+dec │ U+hex │ character │ byte │ hex │ dec │ bin │
~~~ ├───────┼───────┼───────────┼──────┼─────┼─────┼──────────┤
│ 38 │ 26 │ & │ 0 │ 00 │ 0 │ 00000000 │
│ │ │ │ 1 │ 26 │ 38 │ 00100110 │
│ 63 │ 3f │ ? │ 2 │ 00 │ 0 │ 00000000 │
│ │ │ │ 3 │ 3f │ 63 │ 00111111 │
└───────┴───────┴───────────┴──────┴─────┴─────┴──────────┘
``` ```
- Under utf-16 encoding, it would be `p1%00%26/pw%00%3f` - Under utf-16 encoding, it would be `p1%00%26/pw%00%3f`
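The rule described above (encode the character, then emit each byte as `%` plus two hex digits) can be sketched as follows. The function name `percent_encode` and the exact set of characters left untouched are assumptions made for illustration; lowercase hex and big-endian UTF-16 are chosen only to match the slide's examples.

```python
import string

# Assumption: ASCII letters, digits, "-._~" and "/" are left as-is.
SAFE = set(string.ascii_letters + string.digits + "-._~/")


def percent_encode(text, encoding="utf-8"):
    """Percent-encode every character outside SAFE using the given encoding."""
    parts = []
    for ch in text:
        if ch in SAFE:
            parts.append(ch)
        else:
            # One %xx escape per byte of the encoded character.
            parts.append("".join(f"%{b:02x}" for b in ch.encode(encoding)))
    return "".join(parts)


print(percent_encode("p1&/pw?"))               # p1%26/pw%3f
print(percent_encode("p1&/pw?", "utf-16-be"))  # p1%00%26/pw%00%3f
```

In practice Python's `urllib.parse.quote` covers the UTF-8 case; it emits uppercase hex, so `quote("p1&/pw?")` gives `p1%26/pw%3F`.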
@@ -152,7 +186,6 @@ The terms character set is used interchangeably with character encoding and code
- Support Unicode code points encoded as utf-8 characters. - Support Unicode code points encoded as utf-8 characters.
- URL encode under utf-8. - URL encode under utf-8.
--- ---
# What is a character? # # What is a character? #
@@ -160,11 +193,17 @@ The terms character set is used interchangeably with character encoding and code
- It is a group of unicode code points - also called a grapheme cluster. - It is a group of unicode code points - also called a grapheme cluster.
- Eg: the character 'ப்' consists of 2 unicode code points as seen below. - Eg: the character 'ப்' consists of 2 unicode code points as seen below.
``` ```
~~~enc-check -8 ப் ┌───────┬───────┬───────────┬──────┬─────┬─────┬──────────┐
│ U+dec │ U+hex │ character │ byte │ hex │ dec │ bin │
~~~ ├───────┼───────┼───────────┼──────┼─────┼─────┼──────────┤
│ 2986 │ baa │ ப │ 0 │ e0 │ 224 │ 11100000 │
│ │ │ │ 1 │ ae │ 174 │ 10101110 │
│ │ │ │ 2 │ aa │ 170 │ 10101010 │
│ 3021 │ bcd │ ் │ 3 │ e0 │ 224 │ 11100000 │
│ │ │ │ 4 │ af │ 175 │ 10101111 │
│ │ │ │ 5 │ 8d │ 141 │ 10001101 │
└───────┴───────┴───────────┴──────┴─────┴─────┴──────────┘
``` ```
- Number of characters in a string is often different from `string.Length`. - Number of characters in a string is often different from `string.Length`.
@@ -174,9 +213,14 @@ The terms character set is used interchangeably with character encoding and code
``` ```
~~~enc-check -6 🤔 ┌────────┬───────┬───────────┬──────┬─────┬─────┬──────────┐
│ U+dec │ U+hex │ character │ byte │ hex │ dec │ bin │
~~~ ├────────┼───────┼───────────┼──────┼─────┼─────┼──────────┤
│ 129300 │ 1f914 │ 🤔 │ 0 │ d8 │ 216 │ 11011000 │
│ │ │ │ 1 │ 3e │ 62 │ 00111110 │
│ │ │ │ 2 │ dd │ 221 │ 11011101 │
│ │ │ │ 3 │ 14 │ 20 │ 00010100 │
└────────┴───────┴───────────┴──────┴─────┴─────┴──────────┘
``` ```
- Be careful about advertising character length limitations. - Be careful about advertising character length limitations.
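To make the gap between "characters" and reported length concrete, the sketch below counts the same strings three ways. It assumes `string.Length` refers to a UTF-16 code unit count, as in .NET. The grapheme count is only a rough approximation that skips combining marks; it is not the full Unicode segmentation algorithm, and all three function names are hypothetical.

```python
import unicodedata


def code_points(text):
    """Number of Unicode code points (what Python's len() reports)."""
    return len(text)


def utf16_code_units(text):
    """Number of 16-bit units in the UTF-16 form of the string."""
    return len(text.encode("utf-16-be")) // 2


def approx_graphemes(text):
    """Rough approximation: count code points that are not combining marks."""
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


# U+0BAA U+0BCD is the Tamil example from the slide; U+1F914 is the emoji.
for label, text in [("pa + virama", "\u0baa\u0bcd"), ("thinking face", "\U0001F914")]:
    print(label, code_points(text), utf16_code_units(text), approx_graphemes(text))
```

The Tamil example counts as 2 code points, 2 UTF-16 units and 1 approximate grapheme, while the emoji counts as 1 code point, 2 UTF-16 units and 1 approximate grapheme, which is why a stated "character limit" needs to say which unit it measures.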