# Characters over the wire #

Standards on sending, and parsing characters over the web.

## Basic idea ##

- **Assign** a number to each character using a Character set.
- **Encode** the number to bytes using an encoding scheme.
- Transfer bytes over the internet

The terms character set is used interchangably with character encoding and code pages.

---

# Common character sets #

## ASCII ##

- It assigns character to number mapping from 0-127 and covers english characters and some control codes (eg: new lines, tabs)
- Not everything from 0-127 is mapped.

## Latin ##

- Also called ISO-8859-1 character set.
- This is an extension of ASCII and covers the Latin alphabet - think english looking alphabets with diacritics. eg: Àä
- Number mappings upto 255.

## Windows 1252 ##

- Super set of Latin character set.
- Introduced by Microsoft.


## Unicode ##

- Capable of defining a mapping for 1.1 million characters.
- Currently 150000 are defined.
- Each mapping is also called a unicode code point.
- Most languages - ஐ, ह
- Emojis 😮, 🤔
- Math ∫x.dx


---

# Common encoding schemes #

- An encoding scheme will encode the number to one or more bytes.

## Single byte encoding schemes ##

- Uses up only one byte.
- Suitable for ASCII, Latin and Windows 1252 character sets.
- ASCII would only take up 7 bits, while Latin and Windows 1252 would take up 8 bits.
- Because Windows 1252 is a superset of Latin, which is also a super set of ASCII, for a very long time in the past, the most used encoding scheme was Windows 1252.
- Today, it only accounts for 1.4% of the internet traffic.

```
~~~enc-check -8 abcd

~~~
```

---

# Common encoding schemes #

- An encoding scheme will encode the number to one or more bytes.

## Multi byte encoding schemes ##

### UTF - 8 ###

- Variable byte encoding scheme.
- 1 - 4 bytes to represent a unicode code point.
- Backward compatible with ASCII.
- Can represent a maximum number of 2097152 code points.
- 99% of the internet uses this encoding scheme.


 | Byte 1   | Byte 2   | Byte 3   | Byte 4   | Available bits
 |----------|----------|----------|----------|----------------|
 | 0xxxxxxx | -        | -        | -        | 7              |
 | 110xxxxx | 10xxxxxx | -        | -        | 11             |
 | 1110xxxx | 10xxxxxx | 10xxxxxx | -        | 16             |
 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 21             |


```
~~~enc-check -8 abஐह🤔

~~~
```

---

# Common encoding schemes #

- An encoding scheme will encode the number to one or more bytes.

## Multi byte encoding schemes ##


### UTF - 16 ###

- Variable byte encoding scheme.
- 2 or 4 bytes to represent a unicode code point.

```
~~~enc-check -6 abஐह🤔

~~~

```

---

# URL Encoding #

- Applicable only for HTTP traffic.
- Some characters have a special meaning in the url string Eg: &, #, ?
- The url string should also be only in ASCII.
- These characters should be treated differently.

## Steps to URL-encode a string ##

- Encode the string in one of the encoding schemes.
- If a particular character cannot appear in the url string, or is not ASCII, print the hex representation of the string, prefixed with a `%`.


```
~~~enc-check -8 &?

~~~
```

- For example, if the url string `p1&/pw?` were to be url-encoded under utf-8 encoding, then it would be `p1%26/pw%3f`


```
~~~enc-check -6 &?

~~~
```

- Under utf-16 encoding, it would be `p1%00%26/pw%00%3f`

---

# What should be supported in applications? #

- Support Unicode code points encoded as utf-8 characters.
- URL encode under utf-8.


---

# What is a character? #

- It is a group of unicode code points - also called a grapheme cluster.
- Eg: the character 'ப்' consists of 2 unicode code points as seen below.


```
~~~enc-check -8 ப்

~~~
```

- Number of characters in a string is often different from `string.Length`.
- Some languages (eg: python) return the number of unicode code points.
- Some languages (eg: C#) will return the number of utf-16 bytes to encode the complete string.
- The below emoji is of length 1 in python and length 4 in c#.


```
~~~enc-check -6 🤔

~~~
```

- Be careful about advertising character length limitations.