NOTE: Affiliate links to Amazon products in this post are labelled, and will generate a commission for me.
In the past two posts, we’ve talked about how computer RAM works, and why each location in RAM can only store a single number between 0 and 255. This week, let’s talk a bit about how computers store data which isn’t numeric, and we’ll start with the English alphabet.
Alphabet Soup
For speakers of English, the alphabet contains twenty-six (26) letters, arranged in a certain order. Young school children learn many different songs about it to remember the letters and their order.
Because there is an order, we can say things like:
The letter ‘A’ is the first letter of the alphabet.
And:
The eighth letter of the alphabet is ‘H’.
Put a different way, we can map a relationship between the letters of the alphabet and their numerical position in the list:
- A = 1
- B = 2
- C = 3
- …
- Y = 25
- Z = 26
This property of the alphabet, sometimes called the A1Z26 cipher, has been used to hide “secret messages” for years. Ralphie’s hidden message from Little Orphan Annie in one of my favorite holiday movies, A Christmas Story (1983) (affiliate link), is a prime example.
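Here is a minimal sketch of that letter-to-position mapping in Python. The function and table names are made up for illustration, and the sketch assumes the input contains only the letters A through Z.

```python
import string

# Map each upper-case letter to its 1-based position: A -> 1, ..., Z -> 26.
LETTER_TO_NUMBER = {
    letter: position
    for position, letter in enumerate(string.ascii_uppercase, start=1)
}
# The reverse table turns a position back into its letter.
NUMBER_TO_LETTER = {number: letter for letter, number in LETTER_TO_NUMBER.items()}


def a1z26_encode(word):
    """Turn a word into a list of alphabet positions (letters only)."""
    return [LETTER_TO_NUMBER[ch] for ch in word.upper()]


def a1z26_decode(numbers):
    """Turn a list of alphabet positions back into upper-case letters."""
    return "".join(NUMBER_TO_LETTER[n] for n in numbers)


print(a1z26_encode("Hello"))             # [8, 5, 12, 12, 15]
print(a1z26_decode([8, 5, 12, 12, 15]))  # HELLO
```

Notice that decoding gives back "HELLO" rather than "Hello": with only 26 numbers, there is nowhere to record whether a letter was upper or lower case.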
Mapping Letters to Numbers
Because we can map letters to numbers, it makes sense that the computer could do the same thing, right? Well, yes, but there are a few special cases that need to be handled to make it all work.
NOTE: Finding a general relationship, then handling special cases, is normal in computer programming. In fact, it’s so common that computer programmers have names for these special cases: edge cases and corner cases. Not properly handling edge and corner cases can lead to bugs in your code, which are often not easy to find.
- Our original list of letters only showed upper-case letters, so the first edge case to handle is lower-case letters. We’ll need 26 more numbers to represent the lower-case letters, for a total of 52.
- Next, there is punctuation. A quick look at your keyboard will show a host of non-letter characters, like the ampersand (&), semi-colon (;), and question mark (?). We’ll need numbers for all of those as well, and there are around 32 of them. So now, we’re up to 84.
- Then there are the numbers themselves, written as individual characters. We’ve got ten of those, zero through nine, so our list grows to 94 numbers.
- Then there are things like a single space between words, a TAB for indenting paragraphs, carriage returns for blank lines, and end of page marks. Collectively, these are called white space, and it turns out there are 6 of them, for a total of 100.
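If you want to check that tally yourself, here is a small sketch. It assumes Python's standard `string` module, whose character groups happen to line up with the four cases above; the `groups` dictionary and its labels are just for illustration.

```python
import string

# Each group from the list above, paired with the characters Python puts in it.
groups = {
    "upper-case letters": string.ascii_uppercase,
    "lower-case letters": string.ascii_lowercase,
    "punctuation": string.punctuation,
    "digits": string.digits,
    "white space": string.whitespace,
}

running_total = 0
for name, chars in groups.items():
    running_total += len(chars)
    print(f"{name}: {len(chars)} (running total: {running_total})")
```

The running totals come out to 26, 52, 84, 94, and finally 100.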
Now, if we arrange all of these in some order that makes sense to us, and map numbers to them, we’ll have a way to store any text information we want in computer memory. All we have to do is convert our text into numbers when storing it in memory, and convert the numbers back to text when retrieving the data. Converting text into its numeric format and back is called encoding and decoding the text, and the map showing which letter corresponds to which number is called a character encoding.
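To make that concrete, here is a toy character encoding sketched in Python. The ordering is one I made up for illustration (upper case, then lower case, punctuation, digits, and white space), so the numbers it produces are not the ones any real computer uses.

```python
import string

# A made-up ordering of the 100 characters counted in the list above.
# This is a toy encoding for illustration only.
ALPHABET = (
    string.ascii_uppercase
    + string.ascii_lowercase
    + string.punctuation
    + string.digits
    + string.whitespace
)

# The "character encoding": one table for each direction.
ENCODE_TABLE = {ch: number for number, ch in enumerate(ALPHABET)}
DECODE_TABLE = {number: ch for ch, number in ENCODE_TABLE.items()}


def encode(text):
    """Convert text into a list of numbers using the toy table."""
    return [ENCODE_TABLE[ch] for ch in text]


def decode(numbers):
    """Convert a list of numbers back into text using the toy table."""
    return "".join(DECODE_TABLE[n] for n in numbers)


numbers = encode("Hi there!")
print(numbers)
print(decode(numbers))  # Hi there!
```

Every number in this table fits between 0 and 99, so each character fits comfortably in a single memory location.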
However, we don’t need to do any of this ourselves, because it’s already been done…
Enter ASCII
Back in the 1960s, when computers were first being used for broad communications, the precursor to the American National Standards Institute (ANSI) worked with computer companies to create ASCII, which stands for American Standard Code for Information Interchange. It specifies a character encoding, showing which numbers correspond to which letters (called printable characters), as well as numbers for invisible control characters which are meant to control devices like printers.
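You can peek at the ASCII numbers with a few lines of Python. This sketch relies only on the built-in `ord`, `chr`, and `str.encode` functions; the variable names are illustrative.

```python
text = "Hello"

# Encoding turns the text into bytes, one number per character in ASCII.
encoded = text.encode("ascii")
print(list(encoded))            # [72, 101, 108, 108, 111]

# Decoding turns those numbers back into text.
print(encoded.decode("ascii"))  # Hello

# ord() and chr() convert a single character to its number and back.
print(ord("A"), ord("a"))       # 65 97
print(chr(72))                  # H

# Number 10 is a control character (line feed), not a printable one.
print(repr(chr(10)))            # '\n'
```

Upper-case and lower-case letters get different numbers, which is exactly the first edge case described earlier.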
While ASCII worked well for its time, and is still used today, it has its shortcomings:
- ASCII was designed specifically for American English text, and had no provisions for non-English characters.
- Local and regional variations on ASCII to overcome missing characters often meant text written in one location would not be readable in another.
- Even different computer vendors would use their own variations on ASCII, meaning text written on one computer may not be readable on another in the same location.
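Both problems are easy to reproduce. The sketch below assumes two extended character sets that ship with Python, Latin-1 and code page 437 (the one used by the original IBM PC), as stand-ins for the regional and vendor variations described above.

```python
import unicodedata

# Problem 1: plain ASCII has no number for an accented letter.
try:
    "café".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as error:
    ascii_failed = True
    print("Cannot encode as ASCII:", error.reason)

# Problem 2: the same stored number means different things
# depending on which variation is used to read it back.
raw = bytes([0xE9])  # a single byte holding the number 233
as_latin1 = raw.decode("latin-1")
as_cp437 = raw.decode("cp437")

print("Latin-1 reads it as:", unicodedata.name(as_latin1))
print("Code page 437 reads it as:", unicodedata.name(as_cp437))
```

The byte never changes, but the two tables disagree about which character it stands for, so text saved with one reads as gibberish with the other.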
Today, most computers use a character encoding called Unicode, which can use anywhere from one to four bytes to encode each character. The first 128 entries in Unicode are the same as the ASCII table, which allows older text data to be read. However, because Unicode can use more bytes to store data, it is able to encode characters from different alphabets (such as Cyrillic, Arabic, Chinese, Hindi, and many others) at the same time.
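As a rough illustration, the sketch below assumes UTF-8, which is one common way of writing Unicode characters as bytes (there are others). The sample characters are my own picks from a few different alphabets.

```python
import unicodedata

# One character each from Latin, accented Latin, Cyrillic, Chinese, and emoji.
samples = ["A", "é", "Ж", "中", "😀"]

byte_counts = []
for ch in samples:
    utf8 = ch.encode("utf-8")
    byte_counts.append(len(utf8))
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {len(utf8)} byte(s)")

# Plain ASCII text produces exactly the same bytes under UTF-8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))  # True
```

The byte counts come out as 1, 2, 2, 3, and 4, and the final comparison shows why older ASCII text can still be read today.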
Now, knowing we can store individual letters in memory, how do we store complete words, sentences, paragraphs, and even whole documents?
Memory Blocks
When I first started talking about computer RAM, I mentioned that each location in RAM has an address, starting with zero and increasing to the maximum size of memory. This implies that the memory byte at address 100 is next to the byte at address 101, which is next to address 102, and so on. I also mentioned that when you need to store data, you ask the computer for memory in which to store it, and it gives you the address of that memory.
So what happens when you ask the computer for enough memory to store the word “Hello”?
It turns out the computer gives you the memory address of the first byte in a block of memory. That block is large enough (five bytes) to store the entire word:
100 | 101 | 102 | 103 | 104 |
---|---|---|---|---|
H | e | l | l | o |
Since you know the address of the first byte of the block, you can always find the address of any part of the block through addition. The address of the “H” is 100, the address of the “e” is 100+1, the address of the first “l” is 100+2, and so on.
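The same arithmetic, sketched in Python. The base address of 100 comes from the example above; the `address_of` helper is a hypothetical name used only for illustration.

```python
BASE_ADDRESS = 100  # address of the first byte in the block, as in the example
word = "Hello"


def address_of(base, index):
    """Address of the character at a given position within the block."""
    return base + index


for index, ch in enumerate(word):
    address = address_of(BASE_ADDRESS, index)
    print(f"address {address}: {ch!r} stored as {ord(ch)}")
```

This prints one line per character, from address 100 for the "H" through address 104 for the "o".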
So if we want to store “Hello” in a variable called greeting, our request for memory looks like this:
1. Your program asks the computer for a place to store the word “Hello”.
2. The computer gives your program the address of the first memory location in a five-byte block (called a base address).
3. Your program stores the base address in a table, and labels it as greeting.
4. Your program then tells the computer to store each character of “Hello” in sequential memory locations, starting with the base address it returned in step 2.
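Those four steps can be acted out with a pretend block of RAM. This is only a simulation: the `memory` array, the `allocate` function, and the starting address of 100 are all made up for illustration, and a real computer's memory manager is far more involved.

```python
memory = bytearray(256)  # pretend RAM: 256 one-byte locations, all zero
next_free = 100          # hypothetical: where the next unused block begins
variables = {}           # the table that maps labels to base addresses


def allocate(size):
    """Reserve `size` bytes and hand back the base address of the block."""
    global next_free
    base = next_free
    next_free += size
    return base


text = "Hello"

# Steps 1 and 2: ask for a block big enough, and get back its base address.
base_address = allocate(len(text))

# Step 3: store the base address in a table under the label "greeting".
variables["greeting"] = base_address

# Step 4: store each character in sequential locations from the base address.
for offset, ch in enumerate(text):
    memory[base_address + offset] = ord(ch)

# Reading it back: look up the label, then read five bytes from that address.
start = variables["greeting"]
print(start)                                            # 100
print(memory[start:start + len(text)].decode("ascii"))  # Hello
```

Note that the table only remembers where the block starts. This sketch reuses `len(text)` to know how many bytes to read back.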
Of course, this presumes we are using ASCII encoding, which, as mentioned above, is no longer common. Most modern computers use Unicode, which can use multiple bytes to represent each character. This means the block returned for Unicode text might be larger, and each character can take up multiple bytes instead of just one.
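To get a feel for how much larger the block can be, the sketch below measures two words under three common ways of writing Unicode as bytes (UTF-8, UTF-16, and UTF-32). The Russian word "Привет" is my own choice of example.

```python
words = ["Hello", "Привет"]  # 5 characters and 6 characters
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

sizes = {}
for word in words:
    for encoding in encodings:
        # The length of the encoded bytes is the block size we would need.
        sizes[(word, encoding)] = len(word.encode(encoding))
        print(f"{len(word)} characters as {encoding}: "
              f"{sizes[(word, encoding)]} bytes")
```

"Hello" needs 5, 10, or 20 bytes depending on the encoding, while the six-letter "Привет" needs 12, 12, or 24. The number of characters alone no longer tells you the size of the block.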
What we still don’t know is how the computer stores really big numbers. That’s what we’ll dive into next time.