How to configure your web application to correctly deal with character set/encoding issues

Background

I'll just link to the best places I've found to read up on the background. Read these first if you want to understand what's going on:

The Details

In order to get your website working harmoniously with one character set, you first have to pick one. I picked ISO-8859-1 (sometimes referred to as latin1). It's the most popular set for English and other Western European (Latin-alphabet) languages, though it can't represent characters from most other scripts, such as Cyrillic, Greek, or Chinese.

Unfortunately, despite the fact that Unicode is the be-all and end-all, PHP is really horrible at dealing with Unicode (specifically multi-byte strings). Since I've got enough to worry about without having to also check every function I'm using for multi-byte compatibility, ISO-8859-1 is right for me, for now. I hear PHP 6 is planned to fully support multi-byte strings.

In order to get your web application working correctly with one character set, there are basically three parts you must take care of:

  1. The database
  2. The web server
  3. The web page itself

Database config

In a perfect world, you would compile your database to default to your preferred character set. I won't cover that here.

Assuming you cannot do that, you need to create your tables using the correct set. Keep in mind, at least on MySQL, the character set can be configured all the way down to the column level. Here we will just set it at a database level. (See here for much more detail):

CREATE DATABASE mydb CHARACTER SET latin1 COLLATE latin1_swedish_ci;
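
If you do need finer control, the same clauses can be used at the table and column level. Here is a rough sketch, assuming MySQL and a made-up table called articles (the table and column names are only for illustration):

```sql
-- Hypothetical table; the names are placeholders, not part of any real schema.
CREATE TABLE articles (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    -- Column-level character set, overriding the table default if they differ
    title VARCHAR(255) CHARACTER SET latin1 COLLATE latin1_swedish_ci NOT NULL,
    -- No explicit set here, so this column inherits the table default
    body  TEXT
) DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci;

-- Shows which character set each column actually ended up with
SHOW CREATE TABLE articles;
```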

Next, we need to configure how the database delivers the results of queries to the calling application. Assuming you don't (or can't) ensure this is done at compile time, you can use the following command before sending a query:

SET NAMES charset

According to the linked article above, this is shorthand for setting the character_set_client, character_set_connection, and character_set_results variables in MySQL (collation_connection is then set implicitly to the default collation of that character set). A good place to put this query is in the constructor of your database access class.
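
To make that concrete, here is what I understand the shorthand to expand to, assuming latin1 is the set you picked. Treat it as a sketch based on my reading of the MySQL manual rather than a definitive reference:

```sql
-- The shorthand, run once per connection before any other query
SET NAMES latin1;

-- As far as I can tell, the statement above is equivalent to these three:
SET character_set_client = latin1;
SET character_set_results = latin1;
SET character_set_connection = latin1;

-- Check what the current connection is actually using
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation_connection';
```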

Web Server

Assuming you are using Apache, you need to edit a setting in your httpd.conf file (the primary Apache config file).

AddDefaultCharset charset

Set this to the character set you want Apache to announce to the client when a response does not specify one. (This is important because in some cases the client will trust the server's header even if a different character set is specified in the document.)
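
For the ISO-8859-1 setup described in this article, the line would look something like this. I'm assuming a stock Apache install here; the directive can also go in a virtual host or .htaccess file if your configuration allows it:

```apache
# httpd.conf
# Adds "charset=ISO-8859-1" to text/html and text/plain responses
# that don't already declare a character set.
AddDefaultCharset ISO-8859-1
```

After restarting Apache, the response headers for an HTML page should include a line like Content-Type: text/html; charset=ISO-8859-1.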

Also, whatever server-side language you're using will have a way to set the character set in outgoing headers. In PHP you use the header() function. See the PHP documentation for details.
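
A minimal sketch of the PHP version, assuming ISO-8859-1 and an HTML response:

```php
<?php
// Must be called before any output (including whitespace) is sent,
// otherwise the headers have already gone out.
header('Content-Type: text/html; charset=ISO-8859-1');
```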

The web page

Finally, you should specify the character set of the page you're serving with a meta tag at the top of the page, just below the opening <head> tag. It looks like this:


<meta http-equiv="Content-Type" content="text/html; charset=charset" />
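
Filled in for ISO-8859-1, the top of a page might look like the sketch below. The title is just a placeholder; the point is that the meta tag comes before anything that contains text:

```html
<html>
<head>
<!-- Declare the character set first, before the title or any other text -->
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<title>Example page</title>
</head>
<body>
</body>
</html>
```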

Getting obsessive

Most browsers support specifying the character set that a form should accept (the browser will convert the submitted text to that set if it was entered in a different one). To set this manually, include the following attribute in the <form> tag:

 accept-charset='ISO-8859-1'
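
In context, a complete form might look something like this. The action URL and field name are made up for the example:

```html
<!-- submit.php and "comment" are hypothetical names -->
<form action="submit.php" method="post" accept-charset="ISO-8859-1">
<input type="text" name="comment" />
<input type="submit" value="Send" />
</form>
```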

Doing the above should ensure that you operate using the same character set throughout your web application, and hopefully garbage characters and question marks will be forever in the past.

About the author

Jeremy Tunnell
I study meditation and write some software.
