The Unicode Searcher:

An Explanation


Hi. If you're here, it's probably because you want to know what makes this thing tick and perhaps how to get your very own installation of it. I'll explain that in a moment. First, why does it exist? Because I got tired of it taking forever to put Japanese characters into web pages.

The most reliable way to do so is to put a numeric HTML character entity on the page (e.g., "&#12471;" for the Katakana letter "shi": シ), but there are other ways too. Shift-JIS is a whole character encoding unto itself, and it's what ends up on most pages made by actual Japanese people. But Shift-JIS requires all kinds of bizarre things: the codeset has to be installed in your browser, the page's header needs a META charset tag, and so on. The HTML entity requires none of that; it is simply the decimal form of the character's Unicode codepoint, which is usually written as a two-byte hexadecimal number. Every browser these days has most of the Unicode characters implemented in it (some more than others), and it's pretty easy to add support for things you don't have: Arabic, Thai, Japanese, Korean, and so on. And the Unicode code for a given character will never change.
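As a rough illustration of that relationship (a Python sketch for this explanation only, not the Searcher's actual code; the function name to_entity is made up here), turning a character into its decimal entity is just a matter of printing its codepoint in base 10:

```python
def to_entity(ch: str) -> str:
    """Return the decimal HTML numeric entity for a single character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # ord() gives the Unicode codepoint as an integer; the entity is that number in decimal.
    return f"&#{ord(ch)};"


shi = "\u30b7"          # Katakana "shi", codepoint 0x30B7
print(hex(ord(shi)))    # 0x30b7
print(to_entity(shi))   # &#12471;
```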

The Shift-JIS codes are all different from the Unicode numbers. "Shi" is 0x30B7 in Unicode, but 0x8356 in Shift-JIS. (The "0x" before something just means "this is a hexadecimal number.") Fortunately, the Unicode Consortium provides standardized, parseable text files that say how to convert a Shift-JIS code into Unicode and list the actual name of nearly every Unicode character, plus definitions and pronunciations for them. That is where my little searcher thingy gets its information.

The problem I kept running up against was that any time I wanted to put something Japanese into a web page, I had to perform many steps for each character. The program I use to look up Kanji ("JquickTrans") lets you copy a character to the clipboard, but only as Shift-JIS, and as I explained earlier, I don't want to put Shift-JIS into web pages. So I had to (in Windows) copy the character to the clipboard and save it into a text file, then (in Unix) run a hexdump on the file to get the Shift-JIS hexadecimal code for that character (and since the dump is displayed in "little endian" order, I also had to swap the two bytes), then look in the Unicode-provided mapping file to find the Unicode codepoint for that Shift-JIS code, then convert that to decimal, and then put "&#" in front of it and ";" after it. Only then could I put it in a web page. Douglas Adams once said something about how he derived great pleasure from spending all day creating a computer program that would automate a task that takes him ten seconds to do by hand. I suppose I'm no different... and thus, the searcher was born.
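For the curious, the manual steps described above can be compressed into a few lines. This is only a sketch in Python, not how the Searcher does it: it assumes Python's built-in "shift_jis" codec as a stand-in for the Unicode-provided mapping file, and the helper names swap_bytes and sjis_hex_to_entity are invented for the example.

```python
def swap_bytes(dumped: str) -> str:
    """Undo the byte swap of a 16-bit word as shown by a little-endian hexdump,
    e.g. '5683' becomes '8356'."""
    if len(dumped) != 4:
        raise ValueError("expected four hex digits")
    return dumped[2:4] + dumped[0:2]


def sjis_hex_to_entity(sjis_hex: str) -> str:
    """Convert one Shift-JIS code, given as hex digits, into a decimal HTML entity."""
    # Assumption: the built-in codec replaces the lookup in the mapping file.
    ch = bytes.fromhex(sjis_hex).decode("shift_jis")
    if len(ch) != 1:
        raise ValueError("expected the bytes of exactly one character")
    return f"&#{ord(ch)};"


print(swap_bytes("5683"))           # 8356
print(sjis_hex_to_entity("8356"))   # &#12471;
```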

The application I created to do all this dirty work for me consists of three pieces: the actual search page, the mapping/descriptions data stored in MySQL, and a parser that creates this data set from the five Unicode-provided mapping files. This way, if the Unicode Consortium updates one of its data files, all you have to do is delete your copy of it, run the parser, and you're up to date. The parser takes less than two minutes to run on my machine (yours too, probably).
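To give a feel for what such a parser has to do, here is a minimal Python sketch. It is not the real parser. It assumes a tab-separated layout (Shift-JIS code in hex, Unicode codepoint in hex, then a "#" comment holding the character name) with full-line comments starting with "#"; the real files may differ in detail. The sample lines and the name parse_mapping are illustrative only, and loading the rows into MySQL is left out.

```python
# Illustrative excerpt; the exact file layout is an assumption of this sketch.
SAMPLE = """\
# Shift-JIS code, Unicode codepoint, character name
0x8356\t0x30B7\t# KATAKANA LETTER SI
0x8363\t0x30C4\t# KATAKANA LETTER TU
"""


def parse_mapping(text):
    """Yield (sjis_code, unicode_codepoint, name) tuples from mapping text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a mapping line
        sjis = int(fields[0], 16)   # int() accepts the "0x" prefix when base is 16
        ucs = int(fields[1], 16)
        name = fields[2].lstrip("# ").strip() if len(fields) > 2 else ""
        yield sjis, ucs, name


rows = list(parse_mapping(SAMPLE))
for sjis, ucs, name in rows:
    print(f"{sjis:#06x} -> U+{ucs:04X} (decimal {ucs}) {name}")
```

Each tuple would then become one row in the database, keyed by either code.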

The code charting page shows all 256 code positions in the current block, whether a character exists there or not. Each block is color-coded so you can tell them apart. Ones that appear grey are undefined. If you hover your mouse over any character, a little popup tooltip will appear that lists everything there is to know about that character. This makes the pages huge (over 100K each), but oh well... it's too handy to drop, even though it's kind of slow this way.
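A chart like that can be laid out as a 16-by-16 grid. The Python sketch below is a guess at the idea rather than the page's real code: it assumes each chart starts at a multiple of 256, and it uses Python's own unicodedata module where the Searcher would read from its database. The name block_rows is made up.

```python
import unicodedata


def block_rows(codepoint: int):
    """Return the 256 code positions around `codepoint` as 16 rows of 16 cells.

    Each cell is (codepoint, name); name is None where Python's Unicode
    database has no name for that position (e.g. unassigned codepoints).
    """
    base = codepoint & ~0xFF  # round down to a multiple of 256 (assumed chart boundary)
    rows = []
    for r in range(16):
        row = []
        for c in range(16):
            cp = base + r * 16 + c
            row.append((cp, unicodedata.name(chr(cp), None)))
        rows.append(row)
    return rows


chart = block_rows(0x30B7)
print(hex(chart[0][0][0]))   # 0x3000
print(chart[0xB][7])         # (12471, 'KATAKANA LETTER SI')
```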

There are several kinds of searching available. You can look for a certain Shift-JIS hex code, a decimal Unicode codepoint, or the description, English definition, or Japanese pronunciation of the character you want. Note that the descriptions of Katakana and Hiragana characters are a bit nonstandard; for example, the "shi" I mentioned above is instead called "si". "Tsu" is called "tu", "fu" is "hu", etc. I hope you know what I mean by this. (The data also has characters I didn't even know existed, "wi" and "vu" for instance, which I had no idea were even part of their alphabet.) So to find the Katakana version of the "tsu" character, you'd put "kata letter tu" in the search box.
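The spelling difference is easy to see with Python's unicodedata module, which looks characters up by their official Unicode names. This is just an illustration, not the Searcher's search code; the HEPBURN_TO_UNICODE table covers only the three examples mentioned above, and the function name kana is invented.

```python
import unicodedata

# Familiar spellings mapped to the spellings used in Unicode character names.
# Only the three examples from the text; not a complete table.
HEPBURN_TO_UNICODE = {"shi": "si", "tsu": "tu", "fu": "hu"}


def kana(script: str, syllable: str) -> str:
    """Look up a kana by script ('katakana' or 'hiragana') and syllable."""
    syllable = syllable.lower()
    syllable = HEPBURN_TO_UNICODE.get(syllable, syllable)
    # Builds a name such as "KATAKANA LETTER TU"; raises KeyError if no such name exists.
    return unicodedata.lookup(f"{script} LETTER {syllable}".upper())


tsu = kana("katakana", "tsu")
print(unicodedata.name(tsu))   # KATAKANA LETTER TU
print(f"&#{ord(tsu)};")        # &#12484;
```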

If you search for an English definition of something, you'll get much more than just Japanese. You can also get Korean, Vietnamese, Cantonese, and Mandarin out of it, and pronunciations are available for all of those languages.

There is sadly no way to search for words made up of more than one Kanji character. If you want to find, say, "baka," you'll have to search for the two Kanji that make it up separately (馬 [uma] and 鹿 [shika]) and then put them together. It took me about thirty seconds to find those two Kanji and put them in this page using the Searcher. :) The same thing applies to words that are a combination of Kanji and Hiragana. Sorry. Maybe someday.

"Wow," you say, "this is great! Where do I get one?" Well, simply download this distribution of it, unpack it somewhere, and read the INSTALL file. It shouldn't take more than five minutes to install if you already have a working apache/php/mysql server set up.

And I think that about covers it. If you have questions, problems, praise, or job offers, you can email me and I'll see what I can do.