Unisearch's Informative Information Guide

Capabilities
Codepoint Range The Unisearcher covers every Unicode codepoint between 0x00 and 0xFFFFF. That's over one million possible codepoints (1,048,576, to be exact), but at present there are really only 84,984 actual characters in the official database. If you're wondering, a "codepoint" is the number that identifies a character's position in the Unicode database; each one stands for a single character, everything from A to ず ("zu" in Japanese; see, you're learning already). Codepoints are usually given in hexadecimal, like "0x305A" for our friend ず or "0x41" for "A", but sometimes (as in HTML entities) in decimal; ず is 12378 in decimal, 305A in hex. Maybe I should've put a glossary at the start of this help file...
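For anyone who wants to check the hex-versus-decimal arithmetic above, here is a minimal sketch in Python (not Unisearch's own PHP). The function name "describe" is made up for this illustration.

```python
def describe(char):
    """Return the hex, decimal, and HTML-entity spellings of one character."""
    cp = ord(char)                 # the codepoint as a plain integer
    return {
        "hex": f"0x{cp:X}",        # hexadecimal, the usual notation
        "dec": cp,                 # decimal, as used in HTML entities
        "entity": f"&#{cp};",      # decimal HTML entity
    }

print(describe("ず"))
print(chr(int("305A", 16)))        # hex string back to the character
```

Going the other way, int("305A", 16) turns a hex string into the integer that chr() expects.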
Character Info The following details are shown for each codepoint, where available:
  • Codepoint: This is a hexadecimal number unique to a certain character, sort of its "serial number."
  • Description: Not every character has an English description (not to be confused with a definition), but most do.
  • Shift-JIS: A character's Shift-JIS hex code, if it has one, to use when encoding characters in that charset.
  • Definitions: The English definition(s) for a Kanji, or "CJK" (Chinese, Japanese, and Korean), character.
  • Code Block: The Unicode "code blocks," logical groups of related characters, are shown for all characters.
  • UTF-8 Hex Code: This is calculated dynamically from the hex Codepoint itself and is shown for all characters.
    If you're interested, here's the algorithm to convert a character number to UTF8:
    if(while(!X_X!) & 2){z{0.O}(@-<--)not}do{!#$1-99/xx/ff}i, x{y}[2](j)!1:3
    ...where "z" is the number. And yes, that was a joke. It's far too complicated to show here. Made ya look...
  • Pronunciations: How to pronounce a character is available for the following languages:
    • Japanese "Kun" in Romaji form
    • Japanese "On" in Romaji form
    • Hanyu Pinyin, the standard romanization of Mandarin Chinese syllables
    • Cantonese, a different Chinese language from the Mandarin above
    • Vietnamese
    • Korean
  • Combining Forms: A "combining form" is a character that modifies another character. For example, the "character" ที่ is actually three different characters all smashed together in the same place. I can't show a combining form by itself, really, but you can see how it works in this progression:
    ท ที ท่ ที่
    That's Thai, by the way.
    Those three characters are each stored by themselves in the database. If you enter a UTF-8-compatible string containing combined characters like that one into the "UTF8 String" form field, it will return each actual character in it separately. This shouldn't hinder you from finding them, or from converting UTF-8 strings or even long hex-code conglomerations into their correct combined form.

    Note how the middle character, enigmatically called "Sara Ii", doesn't show up as a combining form character even though it is. That's because the Unicode database is imperfect. It's hard enough as it is finding all of them, and I was bound to miss some despite using about three different finding algorithms to identify them. So that explains that. It's not my fault. Really.
    Another fun example. You can make a big tilde with two combiners by themselves, like this: ︢︣ Isn't that ugly? I don't know why they did that. Anyway, combiners don't always require another character to latch onto to become visible, but if there's a compatible one available, it will. I say "compatible" because combiners are picky about what they combine with. You can't mix our Sara Ii with a Latin "A" because they're from different alphabets, or something. I don't know the full logic behind doing that, or the full rules per RFC or whatever. I just went through every block with combiners in it, picked out the simplest character that they could all combine with, and defined a lookup table that returns what character to combine a combiner with. It's technical. Let's just say it was a lot of work fully supporting combiners in Unisearch, and it's still not perfect because the data I have to work with isn't perfect. If someone knows of a better data file to get combiners from than their "UnicodeData.txt" file, let me know.

    Anywhere you see a character displayed in red, it means you're actually viewing a combining-form character combined with some other character that it's compatible with. If I didn't do this, some combining characters would simply be invisible.
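Joking aside, the codepoint-to-UTF-8 conversion mentioned under "UTF-8 Hex Code" in the list above is not all that long. What follows is a minimal sketch of the standard UTF-8 bit layout, written in Python; it is not Unisearch's actual PHP code, and the function name "utf8_encode" is made up for this illustration.

```python
def utf8_encode(cp):
    """Encode one codepoint (an int) as UTF-8 bytes, by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable codepoint")
    if cp < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x0E17).hex().upper())    # E0B897
print(utf8_encode(0x2F894).hex().upper())   # F0AFA294
```

The number of bytes depends only on how large the codepoint is; each continuation byte carries six bits of it.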
Statistics The official counts of everything, from the last 'parser' run:
  • 67,446 Total Characters
  • 31,915 Character descriptions
  • 11,399 ShiftJIS-to-Unicode mappings
  • 20,925 English definitions
  • 65,675 Various pronunciations
  • 196 Character grouping blocks
  • 978 Combining Characters
Search Form Fields
These are the fields across the top of each page. Each searches for a specific thing. If you point at one of them with the mouse, it'll pop up a little box filled with what that field searches for, in every language I could find. I'm sure a lot of them aren't all that accurate, and a lot was probably lost in translation (I mean, you try finding the Vietnamese word for "hexadecimal"), but it should give everyone at least a little idea of what the field's for. I did it 'cause I thought it'd help, okay?? Stop hitting me!

Anyway... here's what each field's for. Bon appétit.
Character Block This dropdown field at the top of the page will take you directly to one of the Unicode code blocks.
Shift-JIS (hex) If you want to see all details of a character whose Shift-JIS code you know, put the code here in hexadecimal and submit it. At most, one character can be searched for with this field.
Unicode (hex) If you want to see all details of a character whose Unicode Codepoint you know, put the code here in hexadecimal and submit it. At most, one character can be searched for with this field.
Unicode (dec) This is for the decimal (base-ten) number that's equivalent to a character's hex Codepoint. At most, one character can be searched for with this field.
UTF8 Hex Code You can enter a string of hexadecimal digits in this field, up to 255 characters long, to display every character that the hex represents. It must be valid UTF-8-encoded hex, in big-endian order (lead byte first). For example, to retrieve our "combining form" example above, enter the following into the field. These are the three UTF-8 codes for its three characters, which in this case is a trio of trios of octets; that is, three pairs of hex digits (each pair being an "octet") per character:
E0B897 E0B8B5 E0B988
The spaces are in there just for clarity. It'd work just as well with it all smashed together like "E0B897E0B8B5E0B988". Four-octet UTF-8 codes are also supported; F0AFA294 for instance, which is the Codepoint 0x2F894. (Putting "0x" before a number means it's in hexadecimal; this field is even smart enough to remove those and other non-hexadecimal characters first.)
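The clean-up-then-decode behavior described for this field can be sketched roughly as follows, in Python. This is an assumption about the general approach, not Unisearch's actual code, and "decode_utf8_hex" is a hypothetical name.

```python
import re

def decode_utf8_hex(text):
    """Turn a string of UTF-8 hex digits into a list of codepoints."""
    text = re.sub(r"\b0[xX]", "", text)          # drop any "0x" prefixes
    digits = re.sub(r"[^0-9A-Fa-f]", "", text)   # drop spaces and other non-hex junk
    raw = bytes.fromhex(digits)                  # raises ValueError on an odd digit count
    # decode() raises UnicodeDecodeError if the bytes are not valid UTF-8
    return [ord(ch) for ch in raw.decode("utf-8")]

for cp in decode_utf8_hex("E0B897 E0B8B5 E0B988"):
    print(f"0x{cp:04X}")
```

Letting the decoder raise an error on malformed input is one simple way to enforce the "must be valid UTF-8" rule.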
UTF8 String This is the Field of Power. You can cut'n'paste up to 255 characters -- ANY characters -- into this field and it'll tell you everything it knows about every one of them, in order... assuming your browser understands the UTF-8 encoding scheme, which every browser these days should understand. If this field doesn't work for you, it might be time to upgrade from Windows '95 to something a tad more... modern. Like Gentoo Linux.
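To get a feel for what "every one of them, in order" means for a combined character, here is a small sketch using Python's standard unicodedata module. That module reads whatever Unicode database ships with your Python, which is not the same data Unisearch parsed, and the function name "inspect_string" is made up for this illustration.

```python
import unicodedata

def inspect_string(text):
    """List each codepoint in text with its name and combining-mark details."""
    rows = []
    for ch in text:
        rows.append({
            "codepoint": f"0x{ord(ch):04X}",
            "name": unicodedata.name(ch, "(no name)"),
            "category": unicodedata.category(ch),          # "Mn" means nonspacing mark
            "combining_class": unicodedata.combining(ch),  # 0 means no canonical class
        })
    return rows

# The combined Thai example, written as its three separate codepoints.
for row in inspect_string("\u0E17\u0E35\u0E48"):
    print(row)
```

Note that the general category and the combining class are separate columns in UnicodeData.txt. Sara Ii is listed with category "Mn" but a combining class of 0, so a finder that checks only the combining class could miss it; whether that is what happened in Unisearch is only a guess.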
Description This is the full, official English name of a character. Every CJK (Chinese, Japanese, Korean) Kanji character has the description "Kanji Character" or no description at all; those it's best to search for by their Definition instead. Believe it or not, the character "a" has the official name "Latin Small Letter A". 〷 is "Ideographic Telegraph Line Feed Separator Symbol", but I think "Dos Equis" would have been a better name, don't you? ƛ is called "Latin Small Letter Lambda With Stroke" (and I bet it looks like a little square to you if you're using Internet Explorer™). And so on. (If I'd have had to come up with names for all these things, I'd have been "With Stroke" too, maybe even "⚩" (also known as "Horizontal Male With Stroke"), but my font doesn't support that one so who knows.)

Any combination of words can be searched for in here. If you want all the Hiragana characters with an "i" in their Description, put "hiragana letter i" in this box (there are only fifteen of them). Partial words are matched too, and it's not case-sensitive; "hIrA LetT I" works just as well.
Eng. Definition Searching in this field yields all the characters which have a definition containing the words (or, again, partial words) you enter. A definition in English. Sorry, world; that's what the Unicode people wrote all the definitions in.
Pronunciation Remember the big list of languages Unisearcher collects pronunciations for? You can search for them here. If you want the character the Koreans pronounce "Phwung", put that into this field and then pick, from among the eleven results it returns, whichever one of them you were looking for. Note! This field does not do partial words! Entering "Phwun" will only return one character, and it won't return "Phwung" ones. Oh; to search for, say, a Cantonese character pronounced as "gei3" (and I don't even really know what that means but God bless the internet [update: it's "gei falling-sound", which is just as funny]), enter "gei-3" into this field, not "gei3". I put dashes in all of them, for both Cantonese and Pinyin. And it's not case-sensitive either. Note: this particular query also returns characters that are pronounced "gei3" in Pinyin. For all I know a Cantonese "gei3" doesn't even sound like a Pinyin "gei3". But all the pronunciations are in the same database field, so what can ya do.
Code Page Navigation
The three or four big yellow rows you see at the top of most pages are where you can quickly select a code block to zip to if you know its first two or three hex digits. The character whose codepoint is 0x2F894 is in the "upper" codepoint block because it has five hex digits in it. The codepoint 0x3D4C, having only four digits, actually starts with "0", making it 0x03D4C. So try to think of them all as having five digits, left-padded with zeroes, no matter how many digits are in them. It helps. And here's how to handle each of the number's first three digits; we'll use "0x03D4C" as our example codepoint:

Top-Level This will be "0", as in 0x03D4C. It's the first digit of the five-digit codepoint.
Next-Level This will be "3", as in 0x03D4C. It's the second digit of five.
Last-Level This will be "D", as in 0x03D4C. It's the third digit of five.
Block Nav This provides an easy way to jump forward or backward one single 256-character page, to show another code block (or just the previous/next page in the current codeblock if it's a big one).
So if you want to get to the page where 0x03D4C lives, you'd click the "0" in top-level, the "3" in next-level, and the "D" in last-level. Number 4C (out of FF) on the page you'll land on is in the 5th row, column 12 (or as we programmers like to call it, "The 'C' Column" [since 12 is "C" in hex, go ahead and laugh, but you'll notice that EVERY character's codepoint in that column ends with "C" -- magic??? ]). The final two digits, 4 and C, aren't needed for this navigation section; every possible combination of the last two digits is represented on the page, from 00 to FF (0 to 255 in decimal).
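The digit-splitting described above can be sketched in a few lines of Python. This only illustrates the arithmetic; the function name "nav_levels" and the dictionary keys are made up, and none of it is taken from Unisearch's source.

```python
def nav_levels(codepoint):
    """Split a codepoint into the three nav digits plus its spot on the page."""
    if not 0 <= codepoint <= 0xFFFFF:
        raise ValueError("outside the 0x00000-0xFFFFF range")
    digits = f"{codepoint:05X}"        # left-pad to five hex digits
    return {
        "top": digits[0],              # first digit of five
        "next": digits[1],             # second digit of five
        "last": digits[2],             # third digit of five
        "row": int(digits[3], 16),     # zero-based row, so 4 is the 5th row
        "column": digits[4],           # column label, 0 through F
    }

print(nav_levels(0x3D4C))
```

For 0x3D4C this gives top "0", next "3", last "D", row 4 (the 5th row, counting from zero) and column "C".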

And, if you hadn't noticed, the three levels for the current page/subpage will have magenta backgrounds. I hope everyone likes the color scheme; if not, play with your own "style.css" file once you download it. If you ever need to go to the page currently highlighted in all three levels, say if you're viewing a single glyph but the nav is still set thanks to 'page' and 'subpage,' just click on whichever "Last-level" digit is already highlighted. Handy.
Advanced Topics
Direct URL Links If, for some probably-metaphysical reason, you want to bookmark a certain page but don't see it in the URL bar of your browser because the page resulted from a form being posted or something, you can manually recreate a URL to that page by setting the right "get variables" in the URL. Continuing our overworked 3D4C example, the following URL will get you to the full 256-character chart page containing it. "page" is the top- and next-levels combined (top-level defaults to 0 if "page" is just a single digit), and "subpage" is the single last-level digit.
index.php?page=3&subpage=D&hilite=3D4C
"page" and "subpage" both have to be set for it to show you a full chart page. "hilite" just makes one character on that page bright yellow, to highlight it and catch your eye.
index.php?page=03&subpage=D&glyph=03D4C
That URL takes you to the details page for a single character (what I euphemistically decided to call a "glyph" for reasons not entirely geeky in nature. Just remember, a "glyph" is a hexadecimal codepoint and a single character; it's all synonymous). "page" and "subpage" are actually optional in it; they just tell it how to set the "current" three levels in the navigation. "glyph" if present in a URL triggers showing just the details of a single character.

And for your convenience, there's yet another tool available. Any time you use the search form to find a set of glyphs, a "URL to the results below" link will appear to the left of the "Glyph Entry Key." That link can be bookmarked to quickly retrieve the results of any search you ever perform. You're welcome.

Notice how the top two "levels" are combined into the "page" variable. If page is "5", it's effectively the same as telling it "page=05". You can go all the way up to "page=FF&subpage=F" before you'll make it cry. Actually it won't cry, it'll just reset the page to "0" for you, quietly, so it doesn't embarrass you in front of your geek friends.
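If you'd rather generate these URLs than type them, here is a minimal Python sketch that builds both kinds from a codepoint, following the "page", "subpage", "hilite" and "glyph" variables described above. The function names are hypothetical, and the relative "index.php" path is simply copied from the examples.

```python
def chart_url(codepoint):
    """URL for the 256-character chart page, with one character highlighted."""
    digits = f"{codepoint:05X}"                  # five hex digits, zero-padded
    page, subpage = digits[:2], digits[2]        # top+next levels, then last level
    return f"index.php?page={page}&subpage={subpage}&hilite={codepoint:X}"

def glyph_url(codepoint):
    """URL for the details page of a single character."""
    digits = f"{codepoint:05X}"
    return f"index.php?page={digits[:2]}&subpage={digits[2]}&glyph={digits}"

print(chart_url(0x3D4C))   # index.php?page=03&subpage=D&hilite=3D4C
print(glyph_url(0x3D4C))   # index.php?page=03&subpage=D&glyph=03D4C
```

The sketch always writes "page" with two digits; per the note above, "page=3" and "page=03" mean the same thing.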
Serving Suggestions So you have this tool now; what do you do with it? What can you do with it? Well, "all sorts of things."
  • Make Web Pages - like this one. My style is to just use the HTML entity for a character, like, instead of actually embedding the binary string for that combined Thai character up there, I used the three codepoints in decimal like this: &#3607;&#3637;&#3656;. You can see that in the view-source for this page. But I also could have just embedded that binary like this... ที่ ...and it works just as well, and takes up only nine bytes instead of twenty-one like the entities version, because of something I put into the HTML header for every page:
    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
    
    Forcing browsers to use UTF-8 for a page is the best way to make sure everyone can view anything you put on it... as long as it's UTF-8. The HTML entities, like &#3607;, are converted into the right UTF-8 code by the browser and displayed. I think. It's a bit fanciful and magical, something to do with wood nymphs or something. Anyway, UniSearcher can help you find the various codes you need to display those characters as long as you know how to search for them. Usually it's just as easy as pasting or typing a string into the "UTF8 String" field, hitting enter, copying it all to the clipboard, finding the format you wanted to display those characters with, and cutting'n'pasting it into whatever you're writing. Your "operating system" should know how to paste things from its clipboard into web page forms, converting it all itself to match that "UTF8" charset declaration on the page, so that should work for anyone... I hope. And if you're the kind of programmer who hates trying to find and use editors that understand multibyte strings, like I am, using the HTML entities is the way to go. That's why the lang.phps file uses entities instead of actual MB strings even though it takes up so much more space. I'm not lazy; I'm just addicted to pico. I imagine "vi" users will sympathize with me; I have no idea if vi supports MB or not. But the allchars*.php pages do use actual UTF8 characters instead of HTML entities; that's why they're only 450K instead of about a meg. UniSearcher makes it easy to use whichever method you want, is the point.

  • Define Text in Programs - The UniSearcher makes it a breeze for programmers using just about any programming or scripting language to create MB strings inside programs. It'll give you "\x20" for a space, for instance, and that can be slapped in as-is to create an MB string, at least in PHP and Perl, as long as it goes inside a double-quoted string. The URL form is also provided; "%20" for a space, for instance, or "%E0%B8%97%E0%B8%B5%E0%B9%88" for our combined Thai. Big- and little-endian versions of the UTF8 hex code are also shown. All you have to do is put something on its clipboard and all secrets are revealed. If I missed any desired output format, just let me know.

  • Amaze Your Friends - if they're geeks. It's not just a tool for geeks, though. I can just about translate Chinese web pages now by just cutting and pasting from the page into Unisearcher and then reading the definitions for all the characters one after the other. It's made going through my daily spam dosage a whole new experience. Spam in foreign languages is fun now. Educational. Imagine the looks on your friends' faces when you tell them it's an email for tiny paper lanterns hand-made by children in Chengdu! Imagine their awe! Now stop daydreaming and get to work.

  • Save the World - Hey, it could happen. Language and religion are the only two things standing in the way of uniting the world. Well... and money and power, but I won't go there.
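The output formats mentioned in the first two suggestions above (decimal HTML entities, "\x" escapes, and the URL form) are easy to reproduce by hand. Here is a minimal Python sketch; "output_formats" and its dictionary keys are names made up for this illustration, not anything from Unisearch itself.

```python
def output_formats(text):
    """Build a few common spellings of a string from its codepoints and UTF-8 bytes."""
    data = text.encode("utf-8")
    return {
        "entities": "".join(f"&#{ord(ch)};" for ch in text),   # decimal HTML entities
        "escaped": "".join(f"\\x{b:02X}" for b in data),       # \xNN escapes, one per byte
        "url": "".join(f"%{b:02X}" for b in data),             # percent-encoded bytes
        "utf8_bytes": len(data),                               # size of the raw UTF-8
    }

formats = output_formats("\u0E17\u0E35\u0E48")   # the combined Thai example
print(formats["entities"])   # &#3607;&#3637;&#3656;
print(formats["url"])        # %E0%B8%97%E0%B8%B5%E0%B9%88
print(formats["utf8_bytes"], "bytes of UTF-8 versus", len(formats["entities"]), "as entities")
```

Unlike a typical URL-encoding helper, this sketch percent-encodes every byte, including plain ASCII letters, so that a space comes out as "%20" just as in the example above.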
Other Available Things
All-Characters Display Pages The first of these pages will display for you every single defined Unicode codepoint. It can easily bring your browser to its knees, so be patient while it's loading. I recommend disabling "smooth scrolling" before loading it. There are only 84,984 characters defined right now, though, so it's not like it'll be putting over a million characters on your screen. The second page is the same thing, but organized by the code blocks rather than codepoints.
Source Code Full Distribution (now! v1.1!) as a gzipped tar. The entire thing is only a 43K archive. It uncompresses to about 212K. There's just not that much to it... until you run the parser. The data files it downloads take up about 14 meg (not recommended for 56K modems) but you can always delete them afterwards, and the database space required to hold it all (including indices) is also about 14 meg with MySQL 5.x and native UTF8. Not bad for such a useful app, I'd say.

Individual scripts:
Artist Contact You can always reach me at [email protected] if you have any problems with the Unisearcher or just want to opinionize about it.
Unavailable Things
Nothing That I know of.
:')
Well, except maybe support for all possible codepages and character sets. I only know of one thing that can do that: PHP's mbstring functions. I avoided using those because not everyone has them available in their PHP installation. Unisearcher has all the support it needs built into it; no weird modules or extensions are necessary. It should run on any out-of-the-box PHP install.