|Codepoint Range||The Unisearcher covers every Unicode codepoint between 0x00 and 0xFFFFF. That's over one million possible characters, but at present only 81,250 of them are actually defined in the official database. If you're wondering, a "codepoint" is the number that gives a character's position in the Unicode database, and each one stands for a single character or glyph, everything from A to ず ("zu" in Japanese; see, you're learning already). Codepoints are usually given in hexadecimal, like "0x305A" for our friend ず or "0x41" for "A", but sometimes (as in HTML entities) in decimal; ず is 12378 in decimal, 305A in hex. Maybe I should've put a glossary at the start of this help file...|
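To make the hex/decimal business concrete, here is a minimal sketch in Python (not Unisearcher's own PHP; the function name `describe` is made up for illustration). It shows the same codepoint written the three ways mentioned above.

```python
def describe(ch: str) -> str:
    """Show one character's codepoint in hex, in decimal, and as an HTML entity."""
    cp = ord(ch)  # the codepoint as a plain integer
    return f"{ch} = 0x{cp:X} (hex) = {cp} (decimal) = &#{cp}; (HTML entity)"


print(describe("A"))
print(describe("ず"))

# Going the other way: from a hex codepoint back to the character.
print(chr(0x305A))
```

The hex and decimal forms are the same number in two bases, so `int("305A", 16)` and `12378` are interchangeable.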
The following characteristics are shown for each codepoint, where available:|
The official counts of everything, from the last 'parser' run:|
|Search Form Fields|
These are the fields across the top of each page. Each searches for a specific thing.
If you point at one of them with the mouse, it'll pop up a little box filled with
what that field searches for, in every language I could find. I'm sure a lot of them
aren't all that accurate, a lot was probably lost in translation (I mean, you
try finding the Vietnamese word for "hexadecimal"), but it should give everyone at
least a little idea of what the field's for. I did it 'cause I thought it'd help,
okay?? Stop hitting me!|
Anyway... here's what each field's for. Bon appétit.
|Character Block||This dropdown field at the top of the page will take you directly to one of the Unicode code blocks.|
|Shift-JIS (hex)||If you want to see all details of a character whose Shift-JIS code you know, put the code here in hexadecimal and submit it. At most, one character can be searched for with this field.|
|Unicode (hex)||If you want to see all details of a character whose Unicode Codepoint you know, put the code here in hexadecimal and submit it. At most, one character can be searched for with this field.|
|Unicode (dec)||This is for the decimal (base-ten) number that's equivalent to a character's hex Codepoint. At most, one character can be searched for with this field.|
|UTF8 Hex Code||
You can enter a hexadecimal string of up to 255 characters in this field to display every
character that the hex represents. It must be valid UTF-8-encoded hex, with each character's
octets in their normal order (lead octet first). For example, to retrieve our "combining form" example above, enter this into
the field (the three UTF-8 codes for the three characters in it, which in this case is a
trio of trios of octets... that is, three pairs of hex digits ["octets"] per character):
E0B897 E0B8B5 E0B988
The spaces are in there just for clarity. It'd work just as well with it all smashed together like "E0B897E0B8B5E0B988". Four-octet UTF-8 codes are also supported; F0AFA294 for instance, which is the Codepoint 0x2F894. (Putting "0x" before a number means it's in hexadecimal; this field is even smart enough to remove those and other non-hexadecimal characters first.)
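For the curious, here is roughly what that decoding amounts to, as a Python sketch rather than the actual PHP. The function name `decode_utf8_hex` and the exact clean-up order are assumptions for illustration; the 255-character limit is left out.

```python
import re


def decode_utf8_hex(text: str) -> str:
    """Turn a string of UTF-8 hex octets into the characters it encodes."""
    # Strip "0x" prefixes first (the "0" would otherwise survive as a hex digit),
    # then throw away anything else that is not a hex digit, spaces included.
    cleaned = re.sub(r"0[xX]", "", text)
    cleaned = re.sub(r"[^0-9A-Fa-f]", "", cleaned)
    # bytes.fromhex raises ValueError on an odd number of digits, and
    # decode raises UnicodeDecodeError if the octets are not valid UTF-8.
    return bytes.fromhex(cleaned).decode("utf-8")


for ch in decode_utf8_hex("E0B897 E0B8B5 E0B988"):
    print(f"0x{ord(ch):X}")

print(f"0x{ord(decode_utf8_hex('0xF0AFA294')):X}")
```

Each three-octet group above comes out as one codepoint, and the four-octet example comes out as 0x2F894.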
|UTF8 String||This is the Field of Power. You can cut'n'paste up to 255 characters -- ANY characters -- into this field and it'll tell you everything it knows about every one of them, in order... assuming your browser understands the UTF-8 encoding scheme, which every browser these days should understand. If this field doesn't work for you, it might be time to upgrade from Windows '95 to something a tad more... modern. Like Gentoo Linux.|
This is the full, official English name of a character. Every CJK (Chinese, Japanese, Korean) Kanji
character has the description "Kanji Character" or no description at all; those are best searched for by
their Definition instead. Believe it or not, the character
"a" has the official name "Latin Small Letter A". 〷 is "Ideographic Telegraph Line Feed Separator Symbol",
but I think "Dos Equis" would have been a better name, don't you? ƛ is called "Latin Small Letter Lambda
With Stroke" (and I bet it looks like a little square to you if you're using Internet Explorer™). And so on.
(If I'd had to come up with names for all these things, I'd have been "With Stroke" too, maybe even "⚩"
(also known as "Horizontal Male With Stroke Sign"), but my font doesn't support that one so who knows.)
Any combination of words can be searched for in here. If you want all the Hiragana characters with an "i" in their Description, put "hiragana letter i" in this box (there are only fifteen of them). Partial words are matched too, and it's not case-sensitive; "hIrA LetT I" works just as well.
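One matching scheme that fits the examples above, sketched in Python: treat each word you type as a fragment, and require the fragments to show up in the name in that order, ignoring case. This is an assumption about the behaviour, not a copy of the real query, and `description_matches` and `SAMPLE_NAMES` are made-up names with a handful of real Unicode names as test data.

```python
import re


def description_matches(query: str, name: str) -> bool:
    """True if every query fragment appears in the name, in order, ignoring case."""
    # ASSUMPTION: fragments must appear in the order typed; the real search may differ.
    pattern = ".*".join(re.escape(term) for term in query.split())
    return re.search(pattern, name, flags=re.IGNORECASE) is not None


# A few official names, just enough to show what gets through.
SAMPLE_NAMES = [
    "HIRAGANA LETTER A",
    "HIRAGANA LETTER I",
    "HIRAGANA LETTER KA",
    "HIRAGANA LETTER KI",
    "HIRAGANA ITERATION MARK",
    "KATAKANA LETTER I",
]

for name in SAMPLE_NAMES:
    if description_matches("hIrA LetT I", name):
        print(name)
```

Note that a plain "contains these fragments anywhere" test would not do, because "hiragana" itself contains an "i".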
|Eng. Definition||Searching in this field yields all the characters which have a definition containing the words (or, again, partial words) you enter. A definition in English. Sorry, world; that's what the Unicode people wrote all the definitions in.|
|Pronunciation||Remember the big list of languages Unisearcher collects pronunciations for? You can search for them here. If you want the character the Koreans pronounce "Phwung", put that into this field and then pick, from among the eleven results it returns, whichever one of them you were looking for. Note! This field does not do partial words! Entering "Phwun" will only return one character, and it won't return "Phwung" ones. Oh; to search for, say, a Cantonese character pronounced as "gei3" (and I don't even really know what that means but God bless the internet [update: it's "gei falling-sound", which is just as funny]), enter "gei-3" into this field, not "gei3". I put dashes in all of them, for both Cantonese and Pinyin. And it's not case-sensitive either. Note: this particular query also returns characters that are pronounced "gei3" in Pinyin. For all I know a Cantonese "gei3" doesn't even sound like a Pinyin "gei3". But all the pronunciations are in the same database field, so what can ya do.|
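Two small Python sketches of the rules just described. Both are illustrations under assumptions, not the real code: `add_tone_dash` is a hypothetical helper for turning "gei3" into the dashed form the field expects, and `reading_matches` assumes the pronunciations in a record are stored as whitespace-separated words.

```python
import re


def add_tone_dash(reading: str) -> str:
    """Lowercase a reading and put a dash before a trailing tone digit."""
    reading = reading.strip().lower()
    # Only add the dash when the digit directly follows a letter.
    return re.sub(r"(?<=[a-z])(\d)$", r"-\1", reading)


def reading_matches(query: str, stored_readings: str) -> bool:
    """Whole-word, case-insensitive match against a field of readings."""
    # ASSUMPTION: readings in the field are separated by whitespace.
    return query.strip().lower() in stored_readings.lower().split()


print(add_tone_dash("gei3"))
print(reading_matches("phwung", "PHWUNG"))
print(reading_matches("Phwun", "PHWUNG"))
```

The last call is the "no partial words" rule in action: "Phwun" is not the whole word "Phwung", so it does not match.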
|Code Page Navigation|
The three or four big yellow rows you see at the top of most pages are where you can quickly select a code block
to zip to if you know its first two or three hex digits. The character whose codepoint is 0x2F894 is in the
"upper" codepoint block because it has five hex digits in it. The codepoint 0x3D4C, having only four
digits, actually starts with "0", making it 0x03D4C. So try to think of them all as having
five digits, left-padded with zeroes, no matter how many digits are in them. It helps. And here's how to
handle each of the number's first three digits; we'll use "0x03D4C" as our example codepoint:|
|Top-Level||This will be "0", as in 0x03D4C. It's the first digit of the five-digit codepoint.|
|Next-Level||This will be "3", as in 0x03D4C. It's the second digit of five.|
|Last-Level||This will be "D", as in 0x03D4C. It's the third digit of five.|
|Block Nav||This provides an easy way to jump forward or backward one single 256-character page, to show another code block (or just the previous/next page in the current codeblock if it's a big one).|
So if you want to get to the page where 0x03D4C lives, you'd click the "0" in top-level, the "3" in
next-level, and the "D" in last-level. Number 4C (out of FF) on the page you'll land on is in the 5th row,
13th column (or as we programmers like to call it, "The 'C' Column" [since C is 12 in hex and we start counting at zero, go ahead and laugh,
but you'll notice that EVERY character's codepoint in that column ends with "C" -- magic??? ]).
The final two digits, 4 and C, aren't needed for this navigation section; every possible combination of
the last two digits is represented on the page, from 00 to FF (0 to 255 in decimal).|
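The digit-picking above is easy to do mechanically. Here is a small Python sketch; the function name `locate` and the dictionary keys are made up for illustration, and it assumes each page is a 16-by-16 grid with one row per fourth digit and one column per fifth digit.

```python
def locate(codepoint: int) -> dict:
    """Split a codepoint into the three navigation levels plus its spot on the page."""
    if not 0 <= codepoint <= 0xFFFFF:
        raise ValueError("codepoint must be between 0x00000 and 0xFFFFF")
    digits = f"{codepoint:05X}"  # left-pad with zeroes to five hex digits
    return {
        "top": digits[0],                # first digit of five
        "next": digits[1],               # second digit of five
        "last": digits[2],               # third digit of five
        "row": int(digits[3], 16) + 1,   # counted from 1, so digit "4" is the 5th row
        "column": digits[4],             # column label, i.e. the final hex digit
    }


print(locate(0x3D4C))
print(locate(0x2F894))
```

For 0x3D4C this gives top "0", next "3", last "D", the 5th row, and the "C" column, matching the walkthrough.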
And, if you hadn't noticed, the three levels for the current page/subpage will have magenta backgrounds. I hope everyone likes the color scheme; if not, play with your own "style.css" file once you download it. If you ever need to go to the page currently highlighted in all three levels, say if you're viewing a single glyph but the nav is still set thanks to 'page' and 'subpage,' just click on whichever "Last-level" digit is already highlighted. Handy.
|Direct URL Links||
If, for some probably-metaphysical reason, you want to bookmark a certain page but don't see it
in the URL bar of your browser because the page resulted from a form being posted or something, you can
manually recreate a URL to that page by setting the right "get variables" in the URL. Continuing our
overworked 3D4C example, the following URL will get you to the full 256-character chart page containing
it. "page" is the top- and next-levels combined (top-level defaults to 0 if "page" is just a single digit),
and "subpage" is the single last-level digit.
index.php?page=3&subpage=D&hilite=3D4C
"page" and "subpage" have to both be set for it to show you a full chart page. "hilite" just makes one of the characters on the page bright yellow, to highlight it and catch your eye.
index.php?page=03&subpage=D&glyph=03D4C
That URL takes you to the details page for a single character (what I euphemistically decided to call a "glyph" for reasons not entirely geeky in nature. Just remember, a "glyph" is a hexadecimal codepoint and a single character; it's all synonymous). "page" and "subpage" are actually optional in it; they just tell it how to set the "current" three levels in the navigation. "glyph", if present in a URL, triggers showing just the details of a single character.
And for your convenience, there's yet another tool available. Any time you use the search form to find a set of glyphs, a "URL to the results below" link will appear to the left of the "Glyph Entry Key." That link can be bookmarked to quickly retrieve the results of any search you ever perform. You're welcome.
Notice how the top two "levels" are combined into the "page" variable. If page is "5", it's effectively the same as telling it "page=05". You can go all the way up to "page=FF&subpage=F" before you'll make it cry. Actually it won't cry, it'll just reset the page to "0" for you, quietly, so it doesn't embarrass you in front of your geek friends.
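If you'd rather build these URLs with a script than by hand, here is a minimal Python sketch. The helper names `chart_url` and `glyph_url` are made up; the "page", "subpage", "hilite" and "glyph" variables are the ones described above, and the two-digit "page" form is used throughout since "3" and "03" mean the same thing.

```python
from urllib.parse import urlencode


def chart_url(codepoint: int) -> str:
    """URL of the 256-character chart page containing the codepoint, highlighted."""
    digits = f"{codepoint:05X}"  # five hex digits, left-padded with zeroes
    params = {
        "page": digits[:2],      # top-level and next-level combined
        "subpage": digits[2],    # the single last-level digit
        "hilite": f"{codepoint:X}",
    }
    return "index.php?" + urlencode(params)


def glyph_url(codepoint: int) -> str:
    """URL of the details page for one character, with the navigation levels set."""
    digits = f"{codepoint:05X}"
    params = {"page": digits[:2], "subpage": digits[2], "glyph": digits}
    return "index.php?" + urlencode(params)


print(chart_url(0x3D4C))
print(glyph_url(0x3D4C))
```

Prefix the result with wherever your copy of index.php lives.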
So you have this tool now; what do you do with it? What can you do with it? Well, "all sorts of things."
|Other Available Things|
|This page will display for you every single defined Unicode codepoint. It can easily bring your browser to its knees, so be patient while it's loading. I recommend disabling "smooth scrolling" before loading it. There are only 81,250 characters defined right now, though, so it's not like it'll be putting over a million characters on your screen. This page is the same thing, but organized by the code blocks rather than codepoints.|
Full Distribution (now! v1.1!) as a gzipped tar. The entire thing is only a
43K archive. It uncompresses to about 212K. There's just not that much to it... until you run the parser. The data files
it downloads take up about 14 meg (not recommended for 56K modems) but you can always delete them afterwards, and the
database space required to hold it all (including indices) is also about 14 meg with MySQL 5.x and native UTF8. Not bad
for such a useful app, I'd say.
|Artist Contact||You can always reach me at firstname.lastname@example.org if you have any problems with the Unisearcher or just want to opinionize about it.|
That I know of. |
:') Well, except maybe support for all possible codepages and character sets. I only know of one thing that can do that: PHP's mbstring functions. I avoided using those because not everyone has them available in their PHP installation. Unisearcher has all the support it needs built into it; no weird modules or extensions are necessary. It should run on any out-of-the-box PHP install.