Banish missing glyphs with Unifont (shkspr.mobi)
105 points by edent on April 4, 2019 | 100 comments


Indeed, GNU Unifont looks awful to many people, and even if we only talk about bitmap fonts, Unifont is not the best-looking one. However, it's a font designed for practicability, not aesthetics. It's one of the few fonts that cover the entire Basic Multilingual Plane and are actively maintained. It's a perfect choice as a fallback font.

And in fact, the font actually looks pretty good under a CLI console. I once tried a Linux kernel patch that embedded the 10 MiB+ Unifont into the native Linux VT console, and I got a perfect multilingual console (without using a 3rd-party framebuffer-based or KMS-based console), including all the CJK characters. It's also a good choice for a dot-matrix LCD/LED display.


> I got a perfect multilingual console

No, you didn't. If it was using Unifont, there are many scripts and languages that it was unable to display correctly because they require "shaping" that Unifont cannot support; it lacks the necessary glyphs, let alone the OpenType (or equivalent) tables to control the rendering.

Displaying one nominal glyph per Unicode character does not result in readable text for many languages. Just because it seems adequate for Western alphabets or for CJK doesn't mean it is sufficient for true multilingual support.


> Displaying one nominal glyph per Unicode character does not result in readable text for many languages

Completely unreadable, even to people with a basic understanding of the inner workings of Unicode and some practice with the font, or just not matching the norms of correctness?

Decipherable but clearly not correct would mark the sweet spot between being just as bad as placeholders and being so good that it undermines efforts for having a proper font for a given language.


Unreadable.

Arabic has a strong difference between letters of the same word, and different words. There is no exact equivalent in English...

b&u&t&i&t&m&i&g&h&t&b&e&s&o&m&e&t&h&i&n&g&l&i&k&e&t&h&i&s


If that example is representative, I'd definitely put it in the "readable" bin. Leaps and bounds from what you'd want to read, but infinitely more useful than a sequence of identical placeholders. Even just being able to tell whether two strings might be identical or not would be an improvement over blank placeholders.


There's also complex text layout (vowel-sign reordering) for Indic scripts, where some (not all) vowel signs require repositioning from phonetic order to visual order. Adding just that aspect to your example (without other complications like positioning of matras or conjunct consonant glyphs) makes it something closer to:

b&t&u&t&i&m&g&h&t&i&b&s&e&m&o&t&e&h&n&g&i&l&k&i&t&e&h&s&i


The degree of badness varies among languages/scripts. In some cases, yes, most readers would be able to decipher the intended content. In others, no, it really would be unreadable, except to those geeky enough to also comfortably "decipher" things like \x74\x68\x69\x73. Would you be happy routinely reading your terminal text in that form?


> it really would be unreadable, except to those geeky enough to also comfortably "decipher" things like \x74\x68\x69\x73

Can you give an example?


Try something in Devanagari script such as "क्षत्रिय" and see how it looks with Unifont. The results may vary, depending on whether some kind of shaping engine (like Uniscribe, Pango, HarfBuzz, etc.) gets involved in the process, but I'm confident it will be far from correct, and most readers would struggle to recognise it.


Try online here: https://fontlibrary.org/en/font/gnu-unifont

It does look different (likely incorrect), but is definitely better than blank boxes and question marks.


> definitely better than blank boxes and question marks

I disagree, I think it's better for it to be apparent when text is wrong or unreadable, even to people not literate in the language.

I wondered if the problem had to do with Unifont being monospaced but I pasted "क्षत्रिय" in on the Inconsolata page and it appears correct (i.e. the lines correspond to how they look in the font on this page).

https://fontlibrary.org/en/font/inconsolata


Inconsolata doesn't support devanagari at all, so when you try that, you're seeing whatever fallback font your browser chooses.

(IMO, the fact that fontlibrary.org doesn't indicate this is a serious defect in its presentation.)


Good point. fontlibrary.org lists each font's language support in the sidebar but since the page emphasizes the display of the font, it would be better if they prevented such fallback.


Its idea of "language support" is also less than ideal, as it just considers the character repertoire and not whether the font can actually render the characters in a typographically correct way. Thus, it claims Unifont has "support" for various South and South-East Asian languages where exactly this issue applies: the text is simply wrong when displayed linearly using a 1:1 character/glyph mapping.


How is gibberish or incorrect better than blank boxes and question marks? At least with blank boxes and question marks it is obvious that there is a problem. Gibberish would be useless to a reader who knows the script, and might cause someone (such as a developer) that doesn't know the script to think that there is no problem.


Exactly! In the worst case, some hapless person might be using Google Translate to show a tattoo artist what he wants to be memorialized on his chest. With this font. Perhaps with the blank box he'd know something was wrong.


On the other hand, if you're taking something from Google Translate and tattooing it on your chest, whether or not GNU Unifont is 100% accurate in its glyph rendering is the least of your problems, surely.


I think your example can be illustrated by adding spaces between the characters like so: क्षत्रिय vs क् ष त्रि य (note how the first "glyph" decomposed into two drastically different characters)


Actually, your "decomposed" version still has clusters that are different from the individual characters making them up. There's क्, which is क plus ् (OK, that one's easy); more significantly, you've left त्रि as a single unit whereas it is made up of त + ् + र + ि.

So "spelled-out", क्षत्रिय is made up of क ् ष त ् र ि य . I don't know if Unifont tries to render the combining characters as zero-width overstrikes, or what, but much more than that is needed for a readable result.
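For anyone who wants to check the spelled-out sequence themselves, here is a minimal Python sketch using only the standard `unicodedata` module. It lists the code points in logical (storage) order, which is the order a font with a 1:1 character/glyph mapping would draw them in. The helper name `describe` is made up for illustration, and the sketch says nothing about what any particular shaping engine does with the sequence.

```python
import unicodedata

# The word under discussion, written with escapes so the code points are explicit.
WORD = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"  # क्षत्रिय


def describe(text):
    """Return (code point, general category, character name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


for code_point, category, name in describe(WORD):
    # Categories starting with "M" are combining marks: they have to be
    # attached to, or reordered around, a neighbouring base letter.
    print(code_point, category, name)
```

Three of the eight code points are marks rather than letters: the two viramas and the vowel sign I. The vowel sign is stored after RA but is drawn to the left of the consonant cluster in correctly shaped text, which is exactly the step a one-glyph-per-character font cannot perform.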


That looks plenty readable. Ugly as hell, sure, but not any worse than 3N6|_15H \/\/R1773N 45 1337 5P34|<, for example. It's a vast improvement over [0915][094D][0937][0924][094D][0930][093F][092F].


The problem is that the vowel ` ि` is in a misleading position in the decomposed form. It implies a spelling of kshatrayi instead of kshatriya, unless you happen to know that Unicode requires it to be there and the font engine is supposed to move it to its correct place in the ligature.

(This comment is a good equivalent of the problem in English. https://news.ycombinator.com/item?id=19575942 )


Wow thanks for spelling that out. That seems like a challenging script!


To add to this, here are a few examples of languages that need complex text layout, in no particular order: Thai, Khmer, Arabic, Hindi, Hebrew, Burmese, Tibetan, Punjabi, Sanskrit, and many, many others.

Many millions (billions?) of people won't be able to use your terminal in their language.


Also ancient languages, like Egyptian hieroglyphics.


I use GNU Unifont, with a user-space virtual terminal. Some of GNU Unifont has looked far from "pretty good". I overlay K16-1990, 9x15, and Ubuntu Monospace on top of it for better-appearing characters.

* http://jdebp.uk./Softwares/nosh/user-vt-screenshots.html#Uni...

* http://jdebp.uk./Softwares/nosh/guide/terminal-resources.htm...

Your multilingual console is really far from "perfect". Aside from the aesthetics of GNU Unifont, you do not have good multilingual input, which you have completely left out. You don't have an ISO/IEC 9995 common secondary group. You do not have CJKV input methods.

You still have problems with modifier keys becoming stuck "on" because of VT switching races.

* https://unix.stackexchange.com/a/494685/5132

And you also still have ISO/IEC 2022 8-bit character set switching for messing things up. (-:

* https://unix.stackexchange.com/a/506140/5132


Because English is already well suited to it, is it incredibly ethnocentric of me to believe that instead of trying to shoehorn the complexities and eccentricities of existing written text into the digital medium we should have instead adapted the writing systems to the medium so they'd be simpler to both represent and input? Because every time I look at what it takes to properly support unicode I'm pretty damn thankful ASCII is more than sufficient to represent my language.


The point of our machines is to help humans, not the other way around.

I was saddened when Spain changed Spanish sorting to make life easier for PC programmers in the 90s. Within only a couple of years that effort became pointless, but was now encoded in law.


> The point of our machines is to help humans, not the other way around.

While this is true, significant amounts of friction are caused by being unwilling to adapt to the medium's nature. What would writing look like if we had tried to encode sound waves directly onto paper? Instead we adapted our communication to the medium.


With such a pliable device I consider every adaptation a person must make to be a deficiency. Perhaps a not-yet-addressable deficiency, but a negative nonetheless.


What did they change?


I guess this refers to this:

”Spanish treated (until 1994) "CH" and "LL" as single letters, giving an ordering of cinco, credo, chispa and lomo, luz, llama. This is not true any more since in 1994 the RAE adopted the more conventional usage, and now LL is collated between LK and LM, and CH between CG and CI. The six characters with diacritics Á, É, Í, Ó, Ú, Ü are treated as the original letters A, E, I, O, U, for example: radio, ráfaga, rana, rápido, rastrillo. The only Spanish-specific collating question is Ñ (eñe) as a different letter collated after N.”

(Copied from https://en.wikipedia.org/wiki/Alphabetical_order#Language-sp...)
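To make the quoted rule concrete, here is a rough Python sketch of both orderings. It assumes the traditional alphabet is the 26 Latin letters plus CH, LL and Ñ, folds accented vowels onto their base letters, and ignores tie-breaking between accented and unaccented spellings. The names `fold` and `collation_key` are hypothetical, and this is not how any real collation library implements it.

```python
import unicodedata

# Traditional (pre-1994) alphabet: CH and LL count as single letters.
TRADITIONAL = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
# Modern alphabet: the same letters without the two digraphs.
MODERN = [letter for letter in TRADITIONAL if len(letter) == 1]


def fold(word):
    """Lowercase and strip diacritics, except that ñ stays a distinct letter."""
    out = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch == "ñ":
            out.append(ch)
        else:
            # NFD puts the base letter first, followed by any combining marks.
            out.append(unicodedata.normalize("NFD", ch)[0])
    return "".join(out)


def collation_key(word, alphabet):
    """Turn a word into a list of letter ranks under the given alphabet."""
    rank = {letter: i for i, letter in enumerate(alphabet)}
    text = fold(word)
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in rank:
            # A digraph that this alphabet treats as one letter.
            key.append(rank[pair])
            i += 2
        else:
            # Unknown characters sort after every letter.
            key.append(rank.get(text[i], len(alphabet)))
            i += 1
    return key


words = ["chispa", "cinco", "credo", "llama", "lomo", "luz"]
print(sorted(words, key=lambda w: collation_key(w, TRADITIONAL)))
print(sorted(words, key=lambda w: collation_key(w, MODERN)))
```

The only difference between the two orderings is whether the digraphs appear in the alphabet list, which is why the 1994 change was cheap for software that compared one character at a time.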


Indeed, and the justification at the time was because contemporary software got it wrong. I remember it well.


It's not ethnocentric, it's programmer-centric. If you consider supporting Unicode hard, imagine the effort required to get everyone to learn a new writing system and transliterate all existing works into it. At least Unicode support only requires a handful of specialists to create the necessary tooling, rather than some global uprooting of existing traditions.


I think you overestimate the cognitive effort required. Just look at how quickly emoji were adopted into everyday communication. Maybe you can't transliterate Shakespeare into emoji very effectively, but some meaning is lost just reading it in modern English anyway and we could reserve the more complex systems for this kind of preservation.


Emoji work because they're optional. If people were forced to communicate exclusively in emoji to use some particular software, most potential users aren't going to bother with the effort and will choose a competitor instead.


I think I've expressed my point unclearly: emoji are an adaptation of our communication patterns to the medium. Real-time textual communication could not express things we were used to expressing in other real-time media via tone of voice or body language, so we adapted how we communicate those things to the medium. Indeed, writing itself is an adaptation of language to a new medium.


I suggested some falsehoods that programmers believe about telephones a while back (https://news.ycombinator.com/item?id=19215636). There is a case for falsehoods that programmers believe about ASCII, starting with:

* It's 8-bit. (It's 7-bit.)

* It is sufficient for English. (Ð ð Þ þ)

* It is sufficient for Modern English. (zoölogy coöperate £ née resumé)

* It at least has everything that one could type on a typewriter. (½ ¼ ¢ , and that's just starting with some contemporary 20th century U.S. typewriters from IBM such as the Selectric)

* It is sufficient for the U.S., at least. (Not for the U.S. Library of Congress in 1969 it wasn't. It needed 174 characters for catalogue cards, per https://link.springer.com/article/10.1007%2FBF02404378 .)

* It was used by Multics. (Multics used a modified version, padded out with leading zero bits and replacing DEL with PAD. See http://web.mit.edu/saltzer/www/publications/multics/bc-2-01.... .)

* It does not have broken vertical bar. (It was in the 1968 standard at 124, a broken bar because the PL/I language people did not want their unbroken vertical bar to be in the then "national variant" range. https://groups.google.com/d/msg/comp.infosystems.www.authori... http://jkorpela.fi/latin1/ascii-hist.html#7C )

* It does have broken vertical bar. (It was replaced in the 1977 standard by an unbroken vertical bar at 124. Then some subsequent 8-bit character sets in the 1980s re-introduced a broken vertical bar in a second position, with much ensuing hilarity.)

* It did not have arrows. (In the original standard ↑ and ← were where ^ and _ now are, for some consequences of which see https://retrocomputing.stackexchange.com/a/9201/1932 .)

* It is the same as ECMA-6. (ECMA-6:1991 allows national-language variants in several code positions. The standard ASCII characters in those positions are merely the "International Reference Version" and one possibility from amongst several in ECMA-6. See https://www.ecma-international.org/publications/standards/Ec... .)
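A couple of the items above are easy to check mechanically. The following is a small Python sketch, with a made-up helper name `fits_in_ascii`, that tests whether a string stays inside the 7-bit range; the sample words are taken from the list.

```python
# Samples from the list above, plus one plain spelling for contrast.
SAMPLES = ["zoology", "zoölogy", "coöperate", "£", "née", "½", "þ"]


def fits_in_ascii(text):
    """True when every character is in the 7-bit range U+0000..U+007F."""
    return all(ord(ch) < 0x80 for ch in text)


for sample in SAMPLES:
    print(sample, fits_in_ascii(sample))
```

Only the undecorated spelling passes; everything else needs at least one code point at or above 128.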


So in actual fact English was adapted to the medium, albeit minimally.


The end result of adapting our language to the medium is to lose in expressibility and gain in portability. Forcing technology to catch up is a more coherent option in the long run.


That's the exact same trade off we made for writing in the first place. Yet here we are still conversing with text despite the technology to record and send each other video having been widely available for a decade or more.


That's specious reasoning.

If you mean that body language is the main or majority channel for communication, well, how much body language matters to communication is contested and differs based on context anyway (see: https://www.psychologytoday.com/us/blog/beyond-words/201109/...).

If you mean that there is a loss of expressibility going from the spoken word to the written word: is there? You would have to prove that that loss is as significant as the symbolic/representational destruction you are advocating. Many of the imperfect synonyms we have developed in English are used to signal tone and intent, things that are usually signalled in body language. So, just because we are using the written word does not mean we cannot express those things. If you have something in a larger space, you can represent it in a smaller space using a longer and more complex series of glyphs.

> Yet here we are still conversing with text despite the technology to record and send each other video having been widely available for a decade or more.

Yes, for the very reason that it erases the extra data. You not being able to see me means that you have to go on my words alone, rather than making extra-contextual judgements about me as a whole based on my physical appearance. That seems only to cement my point, though.


Several of these points can be boiled down to "When people say ASCII, they really mean ISO 8859-1 or CP 1252".


>including all the CJK characters.

Is there a version of GNU Unifont for Japanese kanji? I could not find any. A few Japanese kanji share the same code-point as Chinese characters despite not being the same.

https://i.imgur.com/DYqJTvN.png

https://en.wikipedia.org/wiki/Han_unification


It looks like your "perfect multilingual console" here rubbed people the wrong way—the Arabic ligatures are a straightforward counterexample. You got something that's helpful for working in a multilingual environment if you don't intend to read text in other languages but only to refer to it or recognize what it is. So I'd say that in a few cases that don't involve reading and writing or typography, this is potentially more useful, but it's important to be able to draw the distinction!

For example, it's helpful for being able to say "the Arabic name for the Arabic language is spelled ة-ي-ب-ر-ع-ل-ا, alif-lam-ayn-ray-baa-yaa-taa marbuta". But if you want to have an Arabic speaker read it as a word, it ought to be written "العربية" (RTL with ligatures). Your terminal environment is good for the former but not the latter.


> Indeed, GNU Unifont looks awful to many people, and even we only talk about bitmap fonts, Unifont is not the best-looking one.

And this is subjective. I have used Unifont in xterms for years now, and it's very readable and unambiguous to me.

So it's good to see it getting some positive notice.

> It's also a good choice for a dot-matrix LCD/LED display.

Only a high-resolution one.


Can you share the patch please? That sounds so useful!


I tend to think this is a bad idea. Most of the characters that can't already be displayed using fonts included in modern operating systems are likely to be characters from "obscure" scripts that may need complex shaping for correct/readable display, which won't work with Unifont because of its limited 1:1 character/glyph mapping, and/or they're characters (whether from historic scripts or newly-encoded emoji) in higher Unicode planes, meaning you'll need to provide not just the basic (plane-0) Unifont but also the extra resource for higher planes. It adds up to a lot of bloat, for the sake of inadequate rendering of characters that the user probably can't make sense of anyway.

The one case where there might be a worthwhile benefit would be for recent emoji additions. But that would be better addressed by a more limited effort to provide an up-to-date emoji-only font, not a resource that attempts (in vain!) to cover the whole of Unicode.


Not to mention that many emoji are emoji sequences. So again, not a 1:1.


https://www.google.com/get/noto/

Google's Noto font family is of far higher quality than Unifont and serves the same purpose. Better, too, since there's a lot of details in various languages and scripts that go far beyond "put a glyph here". I've heard only good things about Noto in that respect.


It's also 1.1GB!

And hasn't been updated since 2017 https://www.google.com/get/noto/updates/

But, other than that...


Almost as if one typeface covering every writing system on the planet isn't something you can do in just a few MB. Even a single comprehensive font for Chinese or Japanese ends up being _at least_ 12 MB if it wants to do things right.

Multiply that by some 120 writing systems, and you can see why things might get a little big if you want "one font for every language".

(which you can't, the spec doesn't allow for more than 65k glyphs, including virtual compound glyphs, per font. So I'm genuinely confused about the author's claim that they managed to put 137k glyphs in a 65k glyph space. You need multiple fonts tied together using a font family name)


https://github.com/googlei18n/noto-fonts How about this link? Seeing as the 4 Noto packages get pretty frequent updates in the Arch repositories, I think it’s being updated frequently. When compiled and stored on a transparent compression disk, it also doesn’t take a lot of space. Unable to check right now, but I would imagine like 200M total on NTFS or btrfs.


> And hasn't been updated since 2017

This is only a useful data point if there have been significant changes to writing systems released since then.


Unicode 11 was released in 2018 and Unicode 12 in 2019. A "universal" font that doesn't keep up with Unicode isn't very universal.


It might not have had a release since 2017, but the github repository shows development as recent as 20 days ago.


I see a November 2018 release announcement.

https://github.com/googlei18n/noto-fonts/blob/master/NEWS.md


That’s for all 70+ families.


I am not an http expert, but could we put that file on one location on the web so that I do not download it for every fricking site I open? Or even better, could not the browser bundle it?


Why not just have it installed on your system, and disallow websites to dictate which fonts to use?

Edit to make the comment (hopefully) more valuable: I mean, unless certain choice of fonts is somehow intrinsic to the content or purpose a given website serves, the browser, and by extension the user, probably know better which fonts are preferable for them to comfortably read the textual content.


Web fonts have their place. For example they're extremely commonly used to efficiently serve icons that seamlessly blend into text across many websites.

In my previous app I used web fonts to be able to render Hearthstone cards using the correct font the game uses.


If browsers are to bundle fonts, they should bundle fonts that actually support correct rendering of the various scripts in Unicode - e.g. the Noto fonts. Unifont simply cannot render many languages correctly.

Better still, though, for operating systems to bundle adequate fonts. And indeed, OS-installed font coverage has come a long way in recent years.


"I am not arguing that you should serve an extra couple of MB on every webpage (although modern sites are so bloated it probably doesn't matter...) - but perhaps you should bundle it with your webview apps." The author already mentions the idea that it should be bundled with webview apps which also refers to browsers. A browser plugin could be a good starting point until the patched browsers are released.


It would also make sense to include it only as a fallback font, downloaded only if a Unicode point is actually used in the page which the primary font didn't cover, which I think is how '@font-face' would act if he didn't specify Unifont as the primary font in 'font-family'.


If I'm reading the license and the exception correctly, bundling in a non-GPL browser would violate the font's license.


As long as the browser is GPL-compatible it's okay. Chrome isn't, and Chromium and Firefox might not be either due to binary blobs for DRM stuff.


I've stuck it on GitLab - https://gitlab.com/edent/unifont/ - but yes, I agree with you. Every browser, OS, and Electron App could bundle it.


> Every b̶r̶o̶w̶s̶e̶r̶, OS, a̶n̶d̶ E̶l̶e̶c̶t̶r̶o̶n̶ A̶p̶p̶ could bundle it.

Every browser and Electron app could bundle it, but this seems like the "wrong" place to implement this. I'd suggest that they could bundle it temporarily until OSes catch up.


Or someone could add it with an extension as a current middle ground.


There's also https://www.google.com/get/noto/ which is what Ubuntu uses for emoji and is proportional and colored.


I do use noto for this, but absolutely hate the use of color in glyphs such as the emojis.


The black and white Noto emoji have not been updated in years, sadly, but you can still prepend it before Noto Color Emoji. Alternatively, prefer Symbola, which should be more up to date.


Couldn't you just strip the tables for the color map and the additional glyphs for the other color layers? Or does Noto use bitmaps or SVGs for the colored glyphs?


It's a scalable bitmap glyph as far as my understanding goes. "It's scalable even though it's a bitmap font." is what I remember reading last year, but I cannot find it for you, sorry.


From looking at the Github repository, it seems like the source images are SVG and the font might use either SVG or bitmaps, but not COLR/CPAL which would degrade nicely to black and white.


> If your app or website uses a Unicode character which isn't supported on a device, the user will usually see � - a replacement character. If you include Unifont, they'll see the correct character.

Neat idea. I think the transition to UTF-8 is practically done, I'm not seeing � anymore these days (used to be extremely common a while back).


This line is largely wrong.

Most systems, when called to display a character which they're unable to render, will render a placeholder. This is most often a dotted box of some sort, roughly the size of a large character. In some systems the dotted box (assuming it's large enough for them to be readable) contains the Unicode codepoint number that the system couldn't render. In a few the box contains some representative symbol that gives you a hint what sort of thing is missing, e.g. maybe it's a Han glyph to suggest that you should look for a Chinese font.

I haven't seen any (they may exist of course) where they render U+FFFD the replacement character �.

The most common reason to see U+FFFD is the reason it was created: something was encoded or decoded in a way that produces gibberish, and the best option in that case is to replace the minimum chunk of gibberish with U+FFFD and then keep trying. On the Web you'd often see pages which claimed to be UTF-8 but were actually ISO-8859-1 or Windows codepage 1252. Neither of those is UTF-8, but they share the most common Latin characters. These days most browsers will auto-detect this goof, and besides, most web pages really are UTF-8; when browsers were less good at guessing and more pages were wrong, you'd see it more often.
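The decoding failure described here is easy to reproduce. This is a minimal Python sketch using only the built-in codecs; the sample word and the variable names are illustrative, and real browsers apply more elaborate detection than this.

```python
# Bytes of an ISO-8859-1 (Latin-1) page containing the word "café".
latin1_bytes = "café".encode("latin-1")

# A strict UTF-8 decoder rejects the lone 0xE9 byte outright...
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...while a lenient one substitutes U+FFFD for the bad byte and keeps going.
lenient = latin1_bytes.decode("utf-8", errors="replace")

# The opposite mistake (UTF-8 bytes read as Windows-1252) produces mojibake
# instead of U+FFFD, because every byte maps to some character.
mojibake = "\u201C".encode("utf-8").decode("cp1252")

print(strict_ok, repr(lenient), mojibake)
```

In neither case is a font involved: the replacement character comes from the decoder, which is why adding glyph coverage does not make it go away.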


Yup, I screwed up with that title! See the discussion at https://twitter.com/FakeUnicode/status/1113774985116434433


Eliminating the replacement character is a function of glyph coverage in fonts, not of UTF-8 use.


UTF-8 helped with characters like “ ”. Back before it was enabled, all these sites pasted from MS Word didn't work well.


Question: how do you specify in CSS that you want font X, except if a specific glyph is not present, then you want font Y?


when you specify your font-stack, like:

    font-family: Helvetica, Arial, Sans-Serif;
it will fall through. so if a user has Helvetica installed, but Helvetica doesn't provide glyph X, then it will check whether Arial has glyph X.

so if you want all non-ascii glyphs to fall through to an alternative font, you need to serve a version of your primary font that only includes ascii.
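The per-glyph fall-through can be modelled in a few lines. This Python sketch uses invented font names and invented coverage sets (a real implementation would read coverage from each font's cmap table), so treat it as an illustration of the lookup order rather than of any browser's actual code.

```python
# Hypothetical coverage tables: font name -> set of supported code points.
FONT_STACK = [
    ("PrimaryAsciiOnly", set(range(0x20, 0x7F))),
    ("FallbackWide", set(range(0x20, 0x3000))),
]


def resolve_fonts(text, stack):
    """For each character, pick the first font in the stack that covers it."""
    result = []
    for ch in text:
        chosen = next(
            (name for name, coverage in stack if ord(ch) in coverage),
            None,  # nothing in the stack has a glyph: a placeholder box would be drawn
        )
        result.append((ch, chosen))
    return result


for ch, font in resolve_fonts("né漢", FONT_STACK):
    print(ch, font)
```

Because the primary font here covers only ASCII, every other character falls through to the next entry, which is the subsetting trick described above.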


Ok, I wasn't aware that stacking fonts works at the glyph level.


See also: https://github.com/rolandwalker/unicode-fonts (a different direction, starts from removing Unifont)


I can understand the utility of a basic fallback font that just works but if you are already building a large web client, why not just include the fonts you actually want to use? That way you get the look you want at whatever resolution without some horrible bitmap font popping up now and then.


Because you don't always know what kind of content you'll be displaying. Especially if you allow user-generated content.

Here's an example I found a few years ago - https://shkspr.mobi/blog/2015/11/premature-subsetting-of-web... - an English language website never expected their authors to use the é (e-acute) character. So they removed it from their webfont.


One of the issues when you "just include the fonts you actually want to use" is that many fonts which get picked that way for "the look you want" only have Latin glyphs[0]. Then you get an Asian or Arabic user, and depending on their setup your application may just fail to display their content entirely.

[0] or even just a subset if they got "optimised" by anglo-saxon developers who assume ü or ß are useless


I think it's aimed at e.g. browser or OS builders who can fall back to this font for single missing characters if need be.


I'd love to know how they apparently fit more glyphs in a font than can fit in a font. The OpenType specification only allows up to a USHORT worth of glyph IDs, and 65535 IDs is nowhere near enough to index even just all 137993 currently assigned code points.
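The arithmetic behind that objection, as a tiny Python sketch. It assumes one glyph per code point (a lower bound, since shaping needs extra glyphs) and takes the code point count from the comment above without verifying it.

```python
import math

MAX_GLYPHS_PER_FONT = 0xFFFF    # largest value a 16-bit unsigned glyph count can hold
ASSIGNED_CODE_POINTS = 137_993  # figure quoted in the comment, not independently checked

# Minimum number of font files needed if every code point gets exactly one glyph.
fonts_needed = math.ceil(ASSIGNED_CODE_POINTS / MAX_GLYPHS_PER_FONT)
print(fonts_needed)
```

Under those assumptions a single file cannot hold everything; at least three are needed.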


They don't. There are multiple truetype fonts involved:

* The Standard Unifont TTF Download: unifont-12.0.01.ttf (12 Mbytes)

* Glyphs above the Unicode Basic Multilingual Plane: unifont_upper-12.0.01.ttf (1 Mbyte)

* Unicode ConScript Unicode Registry (CSUR) PUA Glyphs: unifont_csur-12.0.01.ttf (1 Mbyte)

(from http://unifoundry.com/unifont/index.html). And note that unifont_upper only seems to cover plane 1 and plane 14 stuff; they haven't attempted to tackle the plane 2 CJK repertoire.


That is really not what this article claims, though. It claims "It contains /every/ Unicode glyph in one single file!".


So it does - though later under "Use on the web", it turns out that it's primarily talking about the BMP, and higher-plane characters will require another file. (It doesn't seem to notice the lack of support for plane-2 CJK at all.)

The article title "Banish the � with Unifont" is also misleading, actually. � is U+FFFD, the REPLACEMENT CHARACTER that typically indicates an encoding error or binary garbage; it's not the same thing as the missing-glyph symbol (often a simple box, though it may vary) that generally appears when font support for a valid character is lacking.


What's the point of rendering characters that can't be understood by the reader? I can't read Chinese, Cyrillic, Native American scripts, etc., so it's not worth the 12MB in apps and definitely not in websites.


If it renders in a facsimile of the correct script, this allows you to:

- see that it's foreign text, rather than pictographs or weird English text.

- see when the shapes of other languages are used to make pictures (¯\_(ツ)_/¯)

- if it's a foreign language, you can guess which language

- you can guess at the complexity of what was written, for example by looking for repeated substrings.


As a resident of a western country, I don't fully understand how to implement or test Unicode compatibility on my browser, website, terminal, etc. Is there a test suite of characters I can use to validate etc?


The question doesn't really make sense, because that's not what unicode is.

Unicode encodes what's necessary for printing books since about 1900 (and a bit more, but that's a fair one-sentence summary). What you want to validate isn't that you'd be able to print every kind of book printed since 1900. You're only interested in some of the alphabets, and you may be interested in more functionality than just printing. For example you may need sorting, or character input with the right sort of interactive appearance changes, or equality testing.

If you decide what you want to work, then googling usually finds a suitable test quickly.


Right, but if you're making a word processor or a web forum or a registration form what you want might be "Well, I don't speak languages that need complex scripts, but I'd be happy to support other people's scripts if it's easy"


The easiest test for "does my software handle Unicode somewhat better than dumbly" is emoji. If your users aren't already deluging you with emoji in their content in 2019, grab the emoji keyboard from your Operating System, often easy to find on most "soft keyboard overlays" such as mobile platforms. (In Windows 10 for the last year or so there are two keyboard shortcuts that work everywhere: Windows Key+. and Windows Key+;)

Many emoji these days are quite complex Unicode sequences, with a number of them in the so-called "Astral Plane", meaning they need more than 16 bits to accurately display (proving you aren't treating UTF-8 or UTF-16 as if it was UCS-2). As sequences they include a lot of fun non-visible codepoints ("characters") such as the Zero-Width Joiner, and are very susceptible to breaking if accidentally dropped, reordered, or otherwise spliced (possibly proving you aren't doing bad string math or manipulation at the codepoint level rather than the glyph/sequence/combined-character level).

[ETA: Useful sequences to test are any that support the skin-tone and gender modifiers. On Windows, the various "cat occupation" emoji are also interesting sequences such as ninja cat and astro cat. Other platforms have similar unique "fun" sequences that are noticeable at a glance when right/wrong.]

It's not entirely true that if you support emoji well you support any Unicode user's script well, but if you support emoji well you probably don't do anything particularly stupid to make other Unicode users unhappy.
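As a quick way to see why emoji make a good smoke test, here is a Python sketch that measures one ZWJ sequence three ways. The sequence (WOMAN, ZERO WIDTH JOINER, ROCKET) is a standard one; the helper name `lengths` is just for illustration.

```python
# WOMAN + ZERO WIDTH JOINER + ROCKET: fonts that support the sequence draw a
# single "woman astronaut" picture for these three code points.
ASTRONAUT = "\U0001F469\u200D\U0001F680"


def lengths(text):
    """Count the same string in code points, UTF-16 code units and UTF-8 bytes."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


print(lengths(ASTRONAUT))

# Naive truncation by code point cuts the sequence in half and leaves a
# dangling joiner behind.
truncated = ASTRONAUT[:2]
print([f"U+{ord(ch):04X}" for ch in truncated])
```

One visible picture is three code points, five UTF-16 code units and eleven UTF-8 bytes, so any code that assumes one of those units equals "a character" will show its mistake immediately.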


Indeed. And if you're doing other things your test is a different one. It depends.

BTW, if you want to discuss which languages' scripts are complex… office, office, office, office.


I’ve found Markus Kuhn’s FAQ to be quite helpful [1], which links to things like test inputs [2].

[1] https://www.cl.cam.ac.uk/~mgk25/unicode.html

[2] https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt


interesting that the author used images of the font rather than actually embedding the font into this webpage



