The Challenge of Special Characters for Integrations
At Dispatch, the integrations we build range from relatively simple (moving data from field A in system A to field B in system B) to extremely sophisticated, involving data transformations, enrichment, and advanced logic. At the most sophisticated end of the spectrum, integrations can effectively be considered co-processors that can perform complex calculations on huge amounts of data to support custom business processes.
Integration specialists can build incredibly advanced integrations using code or Integration Platforms as a Service (iPaaS) based on fundamental integration patterns and practices. Special Character Handling is one of the most important fundamental concepts to understand and master, especially when building integrations that could be used in a global context.
So what are “special characters,” and what do you need to know about them?
A Little Bit of History:
Characters are the letters, numbers, and symbols used in languages and math and comprise the essential elements of what we call “data.” In the early days of computing, there were no standards to describe these elements, and many manufacturers produced teleprinters with custom logic that mapped a key press to a specific letter, number, or symbol.
When computing began to be interoperable, it became clear that we needed a standard to ensure all manufacturers described these characters the same way. In the mid-1960s, a standard called the American Standard Code for Information Interchange (ASCII) was created, which, for the first time, ensured that every computing device would have a core ability to understand the input or output from other computing devices.
ASCII is still very much in use today. It is a 7-bit encoding that maps 128 characters to the numeric codes 0 through 127. The first 32 codes aren’t human-readable characters at all; they are control codes that define computing actions or concepts such as “null,” “start of text,” and “escape.” The rest cover the English alphabet, numbers, punctuation, and basic mathematical symbols, and the adoption of the ASCII standard revolutionized computing at the time.
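For the curious, here is a short, purely illustrative Python snippet (not part of any Dispatch integration) that shows the ASCII mapping in action:

```python
# Every ASCII character maps to a number between 0 and 127.
print(ord("A"))          # 65  - the ASCII code for uppercase A
print(chr(65))           # 'A' - and back again
print(ord("\x1b"))       # 27  - ESC, one of the 32 non-printable control codes
print("café".isascii())  # False - é falls outside the 128-character ASCII range
```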
While ASCII is great, it’s obvious that there’s a problem. ASCII only maps common English letters and numbers. But what about other languages? Greek, Arabic, Cyrillic, Japanese, Korean, Chinese, and dozens of other languages are completely ignored. And how do you map emojis 😀🙄🤓?
As computing advanced, it was clear that ASCII was an insufficient standard. To address this issue, ASCII was extended with an eighth bit, making room for an additional 128 characters. This helped, but different vendors and regions filled those extra 128 slots differently (the various “extended ASCII” code pages), and even 256 characters clearly wasn’t enough to encode all the characters used by all languages.
The introduction of Unicode in the late 1980s was the initiative that began codifying the symbols of all of the world’s languages. Unicode defines 8-, 16-, and 32-bit encoding forms (UTF-8, UTF-16, and UTF-32) and today covers more than 160 scripts, both modern and historical, as well as, of course, emojis. The Unicode code space has room for more than a million characters, each identified by a numbered slot called a Code Point, of which roughly 150,000 have been assigned so far.
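To make code points and encoding forms concrete, here is a small illustrative Python snippet (the character choices are ours) showing one character, its code point, and how many bytes it occupies in each encoding:

```python
# One character, one code point, several encodings.
ch = "é"                          # U+00E9, LATIN SMALL LETTER E WITH ACUTE
print(hex(ord(ch)))               # 0xe9 - the Unicode code point
print(ch.encode("utf-8"))         # b'\xc3\xa9'          - 2 bytes in UTF-8
print(ch.encode("utf-16-be"))     # b'\x00\xe9'          - 2 bytes in UTF-16
print(ch.encode("utf-32-be"))     # b'\x00\x00\x00\xe9'  - 4 bytes in UTF-32

emoji = "😀"                       # U+1F600, well beyond the old 8-bit range
print(hex(ord(emoji)))            # 0x1f600
print(emoji.encode("utf-8"))      # b'\xf0\x9f\x98\x80'  - 4 bytes in UTF-8
```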
Character Handling is All About Mapping
The Unicode standard has helped create a universal basis for translating data in one system into data in other systems. But unfortunately, this has not eliminated the need for integrators to worry about special characters.
Despite the progress in establishing a standard, many (perhaps most) enterprise systems still do not encode data in a truly universal manner. Different systems support different encodings and different subsets of the Unicode character set. Even among English-only systems, some still encode data in plain old ASCII, while others handle only a slice of Unicode, such as the Latin-1 Supplement, Latin Extended-A, or Latin Extended-B blocks. This issue gets far more complex when systems are configured to handle multiple languages and use different mappings for non-English characters.
If you try to move data from a system that uses one encoding into a system that assumes a different one, you will get unexpected results and often garbled output. You have undoubtedly seen this output in emails, text messages, and on the internet: “You’ve” becomes “Youâ€™ve,” “Français” becomes “Fran軋is,” and “文字化け” becomes “��絖�����.”
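You can reproduce this kind of garbling in a couple of lines of Python. The sketch below writes text as UTF-8 and then reads it back assuming Windows-1252, which is one common way this happens in practice:

```python
# Reproducing mojibake: write UTF-8, read it back with the wrong encoding,
# and the apostrophe turns into three junk characters.
original = "You’ve"                    # the apostrophe is U+2019, not ASCII
raw_bytes = original.encode("utf-8")   # b'You\xe2\x80\x99ve'
garbled = raw_bytes.decode("cp1252")   # the receiver wrongly assumes Windows-1252
print(garbled)                         # -> Youâ€™ve
```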
There’s even a term for this mis-mapping of characters: mojibake, a Japanese word that roughly translates to “character transformation.”
Integrations must be designed with an understanding of the character encodings used by the upstream and downstream systems, and, when those encodings are not 100% identical, they must translate the output of one system into an acceptable input for the other, often with the help of mapping tables.
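When the mismatch is purely one of byte encoding (both systems can represent the characters involved), the translation can be a simple transcode. The hypothetical helper below sketches that idea; the character-level mapping tables described in the next section handle the harder case, where the downstream system cannot represent a character at all.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Decode bytes using the upstream system's encoding and re-encode them
    for the downstream system. Characters the target cannot represent are
    replaced rather than silently mangled."""
    text = data.decode(source_encoding)
    return text.encode(target_encoding, errors="replace")

# Latin-1 bytes from an upstream export, re-encoded as UTF-8 for the target API.
upstream = "Français".encode("latin-1")          # b'Fran\xe7ais'
print(transcode(upstream, "latin-1", "utf-8"))   # b'Fran\xc3\xa7ais'
```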
Common Mappings in HR Systems:
In our experience, many companies in North America still use legacy systems that encode data using basic character sets. For example, we often see insurance and benefits providers use systems that cannot interpret any characters beyond the original 128 in ASCII.
When we integrate with these systems (for example, to send benefits data from an HR system to a benefits provider), we need to understand how data is encoded in the HR system and build a mapping table to replace characters that would otherwise be unrecognizable in the benefits system.
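A simple first step is identifying the characters that would not survive the trip. The helper below is a hypothetical sketch (the function name and sample values are ours) of how such a check might look:

```python
def find_unsupported(value: str) -> list[str]:
    """Return the characters in a field that an ASCII-only downstream
    system would not recognize."""
    return [ch for ch in value if ord(ch) > 127]

print(find_unsupported("Renée O’Brien"))   # ['é', '’']
print(find_unsupported("John Smith"))      # []
```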
Modern HR cloud applications serving Europe and the Americas often use Latin Extended character sets to ensure names, addresses, and other information are captured and rendered correctly for common European languages. In these cases, we first need to clarify which character ranges are in use (e.g., Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B…). Then, we “normalize” these characters to a canonical Unicode form encoded as UTF-8. Finally, during the transfer of data from these systems to a downstream application, we convert the UTF-8 characters to plain English (ASCII) equivalents.
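The normalization step matters because the “same” character can arrive in more than one form. The illustrative snippet below uses Python’s standard unicodedata module to fold both forms of “é” into a single canonical representation before encoding to UTF-8:

```python
import unicodedata

# "é" can arrive as one precomposed code point or as "e" + a combining accent.
precomposed = "\u00e9"    # é (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"    # e + COMBINING ACUTE ACCENT

print(precomposed == decomposed)                   # False - different code points
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == precomposed)                   # True  - same canonical form
utf8_bytes = normalized.encode("utf-8")            # ready for the next stage
```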
In some cases, the conversion to ASCII is one-to-one (à might be mapped to a, ö might become o). In other cases, a single character maps to two or more plain English characters (ø becomes oe, ü becomes ue). The mapping attempts to preserve as much information as possible in the translation, so that while the special character is removed, the meaning is retained.
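A minimal sketch of such a mapping, assuming a Python-based transformation step (the table contents and names are ours), might look like the following: explicit multi-character entries are handled first, and anything that decomposes cleanly into a base letter plus an accent is folded down to the base letter.

```python
import unicodedata

# Hypothetical mapping table: multi-character replacements handled explicitly.
MULTI_CHAR_MAP = {"ø": "oe", "Ø": "Oe", "ü": "ue", "Ü": "Ue",
                  "ß": "ss", "æ": "ae", "Æ": "Ae"}

def to_plain_ascii(text: str) -> str:
    out = []
    for ch in text:
        if ch in MULTI_CHAR_MAP:
            out.append(MULTI_CHAR_MAP[ch])
        else:
            # NFKD splits "à" into "a" + a combining accent; keep only the
            # ASCII part (characters with no ASCII equivalent are dropped).
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(out)

print(to_plain_ascii("Søren Müller"))   # Soeren Mueller
print(to_plain_ascii("François"))       # Francois
```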
Integration logic must detect these special characters in the data as it is picked up from upstream systems and replace them with the corresponding entries from the mapping table. This typically happens in real time, before the data is sent to the downstream application.
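In practice, this is simply one more transformation applied to each record in flight. Reusing the hypothetical to_plain_ascii helper from the sketch above, the in-flight step might look something like this:

```python
# Apply the character mapping to every outgoing field just before delivery.
def prepare_for_downstream(record: dict[str, str]) -> dict[str, str]:
    return {field: to_plain_ascii(value) for field, value in record.items()}

employee = {"first_name": "Renée", "last_name": "Sørensen", "city": "Malmö"}
print(prepare_for_downstream(employee))
# {'first_name': 'Renee', 'last_name': 'Soerensen', 'city': 'Malmo'}
```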
The end result of this integration logic is ungarbled (albeit Anglicized) data in the downstream system.
When Do We Need to Stop Worrying about Special Characters?
Garbled data and incorrect character mapping have negative consequences. In the worst case, the data loses some or all of its value (what the heck does ������� mean?). Even when you can figure out what the data means, it is irritating and unacceptable for people to have their names and addresses garbled, especially in the business systems their employer uses.
While significant progress has been made in establishing a global character encoding standard, we are still far from automatic, correct character translation between all systems. Business systems of various vintages, with idiosyncratic encodings, will be in use for years to come. Even with modern cloud-based systems, companies may use encodings or Unicode subsets that make sense within the context of a single application but require mapping and translation to communicate with the outside world.
Understanding character encoding and the fundamentals of Unicode is a core requirement for integration developers. As an integration specialist, part of your job is to ensure data retains maximum value as it flows between systems, and special character handling is an essential part of that. Luckily, there are standard approaches to address this problem.