Internationalization is a common need for applications, yet it is one of the last things considered in many projects. Retrofitting an application for internationalization can be complex and cumbersome, adding API calls and creating disparities between the processing of string literals and the processing of resource references. This article introduces a technique for referencing resources using keys encoded within normal character strings, delimited by Unicode SOS/ST character combinations, allowing easy localization of an application without changing its logic or undertaking complex internationalization renovations. String resource references bring the added benefits of recursive references and the ability to mix with literal character sequences.
Internationalization Using String Resource References
If there is anything that most modern software applications share, it is the need for internationalization. Globalization makes interaction with users whose native languages are not English inevitable; indeed, with off-shoring it is not unlikely that even the application developers speak English as a second language, if at all. Application localization, and the tools for doing it, are becoming a ubiquitous need for many reasons, not the least of which is the convenience of updating application text without recompiling.
Inevitably, internationalization is one of the last things considered in many applications because developers are concentrating primarily on functionality. With feature creep combining with time constraints, the initial focus is on making the thing work. The ability to change the text is put off until after the initial launch date, if the project ever reaches it.
Once a decision has been made to internationalize, the task of localization is usually cumbersome. Going through a code base and translating all hard-coded strings to another language is simply not feasible. Even if the original developers were foresighted enough to include references to a separate class of constants, it's imprudent to allow translators access to the code base and inconvenient to swap out constant classes and recompile when localization is desired.
Techniques for Resource Lookup
Many modern languages and frameworks allow some sort of facility for grouping strings into external sets, each usually keyed to some resource ID, allowing literal text to be swapped out at will. Java, for instance, provides the abstract ResourceBundle class, implementations of which allow string literals to be grouped and accessed via resource keys. Java resource bundles even provide the capability of automatically loading appropriate bundles based upon a given locale, using resources_fr.properties for a French locale, for example.
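As a minimal sketch of keyed lookup through a resource bundle, the following parses properties-format text with PropertyResourceBundle. The in-memory strings stand in for files such as resources.properties and resources_fr.properties, and the class name BundleLookupSketch and the key greeting.label are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

/** Hypothetical helper showing keyed lookup in a resource bundle. */
final class BundleLookupSketch {

  /** Parses properties-format text into a bundle, standing in for a properties file. */
  static ResourceBundle parse(String propertiesText) {
    try {
      return new PropertyResourceBundle(new StringReader(propertiesText));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // In a deployed application, ResourceBundle.getBundle("resources", locale)
    // would normally choose between the default and the French properties file.
    ResourceBundle english = parse("greeting.label=Hello\n");
    ResourceBundle french = parse("greeting.label=Bonjour\n");
    System.out.println(english.getString("greeting.label"));
    System.out.println(french.getString("greeting.label"));
  }
}
```

The same key yields different text depending on which bundle is consulted, which is what allows literal text to be swapped out without touching application logic.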
The problem of referencing those resources using resource keys from within an application still remains, however. Rummaging through an existing application and replacing literal string constant references with the equivalent of resourceBundle.getString(resourceKey) is an arduous task. A more attractive solution would delegate the actual string resource lookup to some framework, leaving the developer simply to change string literals to resource keys.
Early versions of the Guise™ Internet application framework took just that sort of approach in a brute force manner. Every component string property, such as Button.setLabel(label), would have a parallel resource-lookup version, such as Button.setLabelResourceKey(resourceKey). The application developers needed to change string literals to resource keys, and then change the method used to the resource-lookup version. The application framework then would load the correct resources automatically if a resource key was provided, or display the given string if literal text was provided. But, this solution is less than ideal, and has several shortcomings:
- It bloats the API by doubling the number of methods for accessing resource-aware properties.
- It is complex; resolving a property involves reading two separate attributes and making a determination of which to use, and how.
- It makes communication of values difficult, because transferring a property value involves transferring two values: a possible literal value and a potential resource key.
- It is error-prone, as both literal strings and resource keys have the same type, making their use as method parameters confusing.
- It hinders interoperability with code that has no knowledge of resource keys, such as when a resource reference must be passed to an exception and later to a message notification.
The last point is especially important. Perhaps one part of the framework needs to report a validation error by throwing an exception. The framework could report a string literal by using an existing exception class. But what if the exception message were retrieved from another part of the framework that provided resource keys? The framework could either resolve the string resource immediately, or resign itself to providing custom exceptions that allow both for resource keys and for string literals, increasing complexity and propagating all the shortcomings already mentioned.
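To make the shortcomings of the parallel-method approach concrete, here is a small sketch of a component with both a literal property and a resource-key property. The class DualPropertyButton and its resolveLabel method are hypothetical and are not taken from the Guise™ API; only the method names setLabel and setLabelResourceKey echo the ones mentioned above.

```java
import java.util.Map;

/** Hypothetical component with parallel literal and resource-key properties. */
final class DualPropertyButton {
  private String label;            // literal text, or null if none was set
  private String labelResourceKey; // resource key, or null if none was set

  void setLabel(String label) {
    this.label = label;
  }

  void setLabelResourceKey(String labelResourceKey) {
    this.labelResourceKey = labelResourceKey;
  }

  /** Every consumer must inspect both properties and decide which one to use. */
  String resolveLabel(Map<String, String> resources) {
    if (labelResourceKey != null) {
      return resources.get(labelResourceKey); // a resource key takes precedence
    }
    return label; // otherwise fall back to the literal text
  }
}
```

Both setters accept a plain String, so nothing stops a caller from passing a resource key to setLabel by mistake, and any code that forwards the label elsewhere has to carry two values instead of one.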
Resource Reference Strings
A better approach would be to allow resource references to somehow be encoded within string literals, using the same framework properties both for string literal representation and for string resource reference. For URI properties, this is straightforward: A resource scheme can be used, as Mozilla does with chrome, to reference URIs stored in a resource bundle by using the form resource:resourceKey. But how can this be accomplished for strings? There would have to be some delimiter to introduce a resource reference, but whatever delimiter is used would itself have to be encoded if meant as a literal. Requiring certain characters to be encoded would, in effect, create an independent syntax for all strings used in an application, creating a complexity that would nigh outweigh its internationalization benefits, not to mention create substantial confusion and introduce interoperability obstacles.
An elegant solution lies in choosing delimiter characters that semantically denote string encoding and that are unlikely to be used normally within modern applications. Two characters that seem ideal for the task are the Start of String (U+0098) and String Terminator (U+009C) characters, referred to here as SOS and ST. These two characters are Unicode control characters governed by ISO/IEC 6429, which is also available for free as ECMA-48.
The ISO 6429 control codes are those used by the DEC VT100 video terminals. ECMA-48 describes a control string as "a string of bit combinations that may occur in the data stream as a logical entity for control purposes. A control string consists of an opening delimiter [such as START OF STRING (SOS)], a command string or a character string, and a terminating delimiter, the STRING TERMINATOR (ST)." Most importantly, "The interpretation of … the character string is not defined by this Standard, but instead requires prior agreement between the sender and the recipient of the data." (ECMA-48, 5.6). In other words, demarcating a string with a custom interpretation is a valid—indeed, the only valid—purpose for an SOS/ST pair.
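The two delimiters can be captured as constants, with a helper that wraps a resource key in an SOS/ST control string. This is only a sketch; the class name ControlStrings and the methods createReference and isReference are assumptions made for illustration.

```java
/** Hypothetical constants and helpers for SOS/ST-delimited resource references. */
final class ControlStrings {
  /** START OF STRING (SOS), U+0098. */
  static final char SOS = '\u0098';
  /** STRING TERMINATOR (ST), U+009C. */
  static final char ST = '\u009C';

  /** Wraps a resource key in an SOS/ST control string. */
  static String createReference(String resourceKey) {
    return SOS + resourceKey + ST;
  }

  /** Reports whether the entire string is a single SOS/ST control string. */
  static boolean isReference(String text) {
    return text.length() >= 2
        && text.charAt(0) == SOS
        && text.charAt(text.length() - 1) == ST
        && text.indexOf(SOS, 1) < 0; // no second opening delimiter inside
  }

  public static void main(String[] args) {
    System.out.println(isReference(createReference("hello.world.label")));
    System.out.println(isReference("Hello, World!"));
  }
}
```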
The SOS/ST pair was made for the purpose of delimiting character sequences that have application-specific meaning—there's no worry that other strings use these characters for some other purpose. It is only possible that another application has given a different meaning, not to the control characters, but to the delimited string. Luckily, it is extremely unlikely that any normal source of character strings will be generating ISO 6429 control strings, so they can safely be interpreted by the application as representing resource keys. As an added benefit, although XML 1.0 (Third Edition) discourages the use of control characters in XML files (likely because no one thought of a use for them), these characters are still valid for XML storage and will cause no parsing problems.
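The claim about XML storage can be checked with the JDK's built-in DOM parser. The sketch below, in which the class name XmlControlStringSketch and the label element are assumptions, parses an XML 1.0 document whose text content carries the delimiters as character references and returns the decoded text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical check that SOS/ST survive a round trip through an XML 1.0 parser. */
final class XmlControlStringSketch {

  /** Parses the given XML and returns the text content of its root element. */
  static String rootText(String xml) {
    try {
      Document document = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return document.getDocumentElement().getTextContent();
    } catch (Exception e) {
      throw new IllegalStateException("XML could not be parsed", e);
    }
  }

  public static void main(String[] args) {
    // &#x98; and &#x9C; are character references for SOS and ST.
    String text = rootText("<label>&#x98;hello.world.label&#x9C;</label>");
    System.out.println(text.charAt(0) == 0x98);
    System.out.println(text.charAt(text.length() - 1) == 0x9C);
  }
}
```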
An application framework, therefore, can allow string resource references to be assigned in the same way as are literal string values, with resource references denoted by the presence of an SOS/ST-delimited control string. Using Guise™, which now supports SOS/ST string resource references, as an example, an application may use a string literal or a resource reference in the same context (using Java's facility for specifying Unicode characters as code points):
button1.setLabel("Hello, World!");
button2.setLabel("\u0098hello.world.label\u009C");
Now, actual string resource resolution can be relegated to a single area of the framework and can take place at render time, transparent to the application. Furthermore, strings can be passed along from different areas of the application without worrying whether the strings are string literals or resource references. One framework component, for example, may construct and throw an exception using an error string, and an application may display that string in a label or in a dialog, all oblivious to whether the string is a literal value or a resource reference. If an application ever does need to process a string as a string literal, it can ask the framework to resolve the string; the Guise™ framework, for example, provides such functionality in GuiseSession.resolveString(String).
Encoding resource references into literal strings brings two important benefits for free. Resource resolution can be recursive, allowing strings within resource bundles to reference other string resources. In addition, one or more resource references can be embedded among literal character sequences in a string. To illustrate, consider a Java application that stores its string resources in a resource bundle serialized in a properties file, resources.properties:
hello.label=Hello
world.label=World
hello.world.label=\u0098hello.label\u009C, \u0098world.label\u009C!
The button label indicated earlier could be used:
button2.setLabel("\u0098hello.world.label\u009C");
At render time, the framework would examine the button label and find the resource key hello.world.label, which would be resolved to "\u0098hello.label\u009C, \u0098world.label\u009C!". Recursively resolving the two embedded resource references would yield the literal string "Hello, World!", which then would be used for rendering. Localizing this application would mean changing only the values of the hello.label and world.label resource keys; the string resource identified by the hello.world.label resource key would not need to be changed.
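The resolution just described can be sketched as a small recursive routine. This is not the Guise™ implementation; the class ReferenceResolver, its resolve method, and the sampleResources helper are assumptions, and a plain Map stands in for the resource bundle.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical resolver for SOS/ST-delimited resource references. */
final class ReferenceResolver {
  static final char SOS = '\u0098'; // START OF STRING
  static final char ST = '\u009C';  // STRING TERMINATOR

  private final Map<String, String> resources;

  ReferenceResolver(Map<String, String> resources) {
    this.resources = resources;
  }

  /** Replaces every embedded reference with its recursively resolved resource. */
  String resolve(String text) {
    StringBuilder result = new StringBuilder();
    int position = 0;
    while (true) {
      int start = text.indexOf(SOS, position);
      int end = start < 0 ? -1 : text.indexOf(ST, start + 1);
      if (start < 0 || end < 0) { // no further complete reference: copy the rest
        return result.append(text, position, text.length()).toString();
      }
      result.append(text, position, start); // literal text before the reference
      String key = text.substring(start + 1, end);
      String value = resources.get(key);
      if (value == null) {
        throw new IllegalArgumentException("Missing resource: " + key);
      }
      result.append(resolve(value)); // the resource may itself contain references
      position = end + 1;
    }
  }

  /** Builds the three sample resources, with the given values for the two words. */
  static Map<String, String> sampleResources(String hello, String world) {
    Map<String, String> map = new HashMap<>();
    map.put("hello.label", hello);
    map.put("world.label", world);
    map.put("hello.world.label", SOS + "hello.label" + ST + ", " + SOS + "world.label" + ST + "!");
    return map;
  }

  public static void main(String[] args) {
    ReferenceResolver resolver = new ReferenceResolver(sampleResources("Hello", "World"));
    System.out.println(resolver.resolve(SOS + "hello.world.label" + ST));
  }
}
```

The sketch does not guard against circular references, which would recurse without end; a real framework would presumably need a depth limit or a set of keys already visited.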
Resource reference strings use international standard string delimiters to allow easy localization of an application without changing its logic or undertaking complex internationalization renovations. String resolution can be relegated to a single point within a framework, allowing the application at large to use strings opaquely without regard to whether they contain string literals or resource references. Resource reference strings allow recursive references and can be interspersed with literal character sequences. Using the SOS/ST character combination minimizes the risk of delimiter conflicts, because control strings should never pass beyond an application context. The delimiters also remain valid in XML, so XML can serve as an application resource storage format.
About the Author
Copyright © 2006 Garret Wilson.