Working with Textual Data: Be Prepared for Unexpected Problems

Thursday Feb 8th 2007 by Alex Gusev

Mobile development has steadily become more than just a 'nice-to-have' skill, thanks to the ever-growing power of PDAs. Many desktop applications have been ported to run in a mobile environment. Learn about a few hidden pitfalls you might face when handling textual data.


The objectives of this article are quite straightforward. It combines a few quite different aspects of textual data processing, including conversions between different encodings and BSTR usage.

It's All about UNICODE

As you probably know, the Windows CE OS is built as a native UNICODE system, just like Windows NT, and Symbian OS does the same. This means that every character occupies exactly two bytes; in other words, these systems use the UCS-2 encoding. This isn't the only possibility: on Linux, for example, a wide character occupies 4 bytes (UCS-4). How does this relate to your application at all? Well, you have at least two cases here.

The simpler one is when you start developing a brand-new, shiny application for a mobile platform from scratch. When you design it, you need to keep in mind what kind of data it will work with, so you can predict possible clashes. Besides, the system is UNICODE inside whether you like it or not.

The second scenario is when you try to compile or port your existing desktop code for Windows Mobile. The immediate outcome is that it might cost you a few hours (or maybe even days) to get it working, unless you were lucky enough to wrap all your literal text in encoding-safe macros such as _T(), TEXT(), or whatever else you use for this purpose. The same is true for character-based types: LPSTR and char have to be replaced by LPTSTR and TCHAR, respectively.

Another aspect of the same "type-change" problem is that, in most cases, there are two versions of each Win32 API function: one for multibyte character arguments (with an "A" suffix) and one for wide ones (with a "W" at the end). This shouldn't be a big problem, though. A number of fixes here and there, and everything takes off happily. But, what if you have to handle non-UNICODE textual data, such as ASCII? In this case, you have no choice but to either perform the required conversions or continue processing the data in its original encoding; the latter might cause you to spend additional time adjusting the code to make it happen. This is where conversions to and from UNICODE for different types of text encodings come into play.

ASCII, UTF-8, What's More?

Consider the scenario when your application should somehow process something other than UNICODE text input. This is quite a common case: the data may be prepared on a PC as a plain and simple UTF-8 (or ASCII) file, or a database field may be declared with a single-byte char type, and you can think up any number of your own examples. Moreover, if you use files, they may have localized names. The bottom line: you have to decide how to deal with all this stuff.

Because a binary tree pattern for use-cases seems to be employed in this article, you might consider two possible solutions for conversions: utilizing the existing CRT libraries or the Win32 APIs. The former may be a natural choice for your application, but the real problem is that CRT functions such as mbstowcs() don't work correctly with all code pages; for example, with the half-width end of a Katakana table. Hence, if you have to target such languages, the Win32 APIs are the only choice left:

int MultiByteToWideChar(
   IN UINT     CodePage,
   IN DWORD    dwFlags,
   IN LPCSTR   lpMultiByteStr,
   IN int      cbMultiByte,
   OUT LPWSTR  lpWideCharStr,
   IN int      cchWideChar);

int WideCharToMultiByte(
   IN UINT     CodePage,
   IN DWORD    dwFlags,
   IN LPCWSTR  lpWideCharStr,
   IN int      cchWideChar,
   OUT LPSTR   lpMultiByteStr,
   IN int      cbMultiByte,
   IN LPCSTR   lpDefaultChar,
   OUT LPBOOL  lpUsedDefaultChar);

You will focus on the first two parameters of the functions above. CodePage defines the code page you're going to convert to or from; for most cases, you may choose CP_UTF8. As the documentation says, both functions work faster when dwFlags is zero. Unless you have to work with composite characters, such as "è", that's all you need for back-and-forth text conversions. One more useful feature of these functions is that, when called with an output buffer size of zero, they return the buffer size required for the conversion. So, a typical call may look like this:

// First, get the required buffer length (in characters)
DWORD dwSize = ::MultiByteToWideChar(CP_UTF8, 0, pMultibyteBuffer,
                                     -1, NULL, 0);
// Allocate it
TCHAR* pWideBuffer = new TCHAR[dwSize];
// Second, make the conversion
::MultiByteToWideChar(CP_UTF8, 0, pMultibyteBuffer, -1, pWideBuffer,
                      dwSize);
// ... use pWideBuffer, then free it
delete [] pWideBuffer;

The second parameter, dwFlags, controls how these functions treat composite characters (like "è", which may be stored as the base character "e" followed by a nonspacing "grave accent" character) and invalid characters. You can experiment with the dwFlags value on your own.

Verifying Text Input

Now, consider the following situation: You have some multibyte text input, which should be treated as UTF-8, but you need to check it so you can reject other encodings. This is not so uncommon; you may have a file in some national encoding such as Japanese (Shift-JIS). Here comes a natural requirement: to verify the input prior to passing it to the conversion functions. According to the UTF-8 encoding rules, it can be done similarly to the following code snippet:

BOOL IsCorrectUTF8Buffer(LPSTR pMultiByteBuf, DWORD dwNumBytes)
{
   for (DWORD i = 1; i < dwNumBytes; i++)
   {
      // 1. check whether the current byte is a trail byte (10xxxxxx)
      if ( (pMultiByteBuf[i] & 0xC0) == 0x80 )
      {
         // 2. a trail byte must not follow a plain ASCII byte (0xxxxxxx)
         if ( (pMultiByteBuf[i-1] & 0x80) == 0x00 )
            return FALSE;
         // 3. another case: a 0xC0/0xC1 lead byte of a 2-byte sequence
         // would encode a code point <= 0x7F (an overlong sequence)
         if ( (pMultiByteBuf[i-1] & 0xFE) == 0xC0 )
            return FALSE;
      }
   }
   return TRUE;
}

The function above simply scans the input text buffer until it finds an incorrect UTF-8 byte sequence. Note that it is only a quick sanity check rather than a full validator, but such an approach may help you detect an input error at an early stage and respond appropriately.

BSTR & Co.

Here, I will discuss a slightly different area where UNICODE is also relevant: the BSTR data type. As you may observe in the SDK headers, BSTR is defined as

typedef OLECHAR *BSTR;

So it is a wide-character string. In fact, you can store any binary data in it, because a BSTR keeps its buffer length in the bytes immediately preceding the pointer, so a NUL terminator is not required. In the case of string data, you can see that a BSTR is represented as a length-prefixed UNICODE string.

Where do you use BSTR on Windows Mobile? Well, MS XML, other COM-related areas, database data—those are just a few examples. To help you work with BSTR, there are two helper classes: _bstr_t (from the compiler COM support classes) and CComBSTR (from ATL). Look into comutil.h and atlcom.h to get more details about these classes and a couple of conversion functions. CComBSTR and _bstr_t provide similar but slightly different interfaces, so which one to use depends on your task's scope and your convenience. Naturally, you will see some minor differences between the ATL 3.0 and ATL 8.0 implementations.

Look at a few code snippets to illustrate the concept:

Sample 1: Binary data and BSTR

// here is our binary data
BYTE bt[] = { 1,2,3,4 };
// allocate a BSTR holding the raw bytes (length is given in bytes)
BSTR bstrBuff = SysAllocStringByteLen((LPCSTR)bt, sizeof(bt));
// attach it to _bstr_t without copying; bstrt now owns the buffer
_bstr_t bstrt(bstrBuff, false);
// here we get a copy back
BSTR btOut = bstrt.copy();
// our data is back again
BYTE *pOut = (BYTE*)btOut;
// tidy up; bstrt releases bstrBuff itself
SysFreeString(btOut);

Sample 2: Raw conversions

#include <comutil.h>
using namespace _com_util;
BSTR bstrData = ConvertStringToBSTR("Sample");
UINT nLen     = SysStringLen(bstrData);
char* pChr    = ConvertBSTRToString(bstrData);
// tidy up
SysFreeString(bstrData);
delete [] pChr;

Sample 3: CComBSTR usage

CString sQuery = _T("//some/node");   // an XPath query of your choice
CComBSTR bstrQuery(sQuery);
IXMLDOMNodePtr pNode;
HRESULT hr = pXmlDoc->selectSingleNode(bstrQuery, &pNode);
if ( SUCCEEDED(hr) && pNode != NULL )
{
    // do something with pNode
}


Hopefully, this article has answered most of your initial questions about UNICODE and ASCII/UTF-8 textual data processing. Whether the text comes from COM interfaces, plain files, databases, or XML, you now can conquer them all. It may be a bit more complicated sometimes, but in most cases everything is quite straightforward once you understand the basics. That's it!

About the Author

Alex Gusev started to play with mainframes at the end of the 1980s, using Pascal and REXX, but soon switched to C/C++ and Java on different platforms. When mobile PDAs seriously raised their heads in the IT market, Alex did too. After working almost a decade for an international retail software company as a team leader of the Windows Mobile R&D department, he decided to dive into Symbian OS™ Core development.
