Discussion:
Unicode and strings
Jean
2010-04-07 06:15:46 UTC
Permalink
Hello

(pure C and SDK, XP/7)

I am trying to read a directory with mixed Cyrillic and Western file names.
I use FindFirstFileW and WideCharToMultiByte(CP_UTF8...

When I write the filename to a text file (with fwrite), everything is correct.
When I add the filename to a listbox I get garbage, idem if I use
MessageBoxA to display the filename.

Any ideas?
Jean
Xavier Roche
2010-04-07 06:28:43 UTC
Permalink
Post by Jean
when i write the filename to a text file all is correct (with fwrite)
when i add the filename to a listbox i get garbage, idem if i use
MessageBoxA to display the filename
Either use MessageBoxW() or SetThreadLocale() to change the current
locale to one with a UTF-8 code page.
Jean
2010-04-07 06:44:45 UTC
Permalink
I tried SetThreadLocale, but the result is the same.

Jean
Friedel Jantzen
2010-04-07 17:14:44 UTC
Permalink
Hello Jean,

please post the code.
Post by Jean
Hello
(pure C and SDK, XP/7)
I try to read a directory with cyrillic and western mixed file names.
I use FindFirstFileW, WideCharToMultiByte(CP_UTF8...
when i write the filename to a text file all is correct (with fwrite)
If you converted the strings to UTF-8, you cannot add them to a Windows
control. If your project is Unicode, strings must be encoded as UTF-16 LE (WCHAR,
WideChar) to use them with the APIs.
Post by Jean
when i add the filename to a listbox i get garbage, idem if i use
MessageBoxA to display the filename
I do not understand what you mean by "idem" (English is not my mother
tongue)?
All Unicode APIs expect wide characters (WCHAR), not UTF-8.
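
To make the garbage concrete, here is a minimal portable sketch (not Windows API code; the helper name utf8_encode_bmp is made up for illustration) of what a UTF-8 conversion does to a single Cyrillic letter. The letter U+042F becomes the two bytes D0 AF. An "A" function reads each byte as one character of the ANSI code page, so, assuming that code page is Windows-1252, the listbox would show two unrelated characters instead of one Cyrillic letter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code point from the Basic Multilingual Plane as UTF-8.
 * Hypothetical helper for illustration; returns the number of bytes
 * written (1 to 3). Surrogates and code points above U+FFFF are not handled.
 */
static size_t utf8_encode_bmp(uint32_t cp, unsigned char out[3])
{
    assert(cp <= 0xFFFF);

    if (cp < 0x80) {                    /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* e.g. Cyrillic: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Western letters in the ASCII range survive because their UTF-8 encoding is a single byte identical to ASCII, which is why only part of a mixed file name turns into garbage.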

Regards,
Friedel
Jean
2010-04-08 05:14:14 UTC
Permalink
Hello Friedel
Here is the code:

int how_many_fichiers_unicode(char *rep) {
    HANDLE hh;
    wchar_t wrep[_MAX_PATH];
    char full_name[_MAX_PATH];
    int nbFic=0;
    WIN32_FIND_DATAW wfind;
    int ret, sz;
    LPSTR ansi;

#define UNICODE
#define _UNICODE

    ret = mbstowcs(wrep, (const char*)rep, _MAX_PATH);

    hh = (HANDLE)FindFirstFileW(wrep, &wfind);
    if(hh == (HANDLE)0xff) return -1;

    do {
        if(wfind.dwFileAttributes & _A_SUBDIR) {
            continue;
        }
        else
        if(wfind.dwFileAttributes & _A_HIDDEN) {
            continue;
        }

        memset(full_name, 0, _MAX_PATH);
        sz = WideCharToMultiByte(CP_UTF8, 0, wfind.cFileName, -1, (LPSTR)NULL, 0,
                                 NULL, NULL);
        ansi = HeapAlloc(GetProcessHeap(), 0, sz+1);
        ret = WideCharToMultiByte(CP_UTF8, 0, wfind.cFileName, -1, (LPSTR)ansi,
                                  sz+1, NULL, NULL);

        MessageBox(NULL,ansi,"",MB_OK);

        HeapFree(GetProcessHeap(), 0, ansi);
        nbFic+=1;
    } while(FindNextFileW(hh, &wfind));
    FindClose(hh);

#undef _UNICODE
#undef UNICODE
    return nbFic;
}

Jean
Ulrich Eckhardt
2010-04-08 07:00:57 UTC
Permalink
Post by Jean
#define UNICODE
#define _UNICODE
Stop, this doesn't work. These are evaluated in a few headers. Once these
headers are included, and they are at this point, these macros don't have
any further effect. Define them on the command line or in a central header
included first by every file of your project. Otherwise, the chance of
having those inconsistent for different parts of your code is just too big.
If you're using MSVC, set the charset to "Unicode" in the "general" tab of
the project's settings. In any case, these only affect what TCHAR is, but
you are (almost) not using any TCHAR functions but the explicit *W or *A
versions.
Post by Jean
hh = (HANDLE)FindFirstFileW(wrep, &wfind);
if(hh == (HANDLE)0xff) return -1;
Two big no-nos here:
1. Don't cast the return value of FindFirstFile; use the correct type on the
left side instead.
2. The value it returns in case of errors is documented; use that macro
instead. For FindFirstFile it is INVALID_HANDLE_VALUE, not NULL.

Read the documentation at http://msdn.microsoft.com and/or install a plugin
for Firefox that looks up such functions there.
Post by Jean
if(wfind.dwFileAttributes & _A_SUBDIR) {
You generally never need anything starting like _THIS, i.e. underscore
followed by an uppercase letter. Again, read the documentation for the file
attributes, the constants are documented there.
Post by Jean
sz = WideCharToMultiByte(CP_UTF8, 0, wfind.cFileName, -1, (LPSTR)NULL, 0,
NULL, NULL);
ansi = HeapAlloc(GetProcessHeap(), 0, sz+1);
ret = WideCharToMultiByte(CP_UTF8, 0, wfind.cFileName, -1, (LPSTR)ansi,
sz+1, NULL, NULL);
MessageBox(NULL,ansi,"",MB_OK);
How about a straightforward "MessageBoxW(NULL, wfind.cFileName, ...)"
instead? This will work.

Uli
--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
Jean
2010-04-08 08:11:43 UTC
Permalink
Ulrich, thank you for the answer.

I defined UNICODE because I did not want to modify my whole program (several
thousand lines).
I thought that I could use it in a small module just to read file names,
then add these file names to a syslistview.
But it seems that I will probably not be able to do that.
Must I declare my whole project as UNICODE and modify declarations such as
char -> wchar_t, etc.?

(_A_SUBDIR is described in microsoft.com)

Jean
Timo Kunze
2010-04-08 08:20:36 UTC
Permalink
Post by Jean
(_A_SUBDIR is described in microsoft.com)
Yes, it is, but in a different context. If you want to use the _A_*
constants, then you should also use _findfirst/_wfindfirst instead of
FindFirstFile.

Timo
--
www.TimoSoft-Software.de - Unicode controls for VB6
"Those who sacrifice freedom for safety deserve neither."
"Demokratie ist per Definition unsicher. Ihr Schutz entsteht aus der
Überzeugung, dass die demokratischen Kräfte überwiegen und sich – auf
demokratischem Wege – durchsetzen."
Jean
2010-04-08 08:22:45 UTC
Permalink
OK, got it, thank you

Jean
Ulrich Eckhardt
2010-04-08 10:04:33 UTC
Permalink
Post by Jean
I defined UNICODE because I did not want to modify my whole program
(several thousand lines).
I thought that i could use it in a small module just to read file names,
then add these file names to a syslistview.
No, keep this thing consistent throughout all your code. If you want parts
of the code to use the wchar_t APIs, use them explicitly.
Post by Jean
Must I declare my whole project as UNICODE and modify declarations such as
char -> wchar_t, etc.?
No, you got that backwards. UNICODE/_UNICODE don't affect the meaning of
char or wchar_t at all! The only thing they affect is TCHAR, which resolves
to char normally and to wchar_t when the macros are defined. Similarly, all
functions and types depending on TCHAR are affected, like e.g. LPCTSTR or
CreateFile.

Uli
Jean
2010-04-08 10:16:37 UTC
Permalink
OK, thank you

Jean
r***@pen_fact.com
2010-04-08 19:05:05 UTC
Permalink
On Thu, 08 Apr 2010 12:04:33 +0200, Ulrich Eckhardt
Post by Ulrich Eckhardt
No, you got that backwards. UNICODE/_UNICODE don't affect the meaning of
char or wchar_t at all! The only thing they affect is TCHAR, which resolves
to char normally and to wchar_t when the macros are defined. Similarly, all
functions and types depending on TCHAR are affected, like e.g. LPCTSTR or
CreateFile.
UNICODE also affects definitions of macros used for function names. So
MessageBox is defined as MessageBoxW when UNICODE is defined, but
defined as MessageBoxA when UNICODE is not defined.
-----------------------------------------
To reply to me, remove the underscores (_) from my email address (and please indicate which newsgroup and message).

Robert E. Zaret, eMVP
PenFact, Inc.
20 Park Plaza, Suite 400
Boston, MA 02116
www.penfact.com
Useful reading (be sure to read its disclaimer first):
http://catb.org/~esr/faqs/smart-questions.html
Friedel Jantzen
2010-04-08 08:53:00 UTC
Permalink
Hello Jean!

As Ulrich replied, you can simplify it all by setting the Unicode option in
your project, if you are using MSVC.
Other compilers may require your defines
#define UNICODE
#define _UNICODE
as the first lines of the first header;
if you do not use any header file, as the first lines of your c file.

Many names for the API functions are macros, and with UNICODE defined they
and the Windows structs are automatically set to the unicode versions.
This means: You need not write FindFirstFileW, FindFirstFile will mean the
same.
And these WideChar versions expect WCHAR string parameters:
To keep it easy, use WCHAR always, no conversions needed.

int how_many_fichiers_unicode(WCHAR *wrep) {

    int nbFic = 0;
    WIN32_FIND_DATA wfind;  // This is WIN32_FIND_DATAW
    HANDLE hh = FindFirstFile(wrep, &wfind);
    // use INVALID_HANDLE_VALUE!
    // Search msdn for FindFirstFile
    if (hh == INVALID_HANDLE_VALUE) return -1;

    do {
        if (wfind.dwFileAttributes & (FILE_ATTRIBUTE_HIDDEN |
                                      FILE_ATTRIBUTE_DIRECTORY)) continue;
        nbFic++;
        // This is MessageBoxW
        MessageBox(NULL, wfind.cFileName, L"Test", MB_OK);
    } while (FindNextFile(hh, &wfind));

    FindClose(hh);
    return nbFic;
}

Regards,
Friedel
Jean
2010-04-08 09:17:47 UTC
Permalink
Post by Friedel Jantzen
if you are using MSVC
Yes, I use MSVC6 and the pure C SDK.
I set UNICODE in the project and got 34765 warnings :-))

Using WCHAR is easy, thank you for the tip.
I suppose that I must use _wsplitpath instead of _splitpath, swprintf
instead of wsprintf, wcscpy instead of strcpy, etc.?

Jean
Friedel Jantzen
2010-04-08 19:27:23 UTC
Permalink
Hello Jean!
Post by Jean
Post by Friedel Jantzen
if you are using MSVC
yes, i use MSVC6 and pure C SDK.
i set UNICODE in the project and got 34765 warnings :-))
using WCHAR is easy, thank you for the tip.
I suppose that I must use _wsplitpath instead of _splitpath, swprintf
instead of wsprintf, wcscpy instead of strcpy, etc.?
Yes, the wcs CRT string functions must be used if the strings are Unicode.
This redesign can be a lot of work.
However, it is possible to use both ANSI and Unicode APIs together in the
exe, as in your original code posting. This is a mess, but possible (not
on Win9x).
If you want to do it that way, do not define UNICODE, but call the
WideChar API functions explicitly, as you did. But when you must pass the
strings to ANSI functions, you will run into problems converting the
Unicode Cyrillic strings (wchar_t, WCHAR) to ANSI (char)...
IMO there really is no way around Unicode if, e.g., Cyrillic must be
supported.
BTW, I suggest providing a way for the user to select a font to ensure
Cyrillic support.

Regards,
Friedel
r***@pen_fact.com
2010-04-08 19:05:06 UTC
Permalink
Post by Jean
Hello Friedel
int how_many_fichiers_unicode(char *rep) {
HANDLE hh;
wchar_t wrep[_MAX_PATH];
char full_name[_MAX_PATH]
int nbFic=0;
WIN32_FIND_DATAW wfind;
int ret, sz;
LPSTR ansi;
#define UNICODE
#define _UNICODE
clip
Post by Jean
MessageBox(NULL,ansi,"",MB_OK);
I'm surprised this line compiled. You have UNICODE and _UNICODE
defined, so MessageBox should be defined as MessageBoxW. But you've
passed an ANSI argument.

clip
Post by Jean
#undef _UNICODE
#undef UNICODE
return nbFic;
}
Jean
Ulrich Eckhardt
2010-04-09 06:53:41 UTC
Permalink
[...]
Post by r***@pen_fact.com
Post by Jean
LPSTR ansi;
#define UNICODE
#define _UNICODE
clip
Post by Jean
MessageBox(NULL,ansi,"",MB_OK);
I'm surprised this line compiled. You have UNICODE and _UNICODE
defined, so MessageBox should be defined as MessageBoxW. But you've
passed an ANSI argument.
He first included the corresponding header, where the MessageBox macro is
defined, and only afterwards defined the UNICODE/_UNICODE macros. In other
words, these simply didn't take effect and MessageBox still resolves to
MessageBoxA.
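
The ordering effect can be reproduced without any Windows header. The following is only a portable sketch: the Mock* names are invented stand-ins, not the real API, and it assumes the file is compiled without UNICODE on the command line.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the "A" and "W" entry points of one API. */
static const char *MockMessageBoxA(void) { return "MockMessageBoxA"; }
static const char *MockMessageBoxW(void) { return "MockMessageBoxW"; }

/* ---- imitation of a system header, processed at this point ---- */
#ifdef UNICODE
#define MockMessageBox MockMessageBoxW
#else
#define MockMessageBox MockMessageBoxA
#endif
/* ---- end of imitation header ---- */

/* Too late: the #ifdef above has already been evaluated. */
#define UNICODE

/* The generic name still expands to the "A" variant. */
static const char *generic_call(void) { return MockMessageBox(); }

/* Calling the "W" variant by name works regardless of UNICODE. */
static const char *explicit_wide_call(void) { return MockMessageBoxW(); }

/* Returns 1 when the late #define had no effect on the generic name. */
static int late_define_was_ignored(void)
{
    return strcmp(generic_call(), "MockMessageBoxA") == 0;
}
```

This is also why the explicit *W names are the safer choice in a code base that is otherwise built as ANSI: they do not depend on when or whether UNICODE is defined.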

Uli
r***@pen_fact.com
2010-04-07 18:43:49 UTC
Permalink
Post by Jean
Hello
(pure C and SDK, XP/7)
I try to read a directory with cyrillic and western mixed file names.
I use FindFirstFileW, WideCharToMultiByte(CP_UTF8...
when i write the filename to a text file all is correct (with fwrite)
when i add the filename to a listbox i get garbage, idem if i use
MessageBoxA to display the filename
To expand a bit on an earlier reply:
Functions with names ending in "W" expect Unicode arguments. Functions
with names ending in "A" expect ANSI arguments. So you can't use the
same string in both. You can use WideCharToMultiByte and
MultiByteToWideChar to translate between the two types of string, but
then you have two strings (one Unicode and one ANSI).
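
As a rough illustration of the "two strings" point, here is a portable sketch of the direction that a UTF-8 to wide-character conversion takes. It is a simplified stand-in written for this example, not the Windows function: the name utf8_to_utf16_bmp is made up, only 1- to 3-byte sequences are handled, and overlong or surrogate encodings are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode a NUL-terminated UTF-8 string into UTF-16 code units.
 * Returns the number of code units written (terminator excluded),
 * or (size_t)-1 on malformed input, 4-byte sequences, or lack of space.
 */
static size_t utf8_to_utf16_bmp(const char *src, uint16_t *dst, size_t dst_cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && dst != NULL);

    while (*s != 0) {
        uint32_t cp;
        int extra;  /* continuation bytes still expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else return (size_t)-1;
        s++;

        while (extra-- > 0) {
            if ((*s & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*s & 0x3F);
            s++;
        }

        if (n + 1 >= dst_cap) return (size_t)-1;  /* keep room for the 0 */
        dst[n++] = (uint16_t)cp;
    }

    if (dst_cap == 0) return (size_t)-1;
    dst[n] = 0;
    return n;
}
```

The source bytes and the resulting code units are separate buffers with different element sizes, so each one can only be handed to the family of functions that expects that representation.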
Post by Jean
any idea ?
Jean