Saturday, July 14, 2012

A caveat on using %u with scanf for reading positive integers



Do you use the scanf() family of functions? Do you use the %u conversion specifier to read positive integer values? Do you rely on the return value of scanf() as confirmation that the conversion succeeded? If you do, then read on.

The task

Parse a string (e.g. "42") (or a field within a string) to read a positive decimal integer value and validate it by confirming that the string indeed is a positive integer (and not things like "abcd", "-42"). Use the C programming language.


The sscanf method

The scanf family of functions (scanf/sscanf/fscanf) supports several conversions for reading different types of values: integers, floats, strings etc. The conversions to be done on the input are specified by means of a format string argument. A conversion specification typically consists of the '%' character followed by a character specifying the conversion to be performed. For example, "%d" is for reading 'int' values, "%u" for 'unsigned int' values, "%s" for strings and so on.
The return value of these functions is the number of input items successfully matched and assigned.

So the most straightforward solution is to use sscanf() with the %u conversion to read an 'unsigned int' and check the return value to confirm that the operation was successful:

Listing-1:
char str[] = "42";
int ret;
unsigned int val;
   
ret = sscanf(str, "%u", &val);
if(ret == 1)
    printf("success: %u\n", val);
else
    printf("failed\n");

As such, the above code prints "success: 42" and seems to work fine. It also prints "failed" for non-numeric inputs.

Can you spot any problems with listing-1? For the time being, we can ignore the fact that it converts only the initial part of the string and stops at the first invalid character. That is, for inputs such as "42abcd" and "42.5", sscanf() will return 1 (indicating a successful conversion) and the converted value will be 42. This is a common nuisance when processing the last field of a string.
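As an aside, the trailing-character nuisance can be detected without leaving sscanf() by adding a %n directive, which stores the number of characters consumed so far and does not count towards the return value. The sketch below is only an illustration; the helper name scan_whole_unsigned is made up for this example and is not part of any library.

```c
#include <stdio.h>

/* Hypothetical helper (name invented for this example).
   Returns 1 if "%u" converted a value AND the conversion consumed the
   whole string, 0 otherwise. It only addresses trailing characters. */
int scan_whole_unsigned(const char *str, unsigned int *out)
{
    int consumed = 0;

    /* %n is not counted in the return value, so success is still 1. */
    if (sscanf(str, "%u%n", out, &consumed) != 1)
        return 0;

    /* Anything left after the number means the field was not clean. */
    return str[consumed] == '\0';
}
```

With this helper, "42" is accepted, while "42abcd", "42.5" and "abcd" are all rejected.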

So, problem solved? Not quite!


The caveat


Testing the code in listing-1 with a negative integer as input produces a surprising (at least to me) result:

Listing-2:
char str[] = "-5";
int ret;
unsigned int val;
   
ret = sscanf(str, "%u", &val);
if(ret == 1)
    printf("success: %u\n", val);
else
    printf("failed\n");

The output of listing-2 (with gcc v4.7 on glibc/Linux) is:
"success: 4294967291".
That is, even though the input is negative, sscanf() returns 1, indicating that it made a successful 'unsigned int' conversion.

Is this really valid behaviour, or a bug in the compiler or the C library?

Unfortunately, the behaviour is perfectly valid according to the ISO C language standard. Here's what the standard (C99) says about the 'u' conversion for fscanf() (note the words "optionally signed"):

"Matches an optionally signed decimal integer, whose format is the same as expected for the subject sequence of the strtoul function with the value 10 for the base argument. The corresponding argument shall be a pointer to unsigned integer."
Okay. So the input can be optionally signed, and the rules of the conversion are essentially the same as those for strtoul(), another C library function. To understand further, we must refer to the documentation of strtoul(). And so it goes:
"If the subject sequence begins with a minus sign, the value resulting from the conversion is negated (in the return type)."
There you have it. The behaviour of %u with scanf() is identical to that of strtoul(), and strtoul() considers a negative number valid input. For such input, the function negates the value in unsigned arithmetic, so the result wraps around modulo 2^N for an N-bit type (the same bit pattern as the 2's complement representation of the negative number). That is why "-5" came out as 4294967291, i.e. 2^32 - 5, with a 32-bit unsigned int.
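The strtoul() rule quoted above is easy to observe directly. Here is a small sketch; the wrapper name convert_with_strtoul and its fully_consumed flag are invented for this illustration.

```c
#include <stdlib.h>

/* Hypothetical wrapper (name invented for this example).
   Converts str with strtoul() in base 10 and reports, through
   *fully_consumed, whether the entire string was used. */
unsigned long convert_with_strtoul(const char *str, int *fully_consumed)
{
    char *end;
    unsigned long value = strtoul(str, &end, 10);

    /* end == str means no digits were converted at all. */
    *fully_consumed = (end != str && *end == '\0');
    return value;
}
```

For the input "-5" this returns ULONG_MAX - 4 (the same value as 0UL - 5UL) and sets the flag to 1: as far as strtoul() is concerned, the whole string was a valid number.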

Because of this, when using scanf(), there's no direct way to know if the input really was positive.


The solution

If the acceptable range of inputs is representable by a signed int, then you can use %d instead of %u and use a plain 'int' object to hold the value. You must check the return value of scanf() and then check the value of the converted integer to be within the required limits (i.e. greater than zero and less than some upper limit).
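A minimal sketch of this %d approach might look like the following. The function name parse_positive_int and the max parameter are assumptions made for this example, and the upper limit is treated as inclusive; adjust the comparison if your limit is exclusive.

```c
#include <stdio.h>

/* Hypothetical helper (name and signature invented for this example).
   Reads an int from the start of str and accepts it only if
   0 < value <= max. Returns 1 and stores the value in *out on success,
   returns 0 otherwise. */
int parse_positive_int(const char *str, int max, int *out)
{
    int val;

    if (sscanf(str, "%d", &val) != 1)
        return 0;               /* no integer at the start of str */
    if (val <= 0 || val > max)
        return 0;               /* zero, negative, or above the limit */

    *out = val;
    return 1;
}
```

Note that this sketch still accepts trailing characters ("42abcd" yields 42), and it does not guard against input too large to fit in an int.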

However, if you need to read huge numbers that can be represented only by the unsigned int type, then you must use some other indirect method. One way would be to use %s to read the field as a string first, and then check if the first character is not a '-' sign. Once that is clear, use strtoul() to perform the conversion. One additional benefit of this method is that strtoul() will tell you if there were any additional invalid characters in the input - use the second argument to get a pointer to the first invalid character, which should be the terminating null character if the whole string was successfully converted. With this scheme, invalid inputs such as "42abcd" and "42.5" can be successfully caught.
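Here is one way the strtoul() scheme could be sketched, assuming the field has already been extracted into a null-terminated string (for example with %s). The name parse_unsigned is invented for this example. It is slightly stricter than the check described above: instead of only rejecting a leading '-', it insists that the first character is a digit, which also rules out a leading '+' or whitespace.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (name invented for this example).
   Returns 1 and stores the value in *out if str is entirely an
   unsigned decimal number that fits in unsigned int; 0 otherwise. */
int parse_unsigned(const char *str, unsigned int *out)
{
    char *end;
    unsigned long val;

    /* strtoul() would accept a sign and skip leading whitespace,
       so require a digit up front. This also rejects "". */
    if (!isdigit((unsigned char)str[0]))
        return 0;

    errno = 0;
    val = strtoul(str, &end, 10);
    if (errno == ERANGE)
        return 0;               /* too large for unsigned long */
    if (*end != '\0')
        return 0;               /* trailing characters, e.g. "abcd" or ".5" */
    if (val > UINT_MAX)
        return 0;               /* fits unsigned long but not unsigned int */

    *out = (unsigned int)val;
    return 1;
}
```

With this helper, "42" is accepted, while "-5", "42abcd", "42.5" and "abcd" are all rejected.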


Conclusion


Avoid using %u with the scanf() family of functions. In most practical cases, the range of acceptable values will be small enough to fit in a signed integer. So use %d to read a plain 'int' and then check if the value is negative.

The 'u' conversion is useful only if you really want to read in huge positive numbers that are not representable by a signed int.


So, check your code to see if you have any instances where %u is used to read positive integers and the return value of scanf() is used as a confirmation of valid input.

3 comments:

  1. Thank you!
    That was useful. I think it's silly behavior.
    I need huge numbers, so I read input to long long int first ("%ll"), then convert it to size_t.

  2. ...in other words, the bunch of rabid monkeys with typewriters who designed the C standard libraries (...DON'T GET ME STARTED...) made the %u format specifier of scanf essentially useless. Just discovered this by being paranoid and testing with random input, glad to know I'm not the insane one here - yeah, real smooth, folks...

    1. Indeed.. Although I am not such a huge hater of C/C++ language/library design :-)

      This silly behaviour with %u made me check all my code and painfully change each instance.
