## Wednesday, September 22, 2010

### Python and binary data - Part 2

In the previous post, I discussed about numbers (floating, signed and whole) representation on a computer. In "C", the bits used for representation are limited. This means that there is inherent limitation on what can be represented. It also means that there is danger of overflow when you can't hold all the bit values after an operation to represent a number that exceeds the bit limits.

How about Python? How does it represent these numbers?

Even if Python underlying implementation is in "C" or "C++" types, Python integers are not like typical "C" integers.
Python integer have arbitrary precision.
It creates a higher level of abstraction for number representation.
Python represents integers by allocating the memory that is required to hold the number value. Initially, the size is set to 32 or 64 bits and increased when required. They can pretty much store a very large value or an astronomical figure. Arbitrary precision operations on these integers can be very slow.

How ever, Python floating point numbers are different. They are essentially same as "C" floating point numbers. Python’s floating point types are implemented in terms of C’s double type.
They are limited and does not provide arbitrary precision.
There are special floating point values defined that deals with this limitation. They are called "inf" for representing infinity and "NaN" to represent - not a number.

consider this:
>>> x = 1e200
>>> y = x*x
>>> y
inf
>>> z = y/y
>>> z
nan


This discrepancy in integers and floating point number representation means that one has to be careful about various arithmetical operations on numbers. We can carry out precise operations on integers as their precision is arbitrary large. How ever, if we mix them up with floating point numbers, the result can be unpredictable.

>>> x = 10**200
>>> y = (x + 0.5)*x
>>> y
inf
>>> z = x*x + 0.5
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: long int too large to convert to float
>>>


Basically, this means that you can not convert a very large integer to a limited floating number representation without being aware of this limitation.

Representation in various bases
An integer can be represented as binary , hex , octal and decimal numbers in Python.
>>> x = 0x100
>>> x
256
>>> y = 0o100
>>> y
64
>>> z = 0b100
>>> z
4

>>> bin(4)
'0b100'
>>> hex(16)
'0x10'
>>> int('0b100', 2)
4
>>> int('0x10', 16)
16
>>>

Bit wise operation on binary numbers - And, Or, Xor and Not
>>> 0b110 & 0b001
0
>>> 0b110 | 0b001
7
>>> bin(7)
'0b111'
>>> 0b110 ^  0b001
7
>>> ~ 0b001
-2
>>> bin(-2)
'-0b10'
>>> int('0b001', 2)
1
>>> ~ 1
-2
>>> ~ 0b0110
-7
>>> bin(-7)
'-0b111'


Shift operation
Left shift is equivalent to multiplying with 2 and right shift is equivalent to dividing 2.
>>> k = 0b100
>>> k
4
>>> k << 1
8
>>> k << 2
16
>>> k = 4
>>> k << 1
8
>>> k >> 1
2
>>> bin(k << 20)
'0b10000000000000000000000'
>>>

Setting and clearing bits

To set the ith bit of x to 1

• Create a value mask in which bit i is 1 and all others are 0
• Use x = x | mask

To set the ith bit of x to 0:

• Create a value mask in which bit i is 1 and all others are 0
• Negate it using ~, so that the ith bit is 0, and all the others are 1
• Use x = x & mask

>>> x = 0b1010101
>>> x = x | mask
>>> bin(x)
'0b1110101'
>>> x = x & mask
>>> bin(x)
'0b1100101'
>>>


Bit Flags

You could either store separate boolean values or combine these into a bit vector with each bit representing a particular flag. This process may be slow but saves on space.
So instead of saying x , y , z as boolean flags that can be set to True or False, you can use a a bit vector with 0th bit representing flag x, 1st bit representing flag y and 2nd bit representing flag z.

Thats is 0b0xyz with x, y, z having values in (0,1)  : X = 0x01, Y = 0x02, Z = 0x04

Then it is easy to check on individual flags:
>>> X = 0x01
>>> Y = 0x02
>>> Z = 0x04
>>> T
3
>>> bin(T)
'0b11'
>>> T & X
1
>>> T & Y
2
>>> T & Z
0