News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Program optimization

Started by StarShaper, September 08, 2010, 01:18:54 AM

Previous topic - Next topic

StarShaper

Hi!

I'm trying to optimize a C program and experienced some interesting stuff. It seems that the following function performs better

unsigned char func(unsigned char a)
{
    return (a ? secTable[(firstTable[a] + 0xff) % 255] : 0);
}


than this function:

unsigned char func(unsigned char a)
{
    switch(a) {
        case 0: return 0;
        case 1: return 1;
        case 2: return 2;
        case 3: return 3;
        case 4: return 4;
        case 5: return 5;
        case 6: return 6;
        case 7: return 7;
        case 8: return 8;
        case 9: return 9;
        case 10: return 10;
        case 11: return 11;
        case 12: return 12;
        case 13: return 13;
        case 14: return 14;
        ...   
        case 256: return 56;
    }
}


I dont know what optimizations the compiler does in the Release build, but I would like to know if some assembly experts could explain this issue?

It seems to me that the second function produces a lot of compares and therefore slows down the app, right?

hutch--

Hi Star,

If all you want to do is input one number and get another in return a table is the right way to go in that it has the same overhead for every member. I know you can do this in C with no problems as well as in MASM so you can more or less do what you like. The limitation of the table method is you have to have a table member for every entry in the range of numbers you want to change. 0 to 255 means a 256 member table.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

StarShaper

Hi hutch--,

that is exactly what I did, creating a 256 byte large table.  :wink

clive

Depends what the compiler does, most likely it generates a jump table rather than doing any/many compares. If the switch/case was sparse, depending on the compiler it would do a lot more compares, or implement a binary tree of comparisons.

Looks like you are doing some math on a Galois Field (Multiply, Log, AntiLog) of 2^8-1 length, or encryption (s-box). The most effective way would be to create a third byte table, and just index into that.

"case 256:" wouldn't be valid for an "unsigned char"
It could be a random act of randomness. Those happen a lot as well.

alp

    template<int i>
    inline int Ret()
{

}

    template<>
    inline int Ret<1>()
{
return 1;
}

    template<>
    inline int Ret<2>()
{
return 2;
}



    template<>
    inline int Ret<3>()
{
return 3;
}
    .
    .
    .
    template<>
    inline int Ret<256>()
{
return 56;
}





StarShaper

Quote from: clive on September 08, 2010, 06:02:49 PM
Depends what the compiler does, most likely it generates a jump table rather than doing any/many compares. If the switch/case was sparse, depending on the compiler it would do a lot more compares, or implement a binary tree of comparisons.

It's the Visual Studio 2010 Compiler. However, the switch statement ist pretty slow. It seems to me that the compiler doesn't generate a jump table.

Quote from: clive on September 08, 2010, 06:02:49 PM
Looks like you are doing some math on a Galois Field (Multiply, Log, AntiLog) of 2^8-1 length, or encryption (s-box). The most effective way would be to create a third byte table, and just index into that.

You are a smart guy. Thats right, multiplying in GF(2^8).  :wink

I'm amazed how fast Java's cryptographic libraries are. It seems to me, that they are implemented in Assembly.  :dazzled:

Quote from: clive on September 08, 2010, 06:02:49 PM
"case 256:" wouldn't be valid for an "unsigned char"

Thats right, just a spelling error.

clive

I'd have to see how MSVC treats some similar code, not normally how I'd attack it. Certainly on some of the embedded compilers the switch/case solution would create non-optimal results, and probably be bigger than 256 bytes to boot.

Table driven methods work so much better, and also work very effectively on micro-controllers (and 4.77 MHz PCs for that matter). I've used table driven methods for Reed-Solomon error correction, CRC and AES encryption software, and played with polynomials, CRC and GF() stuff in hardware too.
It could be a random act of randomness. Those happen a lot as well.

StarShaper

Quote from: clive on September 08, 2010, 08:20:21 PM
I've used table driven methods for Reed-Solomon error correction, CRC and AES encryption software, and played with polynomials, CRC and GF() stuff in hardware too.

Right, table driven methods are by far the fastest way to implement such kind of algorithms. I did some testings with bit shifting and so on, but a table with offset access is a lot better. Of course you have to calculate the tables first, but this isn't really a problem.

I still have some bottlenecks, my program needs 11 sec to encrypt a 150 MB file. WinRar needs only 5 sec. Most probably there are some issues with reading and writing the file. Maybe it would be better to read the file completely into memory and then start encrypting.

clive

I don't think I'd bother to read it all in. I'd say chunk it out in 32 KB pieces (good compromise of sector, cluster sizes and burst read lengths) and see how that works. Depends a lot on the file system and the drives, I think in previous timings 64 KB or 128 KB was about the sweet spot. Timing/benching your own application is the best way to go.
It could be a random act of randomness. Those happen a lot as well.

StarShaper

Did some further research. Crypto++ is by far the fastest implementation of the Advanced Encryption Standard. It encrypts a 150 MB file in less than 1 second.

The reason for this is AES-NI. I looked into the source code, it uses the macro CRYPTOPP_BOOL_AESNI_INTRINSICS_AVAILABLE.

A further search with Google took me to the Intel site about AES-NI.

http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/

This is a new instruction set for the 32nm Intel processor family codename Westmere formerly Nehalem-C. It consists of 4 instructions (AESENC, AESENCLAST, AESDEC, and AESDELAST) facilitate high performance AES encryption and decryption, and the other two (AESIMC and AESKEYGENASSIST) support the AES key expansion.

// performs one round of AES encryption
#include <wmmintrin.h>
#include <stdio.h>

int main()
{
    __m128i a;
    __m128i res;
    __m128i key;

    a.m128i_u64[1] = 0x8899AABBCCDDEEFF;
    a.m128i_u64[0] = 0x0123456789ABCDEF;
    key.m128i_u64[1] = 0x0022446688AACCEE;
    key.m128i_u64[0] = 0x1133557799BBDDFF;

    res = _mm_aesenc_si128( a, key );

    printf_s("Original data: 0x%016I64x%016I64x\n",
        a.m128i_u64[1], a.m128i_u64[0]);
    printf_s("Encoded data: 0x%016I64x%016I64x\n",
        res.m128i_u64[1], res.m128i_u64[0]);

    return 0;
}


Now I don't have to implement any kind of GF(2^8) math...  :lol

hutch--

A couple of things, if you are going to use a lookup table, test the difference between a BYTE table and a DWORD table, it may show you some surprises in timings. RE read ahead buffer size, I differ here in that I see bigger is better to a finite boundary so i would happily start with a buffer as small as 1 meg and just keep making it bigger if it showed speed improvements.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

StarShaper

Quote from: hutch-- on September 09, 2010, 04:36:27 AM
A couple of things, if you are going to use a lookup table, test the difference between a BYTE table and a DWORD table, it may show you some surprises in timings.

Already thought about that. I wouldn't be surprised if a DWORD table would be faster, since the processor operates on machine register sizes. However, I already implemented the code in C with BYTE, because it is easier to understand the code.

Now, with the new Intel instruction set, it becomes pretty easy to implement the AES. It is possible to calculate a complete round in a single instruction.

I will maybe use SSE (SIMD) to optimize some parts of the code. However, I don't want to end with assembly code. The primary goal is to implement the algorithm in C.