When copying data that is not aligned to its data size, which matters more for performance: the alignment of the source address or the alignment of the destination address? Put another way, should I align the source pointer or the dest pointer first, before doing the bulk of the copying?
Also, does the relative advantage of source vs. dest alignment hold true for any x86 CPU, or does it vary from generation to generation?
Thanks.
Check the memcopysse2/code location sensitivity thread (http://www.masm32.com/board/index.php?topic=11454.msg87600#msg87600).
Quote:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Algo             memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL
Description         CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
                         dest-al    psllq CeleronM  dest-al   src-al  library
Code size             ?       70      291      222      200      269       33
-----------------------------------------------------------------------------
2048, d0s0-0        733      735      608      608      615      872      732
2048, d1s1-0       1100      821      649      649      643      653     4299
2048, d7s7-0        995      827      654      661      649      654     4324
2048, d7s8-1       1262     1339     1207      870      618      621     4319
2048, d7s9-2       1262     1341     1218      872      619      611     4340
2048, d8s7+1       1244     1333     1188     1213      620      916     1229
2048, d8s8-0        980      819      656      655      659      655      984
2048, d8s9-1       1228     1347     1210      870      613      621     1229
2048, d9s7+2       1584     1334     1176     1208      613      932     4029
2048, d9s8+1       1587     1333     1176     1209      618      929     4020
2048, d9s9-0       1101      821      660      659      659      661     4040
2048, d15s15        766      825      654      661      661      651     4031
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Algo             memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL
Description         CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
                         dest-al    psllq src+dest  dest-al   src-al  library
Code size             ?       70      291      222      200      269       33
-----------------------------------------------------------------------------
2048, d0s0-0        556      566      363      363      373      363      560
2048, d1s1-0       1047      619      418      420      444      420     1723
2048, d7s7-0        567      619      419      421      446      421     1744
2048, d7s8-1       1474     1515     1090      441      962      965     1535
2048, d7s9-2       1473     1522     1090      448      970      975     1759
2048, d8s7+1       1464     1309     1090      698      817      822     1465
2048, d8s8-0        556      619      421      423      448      423      560
2048, d8s9-1       1465     1522     1083      441      961      962     1467
2048, d9s7+2       1481     1309     1081      765      824      832     1804
2048, d9s8+1       1481     1309     1081      696      818      821     1510
2048, d9s9-0       1047      619      421      423      448      423     1724
2048, d15s15        567      619      423      425      446      424     1718
Intel(R) Celeron(R) CPU 2.40GHz (SSE3 - jj Desktop)
Algo             memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL
Description         CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
                         dest-al    psllq src+dest  dest-al   src-al  library
Code size             ?       70      291      222      200      269       33
-----------------------------------------------------------------------------
2048, d0s0-0        744      746      605      609      602      605      746
2048, d1s1-0       1098      827      657      657      647      653     4058
2048, d7s7-0       1004      824      662      658      658      662     4301
2048, d7s8-1       1240     1322     1222     1185      701      702     4285
2048, d7s9-2       1243     1336     1222     1221      697      694     4050
2048, d8s7+1       1214     1316     1190     1219      606      917     1216
2048, d8s8-0        980      830      663      667      666      667      981
2048, d8s9-1       1210     1334     1209      893      609      610     1212
2048, d9s7+2       1587     1316     1178     1282      694      989     4305
2048, d9s8+1       1586     1316     1178     1308      702      968     4304
2048, d9s9-0       1098      831      660      663      661      660     4305
2048, d15s15        753      831      675      675      674      675     4340
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4, BlackVortex (http://www.masm32.com/board/index.php?topic=11454.msg87622#msg87622))
Algo           _imp__st   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL
Description         CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
                 strcpy  dest-al    psllq src+dest  dest-al   src-al  library
Code size             ?       70      291      222      200      269       33
-----------------------------------------------------------------------------
2048, d0s0-0       1647      360      210      210      209      165      367
2048, d1s1-0       1649      396      259      266      258      219     1876
2048, d7s7-0       1631      402      265      261      261      218     1876
2048, d7s8-1       2159     1332      861      493      658      670     1439
2048, d7s9-2       2188     1338      862      493      658      666     1906
2048, d8s7+1       2151     1328      855      829      701      700     1393
2048, d8s8-0       1639      402      267      262      268      236      365
2048, d8s9-1       2205     1333      854      493      658      667     1290
2048, d9s7+2       2151     1329      849      828      700      701     1877
2048, d9s8+1       2143     1330      849      830      701      701     1471
2048, d9s9-0       1642      403      266      262      268      235     1884
2048, d15s15       1642      404      270      264      265      235     1893
Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz (SSE4, Ramguru (http://www.masm32.com/board/index.php?topic=11454.msg87601#msg87601))
Algo           _imp__st   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL
Description         CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
                 strcpy  dest-al    psllq src+dest  dest-al   src-al  library
Code size             ?       70      291      222      200      269       33
-----------------------------------------------------------------------------
2048, d0s0-0       1985      191      204      205      203      175      196
2048, d1s1-0       1992      223      251      248      251      222      244
2048, d7s7-0       1982      227      253      250      248      223      243
2048, d7s8-1       1976      246      592      501      193      211      243
2048, d7s9-2       1986      247      594      501      194      212      243
2048, d8s7+1       1982      249      587      501      193      179      244
2048, d8s8-0       1976      229      255      253      251      226      243
2048, d8s9-1       1983      249      588      502      194      211      243
2048, d9s7+2       1982      248      583      501      193      179      243
2048, d9s8+1       1977      249      583      501      193      179      243
2048, d9s9-0       1984      229      255      253      251      226      244
2048, d15s15       1974      228      256      256      258      226      243
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3, Mark Jones (http://www.masm32.com/board/index.php?topic=11454.msg87609#msg87609))
Algo           _imp__st   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL
Description         CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
                 strcpy  dest-al    psllq src+dest  dest-al   src-al  library
Code size             ?       70      291      222      200      269       33
-----------------------------------------------------------------------------
2048, d0s0-0       2079      551      359      424      424      359      547
2048, d1s1-0       2084      613      410      473      473      410     1060
2048, d7s7-0       2080      598      412      474      474      411     1059
2048, d7s8-1       2156      853     1016      564      567      569      802
2048, d7s9-2       2162      859     1016      564      567      568     1058
2048, d8s7+1       2172      849      868      564      565      566      804
2048, d8s8-0       2082      603      404      465      465      402      547
2048, d8s9-1       2177      848      995      564      565      568      803
2048, d9s7+2       2156      855      862      581      565      580     1060
2048, d9s8+1       2167      869      878      564      567      566      803
2048, d9s9-0       2090      595      412      472      472      409     1060
2048, d15s15       2064      592      410      470      486      408     1060
Intel(R) Pentium(R) 4 CPU 2.40GHz (SSE2: lddqu not possible, used movdqu)
Algo             memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL
Description         CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
                         dest-al    psllq src+dest  dest-al   src-al  library
Code size             ?       70      291      222      200      269       33
-----------------------------------------------------------------------------
2048, d0s0-0        660      665      506      505      503      506      667
2048, d1s1-0       1088      712      554      548      550      555     3291
2048, d7s7-0        896      712      572      567      552      576     3288
2048, d7s8-1       1492     1630     1385     1134     1411     1396     3254
2048, d7s9-2       1492     1694     1385     1239     1415     1432     3992
2048, d8s7+1       1876     1565     1391     2085     1335     1377     1890
2048, d8s8-0        895      714      562      559      562      566      902
2048, d8s9-1       1516     1693     1374     1167     1383     1364     1532
2048, d9s7+2       2298     1567     1365     2220     1341     1323     3300
2048, d9s8+1       2298     1564     1363     2169     1382     1319     3261
2048, d9s9-0       1090      712      564      560      559      559     3327
2048, d15s15        683      709      565      575      578      564     3337
swsnyder,
The answer varies with the processor you use. Misaligned reads and writes were more of a problem on older processors. If you need to write higher-speed misaligned copy routines, there are SSE instructions that will do that; they are slower than the 16-byte aligned versions, but not by much. If you can align one or the other, which is usually possible, try aligning the source first and the destination afterwards. There is another possible combination: perform misaligned reads and use aligned non-temporal writes.
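
For anyone who wants to try that last combination, here is a minimal sketch in C with SSE2 intrinsics rather than MASM. It is not one of the MemCo* routines timed above: the name copy_dest_aligned and its non_temporal flag are made up for illustration. It copies a few head bytes until the destination is on a 16-byte boundary, then does unaligned loads (movdqu) with aligned stores (movdqa), or streaming stores (movntdq) when the flag is set. Whether the streaming variant helps probably depends on the CPU and on how large the buffer is relative to the cache, so time it on your target.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration only.
   Copies n bytes with unaligned 16-byte loads and aligned 16-byte stores.
   If non_temporal is nonzero, the stores are streaming (cache-bypassing).
   The buffers must not overlap. */
void *copy_dest_aligned(void *dst, const void *src, size_t n, int non_temporal)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    /* Head: copy 0..15 bytes so that d lands on a 16-byte boundary. */
    size_t head = (size_t)(-(uintptr_t)d & 15);
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Bulk: the source may be misaligned, the destination is aligned. */
    while (n >= 16) {
        assert(((uintptr_t)d & 15) == 0);
        __m128i v = _mm_loadu_si128((const __m128i *)s);   /* movdqu  */
        if (non_temporal)
            _mm_stream_si128((__m128i *)d, v);             /* movntdq */
        else
            _mm_store_si128((__m128i *)d, v);              /* movdqa  */
        d += 16;
        s += 16;
        n -= 16;
    }
    if (non_temporal)
        _mm_sfence();  /* make the streaming stores visible in order */

    /* Tail: remaining 0..15 bytes. */
    memcpy(d, s, n);
    return dst;
}
```

Call it like memcpy, e.g. copy_dest_aligned(dest, src, 2048, 0) for ordinary aligned stores or with 1 as the last argument for non-temporal stores. Swapping the roles (align the source, use movdqu for the stores) gives the src-al variant to compare against.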