Event Name: Live Chat - Make Your Parallel Code Run Faster (LPW068569)

File Saved by: Sharon Troia

Date Time: Wed November 8 2006 12:02:35

Sharon Troia: (10:35) Hi, welcome to AMD’s Multi-core text chat!

Sharon Troia: (10:35) My name is Sharon Troia. I’ll be your host for this event. I work at AMD as the managing editor for the tech articles that we provide on Dev Central, developer.amd.com.

Sharon Troia: (10:35) We will get started at 11am, until then, hang tight :)

Sharon Troia: (10:50) Shortly before we start, we'll do a quick survey.

Sharon Troia: (10:54) You should all be seeing a survey along with the results.

Sharon Troia: (10:56) Wow, looks like we have a pretty even spread of experience here today so far.

Sharon Troia: (10:57) We'll be switching between the agenda slide and the survey for those people who are just joining.

Sharon Troia: (10:59) While people are joining, we are going to start out with a survey to find out how experienced you are with multithreading.

Sharon Troia: (10:59) Please hold your questions until the survey is complete and we get started

Sharon Troia: (11:00) We are just about ready to get started. We will give participants one or two more minutes to get situated and then we will start the session.

Sharon Troia: (11:00) Please respond to the survey if you haven't already.

Sharon Troia: (11:02) Hi, welcome to AMD’s Multi-core text chat!

Sharon Troia: (11:02) My name is Sharon Troia. I’ll be your host for this event. I work at AMD as the managing editor for the tech articles that we provide on Dev Central, developer.amd.com.

Sharon Troia: (11:02) I’d like to introduce you to our experts Mike Wall and Richard Finlayson.

Sharon Troia: (11:02) Mike Wall is a Sr. Optimization Engineer with experience in working with software partners on their optimization and multithreading projects. He’ll answer your technical questions about multi-core and multithreading and have some slides for you to reference during this chat session.

Sharon Troia: (11:03) Richard Finlayson is the Director of our Developer Outreach Team and is here to talk about Multicore resources available for you through our developer program on developer.amd.com.

Sharon Troia: (11:03) One more thing I should mention before we get started - don’t forget we will be conducting a drawing to give away an AMD dual-core workstation! Make sure you registered with a valid email address, we’ll do the drawing and notify the winner shortly after the session ends.

Sharon Troia: (11:03) Ok, let’s get started!

Sharon Troia: (11:03) It looks like the majority of people are pretty skilled. Take it away, Mike!

Mike Wall: (11:03) > The main message is "multi-core is here to stay" and the trend is toward a larger number of cores. So developers who care about performance really need to design scalable multi-threaded code. Scalable multi-threading is the main take-away idea for this chat session.

Mike Wall: (11:04) > OK, so why hasn't everyone already threaded their performance-critical code? Lay it on me...

Michael Lewis: (11:05) rewriting 300KLOC is not my idea of a fun weekend :-)

David Shust: (11:05) Who says I haven't?

Mike Wall: (11:05) Re: Who says I haven't?
hee hee

Douglas Campbell: (11:05) We have, we're looking for cheaper hardware to do what we use NUMA stuff for now

zirani jean-sylvestre: (11:05) the slide shows we are going to have more and more cores; do you think developers should plan on writing apps running on > 8 cores ?

Mike Wall: (11:06) you will likely see >8 cores in the coming years, yes

Fabrice Lété: (11:05) I am scared of debugging getting a lot harder

Frank Morales: (11:06) i agree w/ the debugging.

Veikko Eeva: (11:06) Writing multithreaded software with current tools is a rather difficult task, especially since it's not at all clear how to find the performance-critical parts of the code that can easily be multithreaded.

Stephen Lacy: (11:06) where can I find documentation to answer architectural questions such as the mechanism for cache-to-cache communication and the latencies involved?

Richard Finlayson: (11:07) Re: We have, we're looking for cheaper hardware to do what we use NUMA stuff for now
>What platform are you on, and what platforms will best meet your needs based on your NUMA comment?

Mike Wall: (11:07) all the major debuggers have some degree of threading support, but yes, it's harder. AMD CodeAnalyst supports some multi-thread perf analysis

zirani jean-sylvestre: (11:07) do you think (co)processors with, let's say, 512 or 1024 cores will be standard in > 4 years ?

robert marshall: (11:07) i'm still wedded to tools like vtune to show me where i need parallelism; i go to that section and attack that area of the problem. i want better tools; better languages.

Douglas Campbell: (11:07) SGI. We've actually test ported to an athlon 64 dual core and we are currently seeing better performance than on our SGI equipment.

Mike Wall: (11:08) yes, you need to profile and look for opportunities for data-parallel threading

Richard Finlayson: (11:08) Re: where can I find documentation to answer architectural questions such as the mechanism for cache-to-cache communication and the latencies involved?

>Stephen, check out AMD Developer Central. Resources include:
Documentation, developer tools, product information, AMD64 ecosystem information, technology features & benefits, and in-depth technical documents on all topics related to software development on AMD64.

Mike Wall: (11:09) > Clean separation of data workloads is critical; don't pass a lot of data between threads. Also, as in single-threaded programming, try to read data only once from memory, and do all relevant processing while it's in cache. Work on small blocks if necessary.

Sharon Troia: (11:09) The slides being shown are only for reference. This is a text based chat event with no audio. All questions will be answered through the text chat window.

David Shust: (11:09) If a cache line is written by one core, is that memory updated in the other cores' caches only if they access the same memory location?

Mike Wall: (11:10) see the slide

Mike Wall: (11:10) there is real sharing, when threads write the same location... and "false sharing"

Veikko Eeva: (11:09) It'd be nice to see how C++0x will implement its threading model. Does anyone have a comment regarding this? Like tool support with C++ and threads. :)

Mike Wall: (11:12) I don't understand the question, sorry

Michael Lewis: (11:09) Can you recommend any strategies for dealing with situations where the majority of operations in a data-parallel model can be done atomically, but some require dependencies?

Mike Wall: (11:09) > When possible, use libraries that already implement multi-threading. Don't reinvent the wheel (unless you really want to!) Implement data-parallel multi-threading, for best scaling to N cores. See the slides for more details. Test on bleeding-edge hardware, i.e. 4x more cores than your current customers use. Testing and optimizing desktop software on a 2-socket or 4-socket workstation is a really good idea!
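
A minimal sketch of the data-parallel pattern described above, assuming standard C++ threads rather than any particular library shown on the slides. The function name parallel_sum and the summing workload are hypothetical stand-ins; the point is that each thread reads only its own slice, accumulates locally, and hands back a single result.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` using `num_threads` workers. Each worker reads only its own
// contiguous slice and accumulates into a local variable, so no data is
// passed between threads until the final combine step.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    num_threads = std::max(1u, num_threads);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            long long local = 0;  // accumulate locally, not in shared memory
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;   // one shared write per thread, at the end
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Because the slice boundaries depend only on the thread count, the same code runs unchanged on 2, 4, or more cores; only the num_threads argument changes.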

Douglas Campbell: (11:09) Shust's question is important -- cache coherency

Mike Wall: (11:12) yes, any time someone writes to a location, other caches get that line flushed if they are storing that data.

Richard Finlayson: (11:10) Re: where can I find documentation to answer architectural questions such as the mechanism for cache-to-cache communication and the latencies involved? Also, check out Architecture Programmers Manuals. A set is available to anyone who views and participates in the Multicore Video Roundtable. The Multicore Video Roundtable includes five video chapters, with a forum feature to allow you to discuss multicore issues. You will find this on the main page at AMD Developer Central

Douglas Campbell: (11:10) ach, the slides are coming too fast!

Stephen Lacy: (11:11) Richard, the programmers manuals I have downloaded have scant references to multi-core. But perhaps there's a set of manuals I have missed?

zirani jean-sylvestre: (11:12) Mike; malloc does contain "hidden" locking, doesn't it ? this may have an impact on performance

Richard Finlayson: (11:12) Re: Richard, the programmers manuals I have downloaded have scant references to multi-core. But perhaps there's a set of manuals I have missed? Stephen, updates are published regularly. Multicore content will continue to be added. Updates can be expected at least twice a year.

Dick Dunbar: (11:12) So cache line is 64 bytes ?

Veikko Eeva: (11:12) Or is malloc basically a lock-free algorithm? I could imagine it could be implemented that way.

Douglas Campbell: (11:13) I note cache alignment to avoid cache thrashing -- is there anything to do to prevent cache invalidation between processors -- such as not aligning on a certain number of low order bits as in the SGI case?

paul cheyrou: (11:13) where can I find an exhaustive, comprehensive "standard rules of threaded programming" that I can point other developers at ? (my knowledge comes from lots of resources and personal experience), in a team, making people use threaded programming requires a common basis... so a "standard" is what I'm looking for...

Larry Wang: (11:13) Yee, I lost the front part of this chat, and it's kind of fast - can we have the whole session replayed somehow later, after the chat?

Sharon Troia: (11:13) Re: Yee, I lost the front part of this chat, and it's kind of fast - can we have the whole session replayed somehow later, after the chat?
>We are going to post transcripts - and possibly even a playback of the slides.

Dick Dunbar: (11:13) malloc is system dependent. AIX has a lockfree algorithm ... others do not.

Mike Wall: (11:14) the L1 cache is 2-way associative, so data that is addressed modulo-32K can only live in 2 places... avoid lots of "exactly 32k" blocks
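
The modulo-32K remark can be made concrete with a little arithmetic. This is a sketch under assumed geometry: 2 ways of 32 KB each and 64-byte lines, chosen to match the numbers quoted in this chat. The helper names are hypothetical; check the optimization guide for the actual figures for a given processor.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed geometry: 2 ways x 32 KB per way, 64-byte lines (illustrative
// constants chosen to match the numbers quoted in the chat).
constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kWayBytes  = 32 * 1024;
constexpr std::size_t kNumSets   = kWayBytes / kLineBytes;  // 512 sets

// Index of the cache set an address maps to.
constexpr std::size_t l1_set_index(std::uintptr_t addr) {
    return (addr / kLineBytes) % kNumSets;
}

// Two addresses compete for the same 2 slots when their set indices match,
// which is the case whenever they are a multiple of 32 KB apart.
constexpr bool same_l1_set(std::uintptr_t a, std::uintptr_t b) {
    return l1_set_index(a) == l1_set_index(b);
}

// Pad a stride that is an exact multiple of 32 KB by one cache line, so
// consecutive blocks stop landing in the same set.
constexpr std::size_t padded_stride(std::size_t stride_bytes) {
    return (stride_bytes % kWayBytes == 0) ? stride_bytes + kLineBytes
                                           : stride_bytes;
}
```

With only two ways per set, a loop that touches three or more buffers spaced exactly 32 KB apart keeps evicting its own data; adding one line of padding to the stride spreads the buffers across different sets.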

Douglas Campbell: (11:14) Thanks Mike

Mike Wall: (11:15) this is a separate issue from coherency, though... two threads should avoid sharing the same 64-byte cache line whenever possible
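
One common way to keep two threads off the same 64-byte line is to align and pad per-thread data to the line size. The sketch below assumes C++11 alignas; the PaddedCounter type and the on_different_lines helper are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // line size quoted in the chat

// Each counter is aligned to, and padded out to, a full cache line, so two
// threads updating neighboring array elements never touch the same line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "line-aligned");

// Hypothetical helper: true if two objects start in different 64-byte lines.
inline bool on_different_lines(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine !=
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

An array such as PaddedCounter counters[4] then gives each thread its own line, at the cost of 56 unused bytes per counter.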

Stephen Lacy: (11:15) What is the latency for cache-to-cache communication? I.e., same data in both L1 caches, then one core does a write, and the other core does a read to get the new data.

Mike Wall: (11:17) I don't know the exact number, but it's longer than accessing your own core's cache for sure

Chris MacGregor: (11:15) Mike, can you say more about that? ('so data that is addressed modulo-32K can only live in 2 places... avoid lots of "exactly 32k" blocks')

David Shust: (11:15) If I write 3 sets of 3 mmx registers (temp vars), each set contiguously, starting on 256byte boundaries, but skip the 4th 64byte value of each set, meaning I only write at the beginning of cache lines, but do not fully write the cache lines, do they never go into first level cache? Can these "variables" actually just remain in the processor proper, in the cache line assembling hardware? I only need 3 cache lines for this. All my actual memory output is streaming, bypassing cache.

Sharon Troia: (11:15) Re: re: developer.amd.com: I have had a very hard time finding detailed info on the different processors available. Particularly, as I spec systems that will be used for scientific computing, I want to compare things like cache (L1, L2, per-core, total, etc.), cores, core interconnect, clock speed, FSB, etc., but it's a huge pain (if possible at all) to find the info I want. Someone go kick the marketing people - I can't get 'em to buy AMD instead of Intel if they get a clear (misleading or not) story from Intel but confusion from AMD.
> Check out the Opteron Comparison website, this should give you the info you are looking for

Mike Wall: (11:16) look at the optimization guide, on developer.amd.com for more info about this

Chris MacGregor: (11:16) Sharon: does the opteron comparison cover the Athlon X2, FX-nn, etc. processors?

Chris MacGregor: (11:16) part of the problem was the difficulty in comparing Athlon XP (at the time) to FX-nn to Opteron, etc.

Sharon Troia: (11:17) Re: Sharon: does the opteron comparison cover the Athlon X2, FX-nn, etc. processors?
>Yes, you can compare dual-core with opteron

Miguel Pedro: (11:17) What strategy do you recommend for best performance with N threads: let the OS take care of scheduling each thread to a specific core dynamically or should I set the thread affinity myself?

Mike Wall: (11:19) see the slide, you can gain perf in memory-intensive apps using NUMA if you're careful

Veikko Eeva: (11:17) This is something I'd like to know also.

Mike Wall: (11:19) Check the slide, allocating local memory can be done
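
For readers without the slides, here is a rough Linux-only sketch of setting thread affinity by hand. It is an assumption on my part, not what the slides show: it uses the glibc calls sched_getaffinity and sched_setaffinity, and the function name pin_to_first_allowed_cpu is hypothetical. Under Linux's default local-allocation policy, memory a pinned thread touches first is normally placed on that thread's NUMA node, which is one way to get the local memory mentioned above.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity (Linux/glibc)

// Pin the calling thread to the lowest-numbered CPU it is currently allowed
// to run on. Returns that CPU number, or -1 on failure.
int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            // A pid of 0 means "the calling thread".
            return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
        }
    }
    return -1;
}
```

Whether pinning helps depends on the workload; leaving scheduling to the OS is the simpler default, and hard affinity is worth measuring before committing to it.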

Mike Wall: (11:18) cache lines are always treated as an "atomic" unit. Streaming stores avoid allocating any cache line,

Richard Finlayson: (11:18) While Mike answers Miguel and Veikko's question, allow me to highlight AMD's Developer Central resources. AMD Developer Central is AMD’s online resource to support and engage software developers of all interests on AMD64. The “jewels in the crown” so to speak, are the depth of technical documents on a wide variety of development related issues as well as free downloadable tools that enhance performance and allow for straightforward optimization of your applications.

Douglas Campbell: (11:19) Of course, us *nix folk are thrilled by these slides...

Mike Wall: (11:19) sorry 8-/

Mike Wall: (11:20) the *nix systems support NUMA also

Michael Lewis: (11:20) in my situation I have a lot of data points which are all updated in discrete time steps; usually each point can be updated without accessing any other data point, but there are cases where a point may have to look at its "neighbors" to be updated in a step. Can you recommend any algorithmic (or low-level) strategies for minimizing the cost of these dependent updates?

Stephen Lacy: (11:20) I've had a very difficult time finding latency and performance numbers in AMDs docs. This is a big barrier to investigating the use of multithreading -- you can't construct a good parallelization strategy without knowing the communication overheads. Do the Developer Central resources have concrete numbers on questions such as these (e.g., cache-to-cache latency).

Veikko Eeva: (11:20) Does virtualization have a significant impact on writing multicore aware code?

Sharon Troia: (11:23) Re: Does virtualization have a significant impact on writing multicore aware code?
>As long as you use the OS methods to determine how many cores are there, then writing multicore code should be fine on a virtual machine
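
A small sketch of asking the system for the core count instead of hard-coding it, assuming the portable C++11 call std::thread::hardware_concurrency. The wrapper name worker_count is hypothetical; platform-specific OS calls would serve the same purpose.

```cpp
#include <algorithm>
#include <thread>

// Number of worker threads to start. hardware_concurrency() is a hint about
// how many hardware threads are available (inside a virtual machine, the
// virtual cores the guest sees) and may return 0 when it cannot tell.
unsigned worker_count() {
    return std::max(1u, std::thread::hardware_concurrency());
}
```

Sizing the thread pool from this value at startup lets the same binary adapt to a 2-core desktop, a 4-socket workstation, or a guest with fewer virtual cores.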

Richard Finlayson: (11:20) You can give us feedback on Dev Central resources via this email address:

Douglas Campbell: (11:20) Mike, of course -- where do you think it all started?

Mike Wall: (11:21) can you double-buffer your entire data set, so it's "read only" ?
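
The double-buffering idea can be sketched as follows, assuming a simple 1-D neighbor average as a hypothetical stand-in for the real per-point update rule. The names step and run are illustrative only. Every read comes from the current buffer and every write goes to the next one, so a point may look at its neighbors without locking.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One time step of a 1-D neighbor-averaging update. All reads come from
// `cur` and all writes go to `next`, so the loop could be split across
// threads by index range without locks.
void step(const std::vector<double>& cur, std::vector<double>& next) {
    const std::size_t n = cur.size();
    next.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double left  = cur[i == 0 ? 0 : i - 1];  // clamp at the edges
        const double right = cur[i + 1 == n ? i : i + 1];
        next[i] = (left + cur[i] + right) / 3.0;
    }
}

// Advance `steps` time steps, swapping the two buffers after each one.
std::vector<double> run(std::vector<double> cur, int steps) {
    std::vector<double> next(cur.size());
    for (int s = 0; s < steps; ++s) {
        step(cur, next);
        std::swap(cur, next);  // O(1): swaps internal pointers, copies no data
    }
    return cur;
}
```

The trade-off is memory: two full copies of the data set are live at once, which matters when the data already approaches the address-space limit.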

Dick Dunbar: (11:21) Are there "touch" instructions that allow committed memory, without making the cacheline dirty?

Mike Wall: (11:21) indeed

Mike Wall: (11:21) no

Michael Lewis: (11:21) would be prohibitive - we're talking millions of data points, i.e. close to the 4GB practical addressing limits

Mike Wall: (11:22) work a chunk at a time, special case for overlapping areas?

David Shust: (11:21) Mike: I understand that when I dump my results I will not be using cache lines. So my question is, if I'm just continually using 3 sets of 3 contiguous 64byte values, will these data just stay in the cache lines? Or do they get written and read to the cache? This is for a real time fractal program that recalculates about 20million triangles per second, per core.

Mike Wall: (11:22) they stay in the cache line

Dick Dunbar: (11:22) Mike: Was that "no" directed to my "touch" question?

Mike Wall: (11:23) yes, sorry

Michael Lewis: (11:22) sure, but the catch is it is not really feasible to predict where overlaps will occur (hindsight of course is easy)

Larry Wang: (11:23) It looks like I am a beginner to all these multi-core techs, though recently I purchased two dual core AMD 64-bit Opteron workstations. I would like to find an introduction to parallel processing architectures/techniques so I may later fully explore their capabilities.

Richard Finlayson: (11:23) Re: It looks like I am a beginner to all these multi-core techs, though recently I purchased two dual core AMD 64-bit Opteron workstations. I would like to find an introduction to parallel processing architectures/techniques so I may later fully explore their capabilities.

Miguel Pedro: (11:23) Thanks Mike

Michael Lewis: (11:23) I realize the case is a bit vague but I'm casting around for some kind of lock-free method that can chunk through the main cases, and then queue the collisions for special processing later; but this may be a language-level problem rather than architecture (i.e. backing out when we detect collision is not expressible in C++)

Mike Wall: (11:25) we don't have any special HW help for you

robert marshall: (11:24) I design commodity clusters for HP scientific Grand Challenge apps such as Weather Forecasting. Multi-cores mean cheaper clusters, but my users are used to thinking in terms of FORTRAN loops to match cache strides and having MPI abstract away their inter-processor communications. The jump to threading is an architectural leap that I will somehow have to solve for them unless new tools, libraries, or language extensions (i.e. to FORTRAN) come about.