Other miscellaneous issues/problems noticed in the UPC specification

Dan Bonachea

* Is there a public list of proposed specification revisions & clarifications somewhere? Could we create one? (this should give you a good start)

Changes approved in the specifications working group

* See attached list of issues regarding UPC locks (approved by the specifications working group)

* upc_phaseof(shared pointer with indefinite block size) - There has been a good deal of discussion about the correct semantics for this call (it's ambiguous in the current spec). I won't revisit all the arguments here, but the rough consensus seems to be that it should be defined to return zero - this doesn't affect the expressive power of the language, but it provides standardization (which is good) and enables a few optimizations that we feel could be important. The specifications working group at the Washington meeting approved this change.

6.3.4 existing text: “If the shared array is declared with indefinite block size, the result of the shared pointer arithmetic is identical to that described for normal C pointers in [ISO/SEC00 sec. 6.5.6], except that the thread of the new pointer shall be the same as that of the original pointer”

Proposed addition: “and the phase field is defined to always be zero”
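To make the proposed rule concrete, here is a minimal C sketch that models pointer-to-shared arithmetic. It is only an illustration: the model_sptr struct, the model_sptr_add function, and the convention that block == 0 stands for indefinite block size are all hypothetical names and choices of mine, not anything from the spec or from a real UPC runtime. It also assumes non-negative increments and a round-robin block layout across threads.

```c
#include <stddef.h>

/* Hypothetical model of a pointer-to-shared (not a real runtime layout). */
typedef struct {
    size_t thread;  /* thread with affinity to the referenced element */
    size_t phase;   /* position of the element within its block */
    size_t offset;  /* element offset within that thread's local portion */
} model_sptr;

/*
 * Advance p by n elements (n >= 0 in this sketch).
 * block == 0 models indefinite block size: the thread is unchanged and,
 * per the proposed addition, the phase is always zero.
 * For block > 0 the caller must keep p.phase == p.offset % block.
 */
model_sptr model_sptr_add(model_sptr p, size_t n, size_t block, size_t threads)
{
    model_sptr r;

    if (block == 0) {
        r.thread = p.thread;      /* same thread as the original pointer */
        r.phase  = 0;             /* phase field defined to be zero */
        r.offset = p.offset + n;  /* ordinary C pointer arithmetic */
        return r;
    }

    /* Linear element index of p in the round-robin blocked layout. */
    size_t row = p.offset / block;
    size_t lin = (row * threads + p.thread) * block + p.phase + n;

    size_t gblock = lin / block;  /* global block number of the result */
    r.phase  = lin % block;
    r.thread = gblock % threads;
    r.offset = (gblock / threads) * block + r.phase;
    return r;
}
```

For example, with 4 threads and block size 3, advancing a pointer at thread 0, phase 0 by 7 elements lands on thread 2 with phase 1, while the same increment with indefinite block size stays on the original thread with phase 0.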

* upc_phaseof(NULL) and upc_threadof(NULL) should return 0 – not currently addressed in the spec.

Proposed addition (to section 6.3.4 or 7.3.5/7.3.6): The thread field and phase field for a NULL shared pointer are defined to be zero.

* 7.3.2 - upc_all_alloc() states that memory remains live "until _all_ threads have deallocated the object" - this seems contrary to what users expect and means that memory allocated with all_alloc behaves "differently" with respect to upc_free() than memory allocated with local_alloc or global_alloc. It also seems contrary to the semantics of upc_free(), which simply claims that "upc_free() frees the dynamically allocated shared memory" and "if the space has been deallocated by a previous call to upc_free, the behavior is undefined". In any case, the spec is ambiguous about how it should behave.

We can see cases where the "everyone deallocates" semantics might be handy, but they seem to have much greater potential for harm than good - imagine a program that creates a large graph in shared memory where some objects are created with all_alloc and others with local_alloc. Now the user has to explicitly track which objects are which in order to correctly free the graph (which adds a new level of complexity to the traditional problems with explicit deallocation). Note that most or all of the cases where you might want to call upc_free from all threads can be trivially transformed by replacing them with {upc_barrier; if (MYTHREAD==0) upc_free(ptr);}

Proposed change approved in the working group:

7.3.2, old text: “… until all threads have deallocated the object”

7.3.2, new text: “… until any thread has deallocated the object”
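The difference between the old and new wording can be sketched with a toy C model. Everything here is hypothetical and for illustration only: model_obj, model_live_old, and model_live_new are my own names, and the model simply records which threads have called upc_free() on an object obtained from upc_all_alloc().

```c
#include <stdbool.h>

#define MODEL_MAX_THREADS 64

/* Hypothetical record of which threads have freed an all_alloc'ed object. */
typedef struct {
    int  threads;                        /* number of UPC threads modeled */
    bool freed_by[MODEL_MAX_THREADS];    /* freed_by[t]: thread t called upc_free */
} model_obj;

/* Old text: the object stays live until ALL threads have deallocated it. */
bool model_live_old(const model_obj *o)
{
    for (int t = 0; t < o->threads; t++) {
        if (!o->freed_by[t]) {
            return true;   /* at least one thread has not freed it yet */
        }
    }
    return false;
}

/* New text: the object stays live until ANY thread has deallocated it. */
bool model_live_new(const model_obj *o)
{
    for (int t = 0; t < o->threads; t++) {
        if (o->freed_by[t]) {
            return false;  /* a single free ends the object's lifetime */
        }
    }
    return true;
}
```

Under the new text a single free ends the lifetime, which matches how upc_free() already behaves for memory from the other allocators, so code that frees a mixed graph does not need to know which allocator produced each node.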

Other issues brought up in the specification working group (by people other than myself)

Kathy and others should be sending you the full notes on this; I’ll just list the issues here.

  • the phase field for the pointer to the beginning of a shared array object should be 0
  • atomicity of shared reads/writes – “tearing” and “clobbering” values
  • change “shared pointer” to “pointer-to-shared” everywhere in doc
  • upc_strict compound statement (postponed)
  • block sizes on scalars (shared [3] int x;)
  • externally-generated signals should have undefined semantics in UPC programs
  • extend definition of upc_fence to say that it polls the network (on systems where this is required to prevent deadlocks caused by spinning on local pointers to shared data)
  • eliminate the 5 “legacy” macros defined in 7.1
  • clarify that: shared [*] int vals[some_big_value_violating_max_blocksize]; is a compile error
  • Problems with linking non-UPC libraries (containing static data) with UPC programs, especially with thread-based UPC implementations – in particular stdin/stdout/stderr and non-thread-safe libraries

Other Issues/Questions, not brought up in the working group

(in order by section)

* 5.1.2.2 - what are the semantics of C termination functions like abort() in UPC? Should they terminate the entire job or just a single thread?

* 5.1.2.3 - in the second bullet point, "e" is an unbound variable and this makes the meaning unclear. Is e supposed to be "forall e in threads"?

* 6.2.1&2 - "non-modifiable lvalue" is not enforceable in C unless you prohibit the & operator on these quantities.

* 6.3.1&2 - We should explicitly disallow: upc_localsizeof(*p) and upc_blocksizeof(*p) where p is a shared pointer - *p is a unary expression which denotes a shared object, but the blocksize/localsizeof *p is statically undecidable in general.

* 6.3.2 - What does upc_blocksizeof() return when called on a non-array type? Is it illegal?

* 6.3.5 - casts from shared pointer to local pointer - is this type conversion implicit on assignment and function call? Or must the cast expression be explicitly specified?

* 6.4.2 – “The block size is a part of the type compatibility” – this is a very powerful statement, which should probably be clarified with a “for example this implies…”. Some important implications I’ve encountered, which may not be immediately obvious to a casual reader:

1. You cannot subtract 2 shared pointers with different block sizes (without first casting one)

2. If T and S are shared pointer types with different block sizes, then it is not legal to alias a T* with a S*

3. sizeof(shared [1] char *) may differ from sizeof(shared [] char *) or sizeof(shared void *)

(by the way, we plan to exploit #2 and #3 in our compiler by using different shared pointer representations based on block size)

* 6.5.1 - barrier semantics - point 4 and point 9 seem to be contradictory. I have no idea what you meant by this.

* 6.5.1 - barrier values – do “not given” values match “given” values? or only other “not given” values?

* 6.5.2 - semantics of upc_forall - point 8 specifies "If any iteration of a upc_forall statement (loop body or control expressions) produces a side-effect needed by another iteration of the same upc_forall statement, the result is undefined" - however, even the most common control expression idiom ( upc_forall(i=0;i<N;i++;continue) {...i...} ) has control expressions that produce a side effect needed by the control expressions (and often the loop body) of subsequent iterations of the loop (i.e. incrementing the induction variable). I'm not sure what the intention of this point was, but this clearly wasn't it. If you intended to say that iterations of the loop body executing on different threads (as governed by the affinity parameter) have no synchronization relationship to each other, then explicitly say so. If you meant that different iterations of the loop may be arbitrarily reordered with respect to each other (a much stronger statement), then you should definitely clarify that.
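The dependence on the control expressions can be seen in a small C model of how one thread might execute upc_forall(i=0; i<n; i++; i). This is a sketch under stated assumptions, not how any particular compiler implements the construct: model_forall_count and the hits array are hypothetical names, and the sketch assumes an integer affinity expression, for which the body runs on the thread where i % THREADS == MYTHREAD.

```c
#include <stddef.h>

/*
 * Model of one thread's execution of upc_forall(i=0; i<n; i++; i).
 * Every thread evaluates the control expressions for every iteration,
 * so each value of i depends on the i++ side effect of the previous
 * iteration. Only the loop body is filtered by the affinity test.
 * hits[i] is incremented when this thread runs the body for iteration i.
 * Returns the number of bodies executed by this thread.
 */
size_t model_forall_count(size_t n, size_t mythread, size_t threads, int *hits)
{
    size_t executed = 0;

    for (size_t i = 0; i < n; i++) {      /* control expressions: all threads */
        if (i % threads == mythread) {    /* integer affinity test */
            hits[i] += 1;                 /* stand-in for the loop body */
            executed++;
        }
    }
    return executed;
}
```

Calling this once per modeled thread with a shared hits array shows each iteration's body running exactly once overall, even though every thread steps its private induction variable through all n values.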

* 7.3 - What is the out-of-memory behavior of the allocation functions? Implementation defined error? Return null pointer? If the former, it seems better to state that the “result is undefined” than to leave it silently unspecified.

* 7.3.2 add the constraint: "the nblocks and nbytes actual parameter values passed to this function for a given call must be the same on all threads"

* 7.3.2 and 7.4.4: “implied synchronization before all threads execute the function call” – what does this language mean? Does it mean that these calls must act as a barrier (i.e. “implied barrier synchronization”)? Or does it simply mean all threads must execute the call during the same synchronization phase, with corresponding calls across threads executing in a fixed total order? (This weaker property permits more optimized implementations, i.e. P0 need not wait for Pn to reach the call before proceeding to do other work. Simple implementations are still permitted to include a barrier call, if necessary for correctness.) I’d prefer the second semantics, but either way, we need to make it clear to users whether or not they can rely on barrier synchronization at these points.

* 7.5 – our applications guy (Parry Husbands) feels the bulk memory copy functions (upc_memcpy, upc_memget, upc_memput) should be augmented with explicitly non-blocking versions to better support explicitly bulk-synchronous applications. We’re not sure what the interface should be (or if the collective operations group is already looking at this) but we feel there is considerable opportunity for performance improvement by exposing a non-blocking bulk transfer to the programmer.