Bidirectional ambiguities in Internet identifiers
-------------------------------------------------

Roy Badami

WORKING DRAFT

Internet identifiers such as domain names, URIs and e-mail addresses are being augmented to allow them to contain characters from scripts other than the Latin alphabet. When such internationalized identifiers contain characters from right-to-left scripts, it is possible for two logically distinct identifiers to be displayed identically. Where two such identifiers are under different administrative control, security issues may result from the ambiguity.

This paper analyses the ambiguities that can occur in Internet identifiers with reference to a display model that has been proposed for such identifiers. The potential for ambiguity between administrative domains is analysed within the context of internationalized domain names (IDNs) and the proposals for internationalized resource identifiers (IRIs). Recommendations are made concerning the choice of IDNs and IRIs in order to avoid such ambiguities. Issues related to the design of new internationalized identifiers, such as internationalized mail addresses (IMAs), are also discussed.

Notation

Unless otherwise noted, upper case in examples indicates a right-to-left script (i.e. bidirectional type R or AL). Examples prefixed with an asterisk are examples of invalid identifiers. Examples prefixed with a parenthesized asterisk are valid, but contrary to the recommendations in this document.

Standard display model

This paper assumes a standard display model that has been proposed for the display of identifiers such as IDNs, Internationalized Resource Identifiers (IRIs) and Internationalized Mail Addresses (IMAs).

The standard display model is that an Internet identifier is rendered according to the Unicode bidirectional algorithm, in a left-to-right context.
Specifically, the identifier is rendered at an even embedding level, such that the identifier constitutes the sole text in the level run (or by any other algorithm that gives equivalent results). This can be achieved, for example:

* By rendering the identifier in a paragraph of its own, with the paragraph embedding level specified as even by a higher level protocol.

* By prefixing the identifier with U+202A LEFT-TO-RIGHT EMBEDDING and suffixing it with U+202C POP DIRECTIONAL FORMATTING prior to rendering by the bidirectional algorithm.

* By prefixing and suffixing the identifier with U+200E LEFT-TO-RIGHT MARK prior to rendering by the bidirectional algorithm at an even embedding level.

Control characters and formatting characters

This paper assumes that identifiers do not contain any formatting characters. Specifically, they are assumed not to contain the bidirectional formatting characters LRM, RLM, LRE, RLE, LRO, RLO and PDF, or any characters of bidirectional type BN, B or S. Hence, an identifier may contain characters of types L, R, AL, EN, ES, ET, AN, CS, NSM, WS and ON.

Bidirectional ambiguities

The bidirectional algorithm consists of two parts:

* resolving characters to bidirectional embedding levels
* reordering characters according to their bidirectional embedding level

This process converts logical order to display order. Where the embedding level of every character is known, the reordering can unambiguously be reversed; however, in some cases the embedding level cannot be uniquely determined given the identifier in display order, and this results in ambiguity.

The possible embedding levels that can result for each character type in a nameprepped identifier (assuming a base embedding level of 0) are shown below.

    Type             Embedding level
    L                0
    R or AL          1
    EN or AN         0 or 2
    ES, ET or CS     0, 1 or 2
    NSM              0, 1 or 2
    ON               0 or 1

It is clear from the above that as long as the identifier contains only characters of strong types (L, R and AL), the embedding levels are uniquely defined, and no ambiguity can occur. The potential for ambiguity occurs when characters with weak or neutral types appear in the identifier. The following sections discuss the weak and neutral types in more detail.

Weak type NSM: Non-Spacing Mark

Characters of type NSM are combining characters that modify the character they follow. Since they always apply to the character that precedes them in logical order, no ambiguity arises provided they are rendered accordingly.

It is briefly noted here that different sequences of characters in logical order can be mapped to the same sequence in display order. However, these strings will still display differently when correctly rendered. In the following example, ' represents a character of type NSM.

    a'A    (logical order)    (1)
    a'A    (display order)

    aA'    (logical order)    (2)
    a'A    (display order)

However, in (1), the non-spacing mark will be applied to the character a, and in (2) the non-spacing mark will be applied to the character A, so no ambiguity arises.

Type NSM is not discussed further in this paper.

Neutral type ON: Other Neutral

Given the presence of only strong and neutral types, again no ambiguity occurs. Because the directionality of the base embedding is always left-to-right, neutral types will resolve to type L unless both the closest preceding strong character is right-to-left (type R or AL) and the closest following strong character is right-to-left.

Because neutral characters that resolve to type R will always occur within a level run (never as its first or last character), the embedding level of a neutral character can always be determined from display order (in the absence of weak characters).
Specifically, if in display order a sequence of consecutive neutral characters is bracketed by right-to-left characters, then its embedding level is 1; otherwise its embedding level is 0.

Hence it can be seen that, within the context of the standard display model, weak characters are the sole source of bidirectional ambiguity.

Weak types EN and AN: European number and Arabic number

In the absence of separator and terminator types (discussed below), the number types (types EN and AN) behave identically. However, they may resolve to type L (level 0) or remain type EN or AN (level 2). Specifically, a character of number type will have an embedding level of 2 if and only if the closest preceding strong character (in logical order) is a right-to-left character. However, this cannot be unambiguously determined from display order.

    123ABC       (logical order)
    123CBA       (display order)

    ABC123       (logical order)
    123CBA       (display order)

    abc123ABC    (logical order)
    abc123CBA    (display order)

    abcABC123    (logical order)
    abc123CBA    (display order)

This type of ambiguity is referred to in this paper as number embedding level ambiguity.

A sufficient condition to avoid number embedding level ambiguity is that the following holds for each number character in the identifier:

The directionality of the closest preceding strong character must be the same as the directionality of the closest following strong character. For the purposes of this condition, the beginning of the string and the end of the string are treated as strong characters with a left-to-right directionality.

The above is also a necessary condition to avoid ambiguity in the absence of further restrictions on the form of the identifier. (If one of the possible logical orders corresponding to the display order is declared invalid, then no ambiguity will occur. This issue is discussed further later in this paper.)
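The sufficient condition above can be checked mechanically. The following is a minimal sketch in Python, not a definitive implementation. It assumes the notation of this paper's examples, so ASCII upper case is treated as type R; all other characters are classified with the standard library function unicodedata.bidirectional. The names bidi_type, direction and number_condition_holds are hypothetical helpers introduced for illustration only.

```python
import unicodedata

RTL_TYPES = {"R", "AL"}
NUMBER_TYPES = {"EN", "AN"}


def bidi_type(ch):
    """Return the bidirectional type of a single character."""
    # Assumption for illustration: ASCII upper case stands for a
    # right-to-left character, following the notation of the examples.
    if "A" <= ch <= "Z":
        return "R"
    return unicodedata.bidirectional(ch)


def direction(bidi):
    """Map a bidirectional type to 'L', 'R', or None if it is not strong."""
    if bidi == "L":
        return "L"
    if bidi in RTL_TYPES:
        return "R"
    return None


def number_condition_holds(identifier):
    """Check the sufficient condition for every number character.

    The start and end of the string count as left-to-right strong
    characters, as stated in the condition.
    """
    types = [bidi_type(ch) for ch in identifier]
    for i, bidi in enumerate(types):
        if bidi not in NUMBER_TYPES:
            continue
        before = next(
            (d for d in map(direction, reversed(types[:i])) if d), "L")
        after = next(
            (d for d in map(direction, types[i + 1:]) if d), "L")
        if before != after:
            return False
    return True


# The four logical-order examples above all fail the condition.
for example in ("123ABC", "ABC123", "abc123ABC", "abcABC123"):
    print(example, number_condition_holds(example))
```

A result of True means only that this one condition is met; the sketch does not test for the separator and terminator ambiguities discussed in the following sections.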
Weak types ES, ET and CS: European Separator, European Terminator and Common Separator

Number type ambiguity

Separator and terminator characters normally resolve to type ON (other neutral) and hence have no special behaviour. However, when these characters occur adjacent to number characters, they are sometimes treated as part of the number. The distinction between European numbers and Arabic numbers influences which characters are regarded as separators and terminators.

One possible form of ambiguity occurs because it may not be possible to determine from display order whether characters of type EN (European number) actually parse as European or Arabic numbers, and hence the embedding level of adjacent separator and terminator characters may be unknown. This is referred to as number type ambiguity.

Number type ambiguity is unlikely to occur in practice, however, if care is taken in the resolution of number embedding level ambiguity. Specifically, if number embedding level ambiguity is resolved using the condition in the preceding section then number type ambiguity cannot occur.

Resolving the number type requires knowledge of the type of the closest preceding strong character in logical order. Where a substring of weak and neutral characters is bracketed by left-to-right characters (or by the beginning or end of the identifier) then the closest preceding strong character in logical order is the closest preceding strong character in display order. Where the substring is bracketed by right-to-left characters then the closest preceding strong character in logical order is the closest following strong character in display order. Hence the closest preceding strong character in logical order can always be determined.

However, care may need to be taken when number embedding level ambiguity is resolved by disallowing certain identifiers.
Number separator ambiguity

A further type of bidirectional ambiguity occurs because of the way that rules W4 and W5 of the bidirectional algorithm interact. Specifically, it is not always possible to determine from display order whether a separator character (type ES or CS) parsed as a number or a neutral.

    A-123/456B    (logical order)
    B-123/456A    (display order)

    A456/-123B    (logical order)
    B-123/456A    (display order)

In order to avoid ambiguity, a check must be made for each character of type ES or CS, and every such character must pass it for the identifier to display unambiguously. For example, number separator ambiguity occurs when substrings of the following forms occur in logical order:

    R WN* ET+ EN+ (ES|CS) EN+
    R WN* EN+ (ES|CS) ET+ EN+

    R WN* EN+ (ES|CS) EN+ ET+
    R WN* EN+ ET+ (ES|CS) EN+

    R WN* ET+ EN+ (ES|CS) EN+ ET+
    R WN* EN+ ET+ (ES|CS) ET+ EN+

Where WN stands for any weak or neutral character (types EN, ES, ET, AN, CS, NSM), and R, EN, ES, ET and CS stand for any character of that type.

Note: These regexps aren't quite right, though I think that prohibiting the above may be enough to avoid ambiguity. Minor problem: WN* may end with a number, terminator or separator character, so the above covers too much. Bigger problem: figure out how to handle numbers containing multiple separators, e.g. -123/456/789. However, at the moment, although the regexps catch too much, they appear to also catch at least the obvious cases of numbers with multiple separators.

In each pair above there exists ambiguity between an identifier matching the first regular expression in the pair and an identifier matching the second regular expression. Hence, whilst it is sufficient to disallow all identifiers matching the above expressions, it is adequate only to disallow identifiers matching one expression from each pair.
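The paired expressions above can be tried out by encoding each bidirectional type as a single letter and using an ordinary regular expression engine. The following Python sketch is illustrative only and inherits the limitations acknowledged in the note above. The one-letter codes, the table PATTERN_PAIRS and the function matching_patterns are hypothetical names introduced here. Types are passed in explicitly rather than looked up per character, since the type assigned to a given character depends on the version of the Unicode data in use. The example inputs assume that "-" is read as a terminator (ET) and "/" as a separator (ES), which is how the expressions appear to treat the example in this section.

```python
import re

# One-letter code for each bidirectional type (an encoding chosen here
# purely so that the expressions can be written as ordinary regexps).
CODES = {"L": "L", "R": "R", "AL": "R", "EN": "N", "AN": "A",
         "ES": "S", "ET": "T", "CS": "C", "NSM": "M", "ON": "O"}

WN = "[NSTACM]"  # EN, ES, ET, AN, CS, NSM, as listed in the text
SEP = "[SC]"     # ES or CS

# The three pairs of expressions, without the leading "R WN*".
PATTERN_PAIRS = [
    ("T+N+{sep}N+", "N+{sep}T+N+"),
    ("N+{sep}N+T+", "N+T+{sep}N+"),
    ("T+N+{sep}N+T+", "N+T+{sep}T+N+"),
]

COMPILED = [
    tuple(re.compile("R" + WN + "*" + body.format(sep=SEP)) for body in pair)
    for pair in PATTERN_PAIRS
]


def matching_patterns(types):
    """Return (pair index, member index) for every expression that matches.

    `types` is a sequence of bidirectional type names in logical order.
    """
    encoded = "".join(CODES.get(t, "?") for t in types)
    hits = []
    for pair_index, pair in enumerate(COMPILED):
        for member_index, pattern in enumerate(pair):
            if pattern.search(encoded):
                hits.append((pair_index, member_index))
    return hits


# A-123/456B, read as R ET EN EN EN ES EN EN EN R
first = ["R", "ET", "EN", "EN", "EN", "ES", "EN", "EN", "EN", "R"]
# A456/-123B, read as R EN EN EN ES ET EN EN EN R
second = ["R", "EN", "EN", "EN", "ES", "ET", "EN", "EN", "EN", "R"]
print(matching_patterns(first), matching_patterns(second))
```

Under these assumptions the two example identifiers match the two members of the first pair, which is consistent with the statement that disallowing one expression from each pair removes the ambiguity.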
Bidirectional ambiguity in Internationalized Domain Names (IDNs)

IDNA places certain restrictions on the bidirectional types of characters occurring in domain name labels, which has an impact on the analysis of bidirectional ambiguity. In some cases potential ambiguity is resolved by disallowing one of the possible domain labels. Consider the following:

    abc123.ABC.com    (logical order)
    abc123.CBA.com    (display order)

  * abcABC.123.com    (logical order)
    abc123.CBA.com    (display order)

The label abcABC is illegal within IDNA, and hence no ambiguity arises.

The following sections discuss the ambiguities that can occur in IDNs.

Number separator ambiguity within IDNs

IDNA does not prevent number separator ambiguity. Consider the domain name labels

    A-123,456B    (logical order)
    B-123,456A    (display order)

    A456,-123B    (logical order)
    B-123,456A    (display order)

where A and B are characters of bidirectional type R and ',' is a character of bidirectional type ES or CS.

Note however that the only character of type ES that can occur in a nameprepped name is SOLIDUS, and this is not allowed in domain names that follow hostname rules (and hence set the UseSTD13Rules flag in IDNA). Also, the only character of type CS that can occur in a nameprepped name that follows hostname rules is ARABIC COMMA.

Note also that a label such as -123,456 may give rise to number separator ambiguity if its closest preceding strong character (in a preceding label) is of type R. (Of course, this label is not permitted by hostname rules.)

Number embedding level ambiguity in IDNs

Number embedding level ambiguity can occur in IDNs:

    123.ABC.com    (logical order)
    123.CBA.com    (display order)

    ABC.123.com    (logical order)
    123.CBA.com    (display order)

However, some potential instances of number embedding level ambiguity within IDNs are resolved by the restrictions within IDNA.

    abc123.ABC.com    (logical order)
    abc123.CBA.com    (display order)

  * abcABC.123.com    (logical order)
    abc123.CBA.com    (display order)

Since the domain label abcABC is prohibited by the IDNA specifications, no ambiguity arises.

In order to further discuss the interaction of bidirectional characters, it is helpful to categorise domain name labels by the bidirectional types of the characters they contain.

strong label: A domain label which contains at least one strong character (type L, R or AL).

LTR label: A strong label which contains at least one left-to-right character (type L). Such labels will never contain right-to-left characters.

RTL label: A strong label which contains at least one right-to-left character (type R or AL). Such labels will never contain left-to-right characters.

number label: A domain label which contains no strong characters, but at least one number character (type EN or AN).

number prefix label: An LTR label that contains a number character prior to the first left-to-right character.

number suffix label: An LTR label that contains a number character after the last left-to-right character.

neutral label: A domain label which contains no strong characters and no number characters.

In order to apply the criterion of section ??? above to each number character in the domain name, it is helpful to consider three cases, according to the kind of label in which the number character occurs.

1. A number character occurring in an RTL label.

Since IDNA requires RTL labels to begin and end with a right-to-left character, a number character occurring in an RTL label will always have a right-to-left character as both its closest preceding and its closest following strong character. Hence, applying the criterion of section ???, no ambiguity will arise.

2. A number character occurring in an LTR label.

LTR labels are not subject to the same restrictions as RTL labels.
Since an LTR label is not required to begin and end with a left-to-right character, it is possible that either the closest preceding strong character or the closest following strong character will be in another label. (Note that both of these cannot be true simultaneously, however, since by definition an LTR label contains at least one left-to-right character.) If that character is a left-to-right character, then of course no ambiguity will occur. Similarly, if there is no such strong character, then no ambiguity will occur.

The case of interest is therefore that when the closest strong character is a right-to-left character occurring in another label. This will necessarily be the last character of the label (for a preceding label) or the first character of the label (for a following label). Consider for example:

    abc123.456.ABC    (logical order)
    abc123.456.CBA    (display order)

It can be seen that the restrictions placed by IDNA prevent any ambiguity arising in this case, since the corresponding domain name that would display identically contains an invalid label:

  * abcABC.123.456    (logical order)
    abc123.456.CBA    (display order)

This will always be the case for number characters occurring in LTR labels (need to explain why).

3. Number characters occurring in number labels.

This is the only case where number embedding level ambiguity will occur in IDNs.

In order to check for number embedding level ambiguity in a domain name, it is necessary to identify all the number labels in the domain name. If no number labels exist, no ambiguity will arise. If number labels are present, then for each number label, the following checks must be made:

Identify the closest preceding strong label (if any) and the closest following strong label (if any) to the number label in question.

If both these labels are LTR labels, or if both these labels are RTL labels, then no ambiguity is created. If either (or both) labels do not exist, they are taken to be LTR labels.
If the closest preceding strong label is an RTL label, and the closest following strong label is a number prefix label, then no ambiguity arises. (The potential ambiguity is resolved by the restrictions within IDNA.)

    ABC.123.456com     (logical order)
    123.456.CBAcom     (display order)

  * 123.456.ABCcom     (logical order)
    123.456.CBAcom     (display order)

    ABC.123.?456com    (logical order)
    456?.123.CBAcom    (display order)

  * 456?.123.ABCcom    (logical order)
    456?.123.CBAcom    (display order)

If the closest preceding strong label is a number suffix label, and the closest following strong label is an RTL label, then no ambiguity arises. (The potential ambiguity is resolved by the restrictions within IDNA.)

    com123.456.ABC     (logical order)
    com123.456.CBA     (display order)

  * comABC.123.456     (logical order)
    com123.456.CBA     (display order)

    com123?.456.ABC    (logical order)
    com123?.456.CBA    (display order)

  * comABC.456.?123    (logical order)
    com123?.456.CBA    (display order)

Otherwise, ambiguity arises.

Number type ambiguity in IDNs

It is believed that number type ambiguity cannot occur in IDNs (is this true?).

Dealing with bidirectional ambiguity

Where two distinct Internet identifiers are displayed identically by the bidirectional algorithm, it is highly undesirable that they refer to different resources.

In many cases both identifiers will be under the same administrative control, and hence it can simply be left to the administrator of that subspace of the identifier namespace to allocate neither identifier, to allocate only one of the identifiers, or to allocate both identifiers to the same resource. Whilst a specification for a particular Internet identifier may recommend or require one of these approaches, prudent allocation of identifiers is sufficient in this case.

Consider the following IRIs:

    http://www.ABC.com/123/ABC/    (logical order)
    http://www.CBA.com/123/CBA/    (display order)

    http://www.ABC.com/ABC/123/    (logical order)
    http://www.CBA.com/123/CBA/    (display order)

Since both IRIs are under the control of the administrator of the domain ABC.com, it can be left to the administrator of the domain to avoid allocating these two IRIs to different resources.
Of course, if the administrator of the domain delegates control of different parts of their IRI namespace to different entities, they will have to formulate policies to avoid ambiguities occurring.

However, where two identifiers under different administrative control could display identically, the situation is somewhat different. In some cases the protocol may prohibit the use of one or both identifiers, hence removing the potential ambiguity. Where both identifiers are permitted by the protocol, both must generally be avoided by the administrations allocating identifiers, since another administration may allocate a visually identical identifier that references a different resource.

Note that this is true even if the protocol recommends against the use of one of the two identifiers, as long as it does not actually prohibit its use. The administration that uses the preferred identifier is still at risk from ambiguity created by another administration allocating the deprecated (but legal) identifier.

Consider the following IDNs:

    123.ABC.com    (logical order)
    123.CBA.com    (display order)

    ABC.123.com    (logical order)
    123.CBA.com    (display order)

Since these IDNs are likely to be under different administrative control, neither should be used. If the administration controlling the domain ABC.com chose to allocate the subdomain 123.ABC.com, they would be powerless to prevent the administration controlling the domain 123.com allocating the subdomain ABC.123.com and hence creating ambiguity.

A similar problem arises with the following IMAs:

    123@ABC.com    (logical order)
    123@CBA.com    (display order)

    ABC@123.com    (logical order)
    123@CBA.com    (display order)

Recommendations concerning the use of IDNs

The analysis of bidirectional ambiguity in IDNs above considers only domain names rendered as identifiers in their own right. For instance, the IMA DEF@123.ABC.com is unambiguous, even though the domain name 123.ABC.com cannot be rendered unambiguously.
However, given the ubiquity of domain names in the Internet, it is suggested that it is undesirable to use identifiers which contain domain names that cannot be rendered unambiguously, even though the identifier itself is unambiguous.

Also, given the hierarchical nature of the DNS, it is desirable that parent domains are unambiguous. For instance, even though the domain DEF.123.ABC.com is unambiguous, its parent domain 123.ABC.com is not, and hence it is suggested that the use of this domain name is undesirable.

Similarly, relative domain names are sometimes used in large organizations, and it is desirable that these can be constructed without creating ambiguity.

Recommendation I

It is recommended that domain labels matching the following regular expressions not be used in any Internet identifier.

    (^|R) WN* ET+ EN+ (ES|CS) EN+
    (^|R) WN* EN+ (ES|CS) ET+ EN+

    (^|R) WN* EN+ (ES|CS) EN+ ET+
    (^|R) WN* EN+ ET+ (ES|CS) EN+

    (^|R) WN* ET+ EN+ (ES|CS) EN+ ET+
    (^|R) WN* EN+ ET+ (ES|CS) ET+ EN+

Where WN stands for any weak or neutral character (types EN, ES, ET, AN, CS, NSM), R, EN, ES, ET and CS stand for any character of that type, and ^ matches the beginning of the label.

Rationale: These identifiers either cause number separator ambiguity in themselves, or have the potential to do so when preceded by a label ending in a character of bidirectional type R.

Note: These regexps may not be adequate. See the note following the regexps in section ??? (Number separator ambiguity).

Recommendation II (tentative)

Top level domains should be strong labels.

Rationale: Neutral or number labels are unlikely to be used as top-level domains in practice, and disallowing them allows designers of identifiers in which an IDN is followed by other internationalized text to simplify slightly the restrictions on the text following the IDN.

    http://www.123.ABC/abc    (logical order)
    http://www.123.CBA/abc    (display order)

(*) http://www.ABC.123/abc    (logical order)
    http://www.123.CBA/abc    (display order)

If IRIs or other similar identifiers required domain names (absolute or relative) to end in a strong label, then the first example above would be safe.

In addition, prohibiting number labels as top-level domains may help avoid issues of number separator ambiguity.

[It's not really clear that prohibiting this gains much, but then it's not clear that it loses much either]

Recommendation III

Where an Internet identifier consists of or contains a domain name, and that domain name contains one or more number labels, then for each number label in the domain name, the following should hold true:

1. If there is a preceding strong label within the domain name, then the closest preceding strong label should be an LTR label.

2. If there is a following strong label within the domain name, then the closest following strong label should be an LTR label.

Note that the above applies only to domain name labels, not to any other internationalized text that may precede or follow the domain name in the identifier.

Rationale: This ensures that number embedding level ambiguity is avoided in the domain name itself, and also in all parent domains, and in relative domains constructed from it.

Bidirectional ambiguity in other Internet identifiers

In many cases it may be acceptable for the design of Internet identifiers to permit ambiguity, and to rely on the entity allocating the identifiers to employ suitable administrative procedures to avoid allocating ambiguous identifiers.

However, it is suggested that any IDN component of the identifier should follow recommendations I through III above.

In addition, it is suggested that identifiers in which the domain name cannot be uniquely determined from the displayed representation should be avoided.
For example, given the following displayed IMA

    123@CBA.com    (display order)

it is impossible to determine whether the domain is ABC.com or 123.com. Therefore, it is suggested that addresses such as 123@ABC.com and ABC@123.com (both logical order) should be avoided, even though the domains ABC.com and 123.com are both consistent with recommendations I through III above.
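As a closing illustration, the label categories defined earlier and the check in Recommendation III can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: ASCII upper case is treated as type R, following the notation of the examples in this paper; labels are assumed to be separated by a full stop; and the names bidi_type, classify_label and follows_recommendation_iii are hypothetical helpers introduced for illustration. It does not check Recommendations I or II.

```python
import unicodedata


def bidi_type(ch):
    """Return the bidirectional type of a single character."""
    # Assumption for illustration: ASCII upper case stands for a
    # right-to-left character, following the notation of the examples.
    if "A" <= ch <= "Z":
        return "R"
    return unicodedata.bidirectional(ch)


def classify_label(label):
    """Classify a label as 'RTL', 'LTR', 'number' or 'neutral'."""
    types = {bidi_type(ch) for ch in label}
    if types & {"R", "AL"}:
        return "RTL"
    if "L" in types:
        return "LTR"
    if types & {"EN", "AN"}:
        return "number"
    return "neutral"


def follows_recommendation_iii(domain):
    """Check that no number label has an RTL label as its closest
    preceding or closest following strong label."""
    kinds = [classify_label(label) for label in domain.split(".")]
    strong = ("LTR", "RTL")
    for i, kind in enumerate(kinds):
        if kind != "number":
            continue
        before = next((k for k in reversed(kinds[:i]) if k in strong), None)
        after = next((k for k in kinds[i + 1:] if k in strong), None)
        if before == "RTL" or after == "RTL":
            return False
    return True


for name in ("123.ABC.com", "ABC.123.com", "DEF.123.ABC.com",
             "abc123.ABC.com", "www.123.com"):
    print(name, follows_recommendation_iii(name))
```

Because the check looks only at the closest strong labels on either side of each number label, a name that passes continues to pass when leading labels are removed, which corresponds to the rationale given for parent and relative domains.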