String literal
A string literal is a notation for representing a string value within the text of a computer program. Such notations vary almost as widely as the programming languages that use them.
The most common form delimits the string with quotation marks:
"Hi There!"
Some languages also allow the use of single quotes as an alternative to double quotes (though the string must begin and end with the same kind of quotation mark):
'Hi There!'
Note that these quotation marks are unpaired (the same character serves as both opener and closer), a holdover from the typewriter technology that preceded the earliest computer input and output devices. The Unicode character set includes paired (separate opening and closing) versions of both single and double quotes:
“Hi There!”
‘Hi There!’
The paired double quotes can be used in Visual Basic .NET.
Including Quotation Marks
Immediately the problem arises: how do you include quotation marks themselves in the strings? If the language allows the use of both styles of quotation marks (e.g. Modula-2), then you can embed one style by quoting with the other style:
"This is John's apple." 'I said, "Can you hear me?"'
but this doesn't allow for the inclusion of both styles of quotes at once within the same literal.
Some languages (such as DCL) deal with this by doubling up on the quotation marks, to indicate embedded ones:
"I said, ""Can you hear me?"""
but this leads to its own difficulties, and is not a fully general solution.
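As a rough illustration of the doubling convention, the following Python sketch builds a doubled-quote literal and reads it back. Python itself does not use this convention, and the helper names `quote_by_doubling` and `unquote_doubled` are hypothetical, introduced only for this example.

```python
def quote_by_doubling(text: str) -> str:
    """Wrap text in double quotes, doubling each embedded double quote."""
    return '"' + text.replace('"', '""') + '"'


def unquote_doubled(literal: str) -> str:
    """Reverse quote_by_doubling; assumes a well-formed literal."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("literal must start and end with a double quote")
    # Drop the outer delimiters, then collapse each doubled quote to one.
    return literal[1:-1].replace('""', '"')


literal = quote_by_doubling('I said, "Can you hear me?"')
print(literal)
print(unquote_doubled(literal))
```

The sketch assumes the input literal is well formed; it does not detect a stray single quote character inside the delimiters.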
A few languages have also started using triple quoting, which originated in Python and looks like this:
'''This is John's apple.'''
However, this is not completely general either: you cannot easily embed a run of three quote characters (rarely needed, except when discussing the language itself), nor can you end the content of the literal with the quoting character.
Thus, there is inevitably the need for an escape character, whose meaning is "disregard any special meaning for the immediately following character, and include it as is within the literal." The most commonly used character for this purpose is the backslash "\", a tradition that originated on Unix. Thus, the second quotes-within-quotes example above could be rewritten:
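Python accepts single, double and triple quoting, so it can serve to show that the choice of delimiter does not change the resulting value. The following is a small sketch; the variable names are illustrative only.

```python
# The same string value written with different delimiters.
apple_double = "This is John's apple."       # double quotes allow a bare '
apple_triple = '''This is John's apple.'''   # so do triple quotes

# Single quotes allow bare double quotes inside the literal.
question_single = 'I said, "Can you hear me?"'

# Triple quoting lets both quote styles appear unescaped in one literal.
both_styles = '''John's reply was "yes".'''

print(apple_double == apple_triple)
print(both_styles)
```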
"I said, \"Can you hear me?\""
Naturally, the escape character can also be used to escape itself. This convention is used in a large variety of scripting/programming languages, including C, Perl, Python, JavaScript and Bash.
One minor problem with this convention is that a literal containing many backslash characters becomes hard to read, for example:
"The Windows path is C:\\Foo\\Bar\\abcd"
Again, a few languages follow a convention started in Python, in which a leading character marks a string as "raw" so that backslashes are taken literally:
r"The Windows path is C:\Foo\Bar\abcd"
In Python, even a raw literal cannot end in a single backslash, because the backslash still prevents the following quote from closing the string.
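The backslash-escaping and raw-string conventions can be compared directly in Python. This is a minimal sketch with illustrative variable names; it shows only that the different spellings denote the same values.

```python
# An escaped double quote inside a double-quoted literal...
escaped_quote = "I said, \"Can you hear me?\""
# ...denotes the same value as switching to the other quote style.
alternate_quote = 'I said, "Can you hear me?"'

# Each backslash must be doubled in an ordinary literal,
escaped_path = "C:\\Foo\\Bar\\abcd"
# but is written once in a raw literal.
raw_path = r"C:\Foo\Bar\abcd"

print(escaped_quote == alternate_quote)
print(escaped_path == raw_path)
print(raw_path)
```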
Including Other Specials
Typically, many characters have a special meaning in programming language texts, or may be illegal altogether. Yet there is a need for programs to deal with data containing such characters, and hence a need to represent them in string literals.
Having introduced the concept of an escape character, it becomes possible to extend its meaning somewhat, by following it with certain characters which ordinarily would not be special on their own.
For instance, in a C string literal, if the backslash is followed by a letter such as "b", "n" or "t", then this represents a nonprinting backspace, newline or tab character respectively. Or if the backslash is followed by up to three octal digits, then the sequence is interpreted as the character with the specified ASCII code:
"\042" /* equivalent to "\"" */
This was later extended to allow more modern hexadecimal character code notation:
"\x5C" /* equivalent to "\\" */
Other Quoting Styles
In the original FORTRAN programming language, string literals were written in so-called Hollerith notation, where a decimal count of the number of characters was followed by the letter H, and then the characters of the string:
27HAn example Hollerith string
This had no problems with embedding any representable character, but was error-prone for humans.
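The counting scheme, and the reason it is error-prone, can be sketched in Python. The function names `to_hollerith` and `from_hollerith` are hypothetical helpers for this illustration, not part of any FORTRAN tooling.

```python
def to_hollerith(text: str) -> str:
    """Prefix text with its character count and the letter H."""
    return f"{len(text)}H{text}"


def from_hollerith(source: str) -> str:
    """Read one Hollerith constant from the start of source."""
    digits, marker, rest = source.partition("H")
    if marker != "H" or not digits.isdigit():
        raise ValueError("expected a decimal count followed by H")
    count = int(digits)
    if count > len(rest):
        raise ValueError("count exceeds the available characters")
    # Exactly `count` characters belong to the string, whatever they are.
    return rest[:count]


constant = to_hollerith("An example Hollerith string")
print(constant)
print(from_hollerith(constant))

# A miscounted constant silently loses its last character.
print(from_hollerith("26HAn example Hollerith string"))
```

Because the count alone decides where the string ends, a count that is off by one yields a different string rather than an error.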
In the PostScript programming language, string literals are enclosed in parentheses, with embedded newlines allowed, and also embedded unescaped parentheses provided they are properly paired:
(The quick (brown fox))
The backslash escaping mechanism is also available.
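A program that generates PostScript can sidestep the pairing rule by escaping every parenthesis. The Python sketch below takes that conservative approach; `to_postscript_literal` is a hypothetical helper name, and the sketch assumes the text contains only characters that may appear directly in a PostScript string.

```python
def to_postscript_literal(text: str) -> str:
    """Wrap text in parentheses, escaping backslashes and parentheses."""
    escaped = (
        text.replace("\\", "\\\\")  # escape the escape character first
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    return f"({escaped})"


print(to_postscript_literal("The quick (brown fox"))
print(to_postscript_literal("C:\\Foo"))
```

Escaping balanced parentheses is not required, but it is harmless and avoids having to check whether the parentheses in the input are properly paired.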
Textual Substitution
Some notations start by assuming that everything is a string literal unless otherwise specified. Common examples are macro preprocessors, command-line languages and markup languages.
Thus, in Bash, the command
echo hi there
literally echoes the string "hi there". Quotation marks and backslash can still be used to control the meaning of special characters, such as the use of the $ character to indicate substitution of a variable or expression.
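Python's standard `shlex` module tokenizes text using shell-like quoting rules, which makes the effect of quotation marks and backslashes visible. It is only an approximation of Bash: it splits and quotes words but performs no variable substitution. The variable names are illustrative.

```python
import shlex

# Unquoted words are taken literally and split on whitespace.
plain_words = shlex.split("echo hi there")

# Quotation marks keep embedded spaces inside a single word.
quoted_words = shlex.split("echo 'hi   there'")

# A backslash removes the special meaning of the next character.
escaped_words = shlex.split(r"echo \$HOME")

# shlex.quote wraps a word so that a shell would treat it literally.
protected = shlex.quote("$HOME")

print(plain_words)
print(quoted_words)
print(escaped_words)
print(protected)
```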
Limited Textual Substitution
Some languages (notably Perl and PHP) allow not-quite-pure forms of string literal in which values of variables may be directly substituted. Thus, in Perl, if the variable $name has the value "John", then
print "Hello, $name.\n";
is equivalent to
print "Hello, " . $name . ".\n";
both of which will print
Hello, John.
whereas
print 'Hello, $name.\n';
(note the different quote characters) and
print "Hello, \$name.\n";
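(using the backslash to escape the significance of the "$" character) will both print the variable name unsubstituted (the single-quoted form also prints the two characters "\n" literally instead of a newline):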
Hello, $name.
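A loosely comparable mechanism exists in Python's standard `string.Template` class, which substitutes `$`-prefixed names and uses `$$` rather than a backslash to escape the dollar sign. The sketch below is an analogy to the Perl behaviour described above, not a description of Perl itself.

```python
from string import Template

name = "John"

# $name is replaced by the value supplied to substitute().
substituted = Template("Hello, $name.").substitute(name=name)

# Explicit concatenation produces the same value.
concatenated = "Hello, " + name + "."

# $$ escapes the dollar sign, so no substitution takes place.
escaped = Template("Hello, $$name.").substitute(name=name)

print(substituted)
print(escaped)
```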
Embedding One Language Inside Another
The need to write a program in one language that generates on-the-fly sequences in another language is becoming more common. Examples are:
- generating a PostScript representation of a document for printing purposes, from within a document-processing application written in C or some other language.
- building a Web-based application in Perl, Python, PHP or C, which needs to generate Web pages containing embedded JavaScript on the front end, while querying and updating an SQL database at the back end.
In these situations, it is common to need to include string literals within the generated language text. At the same time, there is usually no (enforceable) restriction on the kinds of characters that may need to be included in such strings; thus, the generating code must be carefully written to escape every character that has special meaning in the generated language, so that the output is neither syntactically invalid nor interpreted as something completely different from what was intended.
This problem is particularly acute in Web-based applications, where malicious users can exploit such weaknesses to subvert the operation of the application, for example by mounting an SQL injection attack.
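The following Python sketch, using the standard `sqlite3` module with an in-memory database, shows how an unescaped string changes the meaning of a generated SQL statement. The table, its contents and the hostile input are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

# Input chosen so that its quote characters end the SQL literal early.
hostile = "nobody' OR '1'='1"

# Unsafe: the input is pasted into the SQL text without escaping.
unsafe_sql = "SELECT name FROM users WHERE name = '" + hostile + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Escaping by doubling each single quote keeps the input inside the literal.
escaped_sql = (
    "SELECT name FROM users WHERE name = '" + hostile.replace("'", "''") + "'"
)
escaped_rows = conn.execute(escaped_sql).fetchall()

# Usually preferable: pass the value separately as a bound parameter.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()

print(len(unsafe_rows))   # 2: every row matches the injected condition
print(len(escaped_rows))  # 0: the input is compared as an ordinary string
print(len(safe_rows))     # 0
```

Bound parameters avoid generating a string literal at all, which is why they are generally recommended over hand-written escaping.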