Apr 16

Download Source Code: CSharpSyntaxHighlighter.zip - 11.54KB

CSharpSyntaxHighlighter Class

CSharpSyntaxHighlighter, a syntax highlighter for the C# programming language, is the first implementation of a highlighter fully described in this magazine. It recognizes C# keywords, type names, comments, and char and string values, and shows each in a different color. We'll use a helper method for each kind of element.

All C# keywords are loaded into the internal _keywords list the first time the class is instantiated. Among them you will recognize statements (for, while, continue), modifiers (public, private, static), aliases of primitive types (short, byte, bool, int), directives (#region, #endregion) and other reserved words. C# is a case-sensitive language, so keywords are matched exactly as written. When AddIdentifier recognizes an identifier as a keyword, it encloses it in a SPAN block with the net_key CSS class name (directives get net_directive instead). In SyntaxHighlighters.css you can see that, by default, net_key is rendered in #0000ff (blue). As a simple rule that holds for many languages, any identifier found outside a comment block or a quoted string value is, in this order, a keyword, a namespace/type/member name, or some other part of the language.
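The lookup order described above can be sketched as follows. This is a minimal illustration, not the class's actual code: TokenClassifier, CssClassFor and the short word lists are hypothetical names and sample data, and we assume an ordinal (case-sensitive) HashSet stands in for the _keywords and _types lists.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the _keywords/_types lookups (assumption).
public static class TokenClassifier
{
    // Ordinal comparison keeps the lookup case-sensitive, as C# itself is.
    private static readonly HashSet<string> Keywords =
        new HashSet<string>(StringComparer.Ordinal)
        {
            "for", "while", "continue", "public", "private", "static",
            "short", "byte", "bool", "int", "#region", "#endregion"
        };

    // A tiny sample of type names, for illustration only.
    private static readonly HashSet<string> Types =
        new HashSet<string>(StringComparer.Ordinal)
        {
            "String", "StringBuilder", "Debug"
        };

    // Returns the CSS class for a token, or an empty string
    // when the token is a plain identifier.
    public static string CssClassFor(string token)
    {
        if (Keywords.Contains(token))
            return token.StartsWith("#", StringComparison.Ordinal)
                ? "net_directive"
                : "net_key";
        if (Types.Contains(token))
            return "net_type";
        return "";
    }
}
```

With this sketch, CssClassFor("for") yields net_key, while CssClassFor("For") yields an empty string, because the comparison is case-sensitive.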

AddIdentifier appends to the _result output buffer the identifier collected so far, formatted according to its type. We do not store this identifier separately; instead we increment _chars each time we parse a new character that is part of it. When _chars is strictly positive, Substring(-_chars) returns the identifier's string value: the last _chars characters before the current _start position in _buffer. When expandable regions are shown, #region and #endregion are processed differently, opening and closing a collapsible block. Any other identifier is HTML-encoded, so that the generated HTML source is correct.

/// <summary>
/// Collect an identifier.
/// Pre-condition: _chars>0
/// </summary>
/// <returns>Formatted HTML for identifier</returns>
private string AddIdentifier()
{
    Debug.Assert(_chars > 0);
    string token = Substring(-_chars);
    _chars = 0;
    Debug.Assert(token.Length > 0);

    // keyword?
    if (_keywords.Contains(token))
    {
        // directive? (#region or #endregion)
        if (token[0] == '#')
        {
            bool region = (token == "#region");
            _end = FindNext(NewLine(), _start, 0);
            token = AddSpan(token, "net_directive")
                + AddHyperlinks(Substring());
            token = (region ? AddCollapsibleBlock(token)
                : token + EndCollapsibleBlock());
            _start = _end;
        }
        else
            token = AddSpan(token, "net_key");
    }

    // type name?
    else if (_types.Contains(token))
        token = AddSpan(token, "net_type");

    // any other identifier is HTML-encoded
    else
        token = Encode(token);

    _result.Append(token);
    return token;
}
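The Substring helper itself is not listed here, so the following is only a guess at how a negative count could be read as "the characters just before the current position". BufferWindow and Slice are hypothetical names, and the buffer and position are passed explicitly instead of being read from _buffer and _start.

```csharp
using System;

// Hypothetical sketch of a "look back" substring helper (assumption).
public static class BufferWindow
{
    // Negative count: the |count| characters ending just before 'start'.
    // Zero or positive count: 'count' characters beginning at 'start'.
    public static string Slice(string buffer, int start, int count)
    {
        return count < 0
            ? buffer.Substring(start + count, -count)
            : buffer.Substring(start, count);
    }
}
```

For the buffer "int x" with the current position at index 3 (the space) and three identifier characters counted, Slice("int x", 3, -3) gives back "int" without the identifier ever having been copied into a separate variable.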

AddCharValue recognizes and shows character values enclosed in single quotes. Starting from the first ', it scans until the next ' is found, skipping any \' inside. In a similar way, and using the same color, AddStringValue parses text within double quotes. Verbatim @ strings may include "" inside, which stands for the quote character itself, so our parser skips each doubled quote and stops at the quote that closes an odd-length run. Regular string values are single-line and start and end with ". An escaped \" inside is skipped, but in \\" the backslash is itself escaped, so that quote does end the string. Some syntax highlighters also recognize numeric values, and anything else regarded as a literal, and show them in the same color. We won't do that in this first iteration of our project.

/// <summary>
/// Collect a character value '...'
/// Pre-condition: current char is '
/// </summary>
/// <returns>Formatted HTML for char value</returns>
private string AddCharValue()
{
    Debug.Assert(Current() == '\'');
    _start++;

    for (_end = _start;
        CharAt(_end) != '\0' && (CharAt(_end) != '\''
        || CharAt(_end - 1) == '\\' && CharAt(_end - 2) != '\\');
        _end++) ;

    string token = Substring();
    _result.Append(AddSpan("'" + Encode(token) + "'", "net_string"));
    _start = _end + 1;
    return token;
}

/// <summary>
/// Collect a string value "..." or verbatim (multi-line) @"..."
/// Pre-condition: starts by " or @"
/// </summary>
/// <returns>Formatted HTML for string value</returns>
private string AddStringValue()
{
    Debug.Assert(Current() == '"'
        || Current() == '@' && Current(1) == '"');
    bool verbatim = (Current() == '@');
    _start++;

    // @"..." verbatim string value continues till "
    // or odd number of consecutive " found
    if (verbatim)
    {
        _start++;
        bool evenQuotes = true;
        for (_end = _start;
            CharAt(_end) != '\0' && (CharAt(_end) != '\"'
            || (evenQuotes = !evenQuotes)
            || CharAt(_end + 1) == '\"');
            _end++) ;
    }
    // "..." string value continues till "
    // not preceded by single \ (\\" is actually \ followed by ")
    else
        for (_end = _start;
            CharAt(_end) != '\0' && (CharAt(_end) != '\"'
            || CharAt(_end - 1) == '\\'
            && CharAt(_end - 2) != '\\');
            _end++) ;

    string token = Substring();
    _result.Append(AddSpan((verbatim ? "@" : "") + "\""
        + AddHyperlinks(token) + "\"", "net_string"));
    _start = _end + 1;
    return token;
}
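To make the two termination rules easier to follow, here is a small standalone sketch. Note that it is a forward-scanning variant, not the look-behind loops used in AddStringValue above: it consumes each escape as it meets it, which also copes with longer backslash runs such as \\\". StringLiteralScanner and its method names are hypothetical, and both methods expect the index of the first character after the opening quote.

```csharp
using System;

// Hypothetical forward-scanning variant of the string rules (assumption).
public static class StringLiteralScanner
{
    // Index of the closing quote of a regular "..." value,
    // or text.Length when the value is unterminated.
    public static int FindRegularEnd(string text, int start)
    {
        int i = start;
        while (i < text.Length && text[i] != '"')
        {
            // a backslash escapes the next character, whatever it is
            i += (text[i] == '\\') ? 2 : 1;
        }
        return Math.Min(i, text.Length);
    }

    // Index of the closing quote of a verbatim @"..." value,
    // or text.Length when the value is unterminated.
    public static int FindVerbatimEnd(string text, int start)
    {
        int i = start;
        while (i < text.Length)
        {
            if (text[i] == '"')
            {
                // "" is an escaped quote: skip both characters
                if (i + 1 < text.Length && text[i + 1] == '"')
                {
                    i += 2;
                    continue;
                }
                return i;
            }
            i++;
        }
        return text.Length;
    }
}
```

Scanning forward trades the compact for-loop conditions of the listing for two explicit branches. Whether that is worth it depends on how often the highlighted code contains unusual escape sequences.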

C# accepts three different kinds of comments, all displayed in the same color, usually a light green. The single-line // comment, collected by AddSingleLineComment, treats the rest of the line as comment text. The multi-line comment, processed by AddMultiLineComment, starts at /* and runs until the next */ is found. Both kinds are similar to those used in C and its derived languages. The .NET-specific documentation comment is similar to the single-line comment, but starts with ///, and its XML tag sequences (any text between < and the next >) appear in a different color. Both the XML tags and the /// prefix are usually shown in a light gray:

/// <summary>
/// Collect single-line comment // ...
/// </summary>
/// <returns>Formatted HTML for comment</returns>
private string AddSingleLineComment()
{
    _start += 2;

    // collects all text from // till end-of-line
    _end = FindNext(NewLine(), _start, 0);
    string token = Substring();

    if (!_showComments)
    {
        // if skip comments, just replace
        // everything with an empty line
        _result.Append(NewLine());
        _end++;
    }
    else
        _result.Append(AddSpan(
            "//" + AddHyperlinks(token), "net_comment"));
    _start = _end + 1;
    return token;
}

/// <summary>
/// Collect multi-line comment /* ... */
/// </summary>
/// <returns>Formatted HTML for comment</returns>
private string AddMultiLineComment()
{
    _start += 2;

    // collects all text from /* till next */
    _end = FindNext("*/", _start, 2);
    string token = Substring();

    if (_showComments)
        _result.Append(AddSpan(
            "/*" + AddHyperlinks(token), "net_comment"));

    _start = _end + 1;
    return token;
}

/// <summary>
/// Collect documentation (single-line) comment /// ...
/// Detect and format <...> documentation tags inside
/// </summary>
/// <returns>Formatted HTML for comment</returns>
private string AddDocumentationComment()
{
    _start += 3;

    // collects all text from /// till end-of-line
    _end = FindNext(NewLine(), _start, 0);
    string token = Substring();

    if (!_showComments)
    {
        // if skip comments,
        // just replace everything with an empty line
        _result.Append(NewLine());
        _end++;
    }
    else
    {
        // /// and <...> XML doc tags inside are shown in light gray
        _result.Append(AddSpan("///", "net_xml"));
        for (int end = _end; _start < end; _start++)
        {
            if (Current() == '<')
            {
                // show regular comment text
                _result.Append(AddSpan(
                    AddHyperlinks(
                        Substring(-_chars)), "net_comment"));
                _chars = 0;

                // show <...> XML tag content
                _end = FindNext(">", _start, 1);
                _result.Append(AddSpan(
                    AddHyperlinks(Substring()), "net_xml"));
                _start = _end - 1;
            }
            else
                _chars++;
            _end = end;
        }

        // show last regular comment text left
        if (_chars > 0)
        {
            _result.Append(AddSpan(
                AddHyperlinks(
                    Substring(-_chars)), "net_comment"));
            _chars = 0;
        }
    }

    _start = _end + 1;
    return token;
}
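The tag detection inside a documentation comment can also be looked at in isolation. The sketch below is a simplified, hypothetical rewrite (DocCommentSplitter and Split are not part of the highlighter): it takes the text that follows /// and returns it as a list of pairs, where the key is the CSS class and the value is the raw text, leaving encoding and SPAN generation to the caller.

```csharp
using System.Collections.Generic;

// Hypothetical standalone version of the <...> tag detection (assumption).
public static class DocCommentSplitter
{
    // Splits the text after "///" into (CSS class, text) pairs:
    // "net_xml" for <...> tags, "net_comment" for everything else.
    public static List<KeyValuePair<string, string>> Split(string comment)
    {
        var parts = new List<KeyValuePair<string, string>>();
        int pos = 0;
        while (pos < comment.Length)
        {
            int open = comment.IndexOf('<', pos);
            if (open < 0)
            {
                // no more tags: the rest is regular comment text
                parts.Add(new KeyValuePair<string, string>(
                    "net_comment", comment.Substring(pos)));
                break;
            }
            if (open > pos)
                parts.Add(new KeyValuePair<string, string>(
                    "net_comment", comment.Substring(pos, open - pos)));

            // a tag runs to the next '>' or, if missing, to end of line
            int close = comment.IndexOf('>', open);
            int end = close < 0 ? comment.Length : close + 1;
            parts.Add(new KeyValuePair<string, string>(
                "net_xml", comment.Substring(open, end - open)));
            pos = end;
        }
        return parts;
    }
}
```

For the input " <summary>Hi</summary>" this yields four parts: a single space and "Hi" as net_comment, and the two tags as net_xml.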

With a helper method for each kind of C# element we highlight, the ProcessThis method override stays simple and compact:

/// <summary>
/// Returns highlighted HTML from unformatted C# text code
/// </summary>
/// <param name="code">C# source code text</param>
/// <returns>highlighted HTML</returns>
protected override string ProcessThis(string code)
{
    _buffer = code.Trim(' ', '\t', '\r', '\n');
    _result = new StringBuilder(_buffer.Length * 2);
    _start = _chars = 0;

    // sequential traversal of input string buffer
    while (_start < _buffer.Length)
    {
        // collected an identifier?
        // (keyword, directive, type, other)
        if (_chars > 0
            && (Char.IsWhiteSpace(Current())
            || Char.IsPunctuation(Current()) && Current() != '#'))
            AddIdentifier();

        // character value? ('...')
        if (Current() == '\'')
            AddCharValue();

        // string value? ("..." or verbatim @"...")
        else if (Current() == '"'
            || Current() == '@' && Current(1) == '"')
            AddStringValue();

        // comment? (single-line, multi-line, documentation)
        else if (Current() == '/'
            && (Current(1) == '*' || Current(1) == '/'))
        {
            if (Current(1) == '*')
                AddMultiLineComment();
            else if (Current(2) == '/')
                AddDocumentationComment();
            else
                AddSingleLineComment();
        }

        // advance/collect next
        // whitespace/punctuation/identifier char
        else
        {
            if (Char.IsWhiteSpace(Current())
                || Char.IsPunctuation(Current()) && Current() != '#')
                _result.Append(Encode(Current().ToString()));
            else
                _chars++;
            _start++;
        }
    }

    // collect last identifier
    if (_chars > 0)
        AddIdentifier();

    // add line numbers,
    // Show/Hide outer collapsible block and embed result
    AddLineNumbers();
    return "<div class=\"sh_result\"><pre>"
        + AddCollapsibleBlock("") + _result.ToString()
        + EndCollapsibleBlock()
        + "</pre></div>";
}

The resulting code essentially consists of many SPAN tags, each wrapping one element and showing its text in a different color. The whole text is first enclosed in a DIV block whose sh_result style you can customize in the style sheet: font name, font size, background color and other display attributes. When collapsible blocks are used, we add a Show/Hide switch the user can use to hide the code from the page. This is particularly useful when the code is long and would take too much space compared with the surrounding text.
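AddSpan and Encode are used throughout the listings but not shown. A minimal sketch of what such a wrapper might look like follows, assuming the standard System.Net.WebUtility.HtmlEncode call for the encoding step. SpanSketch and Span are hypothetical names, and unlike the listings, where callers sometimes encode before wrapping, this sketch always encodes inside the wrapper.

```csharp
using System.Net;

// Hypothetical sketch of a SPAN wrapper (assumption, not the real AddSpan).
public static class SpanSketch
{
    // Wraps HTML-encoded text in a SPAN carrying the given CSS class.
    public static string Span(string text, string cssClass)
    {
        return "<span class=\"" + cssClass + "\">"
            + WebUtility.HtmlEncode(text) + "</span>";
    }
}
```

Calling Span("int", "net_key") produces a SPAN with class net_key around the word int, and a < inside the text comes out as &lt;.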

Formatted code is finally wrapped within a PRE HTML tag, which saves us from translating tabs, new line and carriage return characters, and shows them as they are. Without the PRE tag, we would have had to translate each \n into BR, remove each \r, expand each \t into a fixed number of spaces, and convert the spaces in any run of more than one blank to the HTML NBSP representation. The resulting HTML would have been too long and hard to read.
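To give an idea of the work the PRE tag saves, here is a rough sketch of the translation we would otherwise need. PlainHtmlFormatter, WithoutPre and the tabWidth parameter are hypothetical, and one detail is our own choice: spaces at the start of a line are also turned into NBSP, since a browser would otherwise drop them.

```csharp
using System.Text;

// Hypothetical sketch of the whitespace translation PRE makes unnecessary.
public static class PlainHtmlFormatter
{
    // Expects text that is already HTML-encoded.
    public static string WithoutPre(string encoded, int tabWidth = 4)
    {
        var sb = new StringBuilder(encoded.Length * 2);
        bool atLineStart = true;        // leading blanks would be dropped
        bool previousWasSpace = false;  // repeated blanks would be collapsed
        foreach (char c in encoded)
        {
            switch (c)
            {
                case '\r':
                    break; // carriage returns are simply removed
                case '\n':
                    sb.Append("<br />");
                    atLineStart = true;
                    previousWasSpace = false;
                    break;
                case '\t':
                    for (int i = 0; i < tabWidth; i++)
                        sb.Append("&nbsp;");
                    atLineStart = false;
                    previousWasSpace = true;
                    break;
                case ' ':
                    sb.Append(atLineStart || previousWasSpace ? "&nbsp;" : " ");
                    previousWasSpace = true;
                    break;
                default:
                    sb.Append(c);
                    atLineStart = false;
                    previousWasSpace = false;
                    break;
            }
        }
        return sb.ToString();
    }
}
```

Even for a two-line input such as "a  b\r\n\tc" with a tab width of 2, the output grows to "a &nbsp;b<br />&nbsp;&nbsp;c", which shows how quickly the markup becomes harder to read than the PRE version.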

Check out our Interactive C# Syntax Highlighter, where you can paste your own C# code into a text box and get back the equivalent, highlighted HTML code.
