The development of my programming language scanner has reached a significant milestone: it is now fully operational. Building on the skeleton previously established for single- and two-character lexemes, the scanner now also tokenizes numbers, decimals, strings, keywords, and comments.

(See project commit for details: Commit 9ed6135)

Key Implementation Details and Learnings:

  1. Handling Comments:
    • A crucial peek() function was implemented as a non-destructive lookahead at the next character. It contrasts with advance(), which consumes the character and moves the current pointer forward (a minimal sketch of these helpers appears after this list).
    • For single-line comments (e.g., //), the scanner advances until peek() reports a \n, which marks the end of the line.
    • Similarly, peekNext() inspects the character one position beyond the one peek() returns.
    • Initial challenges arose with peekNext() for multiline comments, which were later resolved.
  2. Scanner’s Ignored Characters:
    • The scanner is designed to disregard common whitespace characters like spaces, tabs, and carriage returns. A line counter is updated upon encountering \n.
    • While ignored for tokenization, the significance of whitespace in other languages is noteworthy. For instance, C’s #include <stdio.h> demands precise spacing, and Python relies on indentation for block structuring, highlighting the varying roles of whitespace across languages.
  3. String Tokenization:
    • Strings are processed by advancing the scanner until a closing " character is reached.
    • The substring containing the string’s content (between the quotes) is extracted, stored as a Java String object, and passed to the addToken() function (string and number scanning are sketched after this list).
  4. Number Tokenization:
    • Characters that satisfy the isDigit() function (0-9) are recognized as part of a number.
    • Decimal numbers are handled by using peek() to check for a . and peekNext() to confirm that a digit follows it.
    • If both conditions are met, the entire lexeme is stored as a NUMBER. If a . is followed by a non-digit, the numeric part is tokenized as NUMBER and the . as DOT.
    • The isAlpha() function is used to distinguish numeric tokens from identifiers.
  5. Keyword Recognition:
    • A Hash Map is employed to map keyword lexemes to their corresponding token types, allowing for quick identification of reserved words (a brief sketch follows this list).
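
To make the peek()/advance() distinction concrete, here is a minimal sketch of the three helpers as they might look in a book-style Java scanner. The field names source and current are assumptions and may differ from the actual implementation:

private char advance() {
    return source.charAt(current++);     // consumes the character and moves the pointer forward
}

private char peek() {
    if (isAtEnd()) return '\0';          // nothing left to look at
    return source.charAt(current);       // inspect without consuming
}

private char peekNext() {
    if (current + 1 >= source.length()) return '\0';
    return source.charAt(current + 1);   // one character beyond peek(), still not consumed
}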
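
The string and number rules described above can be sketched as two small methods. This is only an outline under the same assumptions (fields source, start, and current, plus an addToken() overload that accepts a literal value); the real code may differ in detail:

private void string() {
    while (peek() != '"' && !isAtEnd()) advance();

    if (isAtEnd()) {
        Lox.error(line, "Unterminated string.");
        return;
    }
    advance(); // consume the closing "

    // trim the surrounding quotes and store the contents as a Java String
    String value = source.substring(start + 1, current - 1);
    addToken(STRING, value);
}

private void number() {
    while (isDigit(peek())) advance();

    // only treat '.' as part of the number when a digit follows it
    if (peek() == '.' && isDigit(peekNext())) {
        advance(); // consume the '.'
        while (isDigit(peek())) advance();
    }

    // a '.' with no digit after it is left unconsumed, so the next scan turns it into a DOT token
    addToken(NUMBER, Double.parseDouble(source.substring(start, current)));
}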
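
For keyword recognition, the hash map can be populated once and consulted after an identifier-shaped lexeme has been read. A rough sketch, using only the isAlpha()/isDigit() helpers mentioned above (requires java.util.Map and java.util.HashMap; the exact set of reserved words shown here is illustrative):

private static final Map<String, TokenType> keywords = new HashMap<>();
static {
    keywords.put("and",   AND);
    keywords.put("class", CLASS);
    keywords.put("else",  ELSE);
    keywords.put("if",    IF);
    keywords.put("while", WHILE);
    // ... remaining reserved words
}

private void identifier() {
    while (isAlpha(peek()) || isDigit(peek())) advance();

    String text = source.substring(start, current);
    TokenType type = keywords.get(text);
    if (type == null) type = IDENTIFIER; // not a reserved word: plain identifier
    addToken(type);
}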

(Visual examples of scanner output are included in the original article’s images.)

Addressing a Key Challenge: Multiline Comments

A significant hurdle arose when using peekNext() to parse multiline comments; the approach that had worked so well for decimal numbers did not carry over. My initial attempt, which mirrored the logic for single-line comments, proved ineffective:

case '/':
         else if (match('*')) {
              while (peek() != '*' && peekNext() != '/' && !isAtEnd()) advance();
         }

The core issue was that the loop relied on a single advance() call in its body while the termination check depended entirely on peek() and peekNext(): the current pointer was never moved in a way that actually consumed the closing * and / sequence.

After some investigation, including hints from AI, I realized the multiline comment block has to be consumed character by character, with the termination sequence handled explicitly. The peek() function is solely for inspection, not consumption. When peek() identifies the * and peekNext() identifies the /, advancing once moves past the *, but the / is still unconsumed. To consume both correctly, an explicit advance() for the * and another for the / is needed, or, more robustly, the scanner can advance past the * and then check the new peek() for /.

The corrected logic for multiline comments now appears as:

case '/':
         if (match('/')) {
             while (peek() != '\n' && !isAtEnd()) advance();
         }
         // multiline comments
         else if (match('*')) {
             while (!isAtEnd()) {
                   char ch = advance(); // Advance and get current character
                   if (ch == '\n') line++; // Update line count for newlines within comment
                   if (ch == '*' && peek() == '/') { // If current is '*' and next is '/'
                      advance(); // Consume the '/'
                      return; // Exit multiline comment processing
                   }
             }
             Lox.error(line, "Unterminated comment."); // Error for unclosed comments
         }
         else {
              addToken(SLASH); // If neither, it's just a slash token
         }
         break;

This refined approach ensures that the scanner correctly advances past both the * and / characters that signify the end of a multiline comment, preventing the final / from being erroneously treated as a new token.
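
As a quick sanity check, a snippet like the following can exercise the new branch, assuming the scanner class exposes a scanTokens() method as in the book-style scaffolding (the names here are illustrative):

String source = "/* a comment\n   spanning two lines */ 42 / 7";
Scanner scanner = new Scanner(source);
for (Token token : scanner.scanTokens()) {
    System.out.println(token);
}
// expected: the comment is skipped entirely, the line counter is bumped for the
// embedded newline, and the remaining tokens are NUMBER 42, SLASH, NUMBER 7, then EOF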

Next Steps:

The successful completion of the scanner paves the way for the next phase of development: the Parser!

Reflections:

Overcoming this particular challenge with multiline comments was a rewarding experience, reinforcing the importance of meticulous state management in scanner design. The process of understanding and debugging this complex interaction, much like navigating the vastness of the sea, provides valuable lessons in perseverance and self-reliance. It serves as a humbling reminder of the intricate details involved in software development and the continuous learning journey it entails.
