Poor man's "lexer" for C#

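I'm trying to write a very simple parser in C#.

I need a lexer: something that lets me associate regular expressions with tokens, so that it reads the input, matches it against those regular expressions, and gives me back symbols.

It seems like I ought to be able to use Regex to do the actual heavy lifting, but I can't see an easy way to do it. For one thing, Regex only seems to work on strings, not on streams (why is that!?!?).

Basically, I want an implementation of the following interface: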

using System;
using System.Collections.Generic;
using System.IO;

interface ILexer : IDisposable
{
    /// <summary>
    /// Return true if there are more tokens to read
    /// </summary>
    bool HasMoreTokens { get; }

    /// <summary>
    /// The actual contents that matched the token
    /// </summary>
    string TokenContents { get; }

    /// <summary>
    /// The particular token in "tokenDefinitions" that was matched
    /// (e.g. "STRING", "NUMBER", "OPEN PARENS", "CLOSE PARENS")
    /// </summary>
    object Token { get; }

    /// <summary>
    /// Move to the next token
    /// </summary>
    void Next();
}

interface ILexerFactory
{
    /// <summary>
    /// Create a Lexer for converting a stream of characters into tokens
    /// </summary>
    /// <param name="reader">TextReader that supplies the underlying stream</param>
    /// <param name="tokenDefinitions">A dictionary from regular expressions to their "token identifiers"</param>
    /// <returns>The lexer</returns>
    ILexer CreateLexer(TextReader reader, IDictionary<string, object> tokenDefinitions);
}

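So, pluz send the codz...

No, seriously, I am about to start writing an implementation of the above interface, yet I find it hard to believe that there isn't already some simple way of doing this in .NET (2.0).

So, any suggestions for a simple way to do the above? (Also, I don't want any "code generators". Performance is not important for this, and I don't want to introduce any complexity into the build process.)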

Answer:

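The original version I posted here as an answer had a problem: it only kept going while more than one "Regex" matched the current expression. That is, as soon as only one Regex matched, it returned a token, whereas most people want the Regex to be "greedy". This was especially a problem for things such as "quoted strings".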

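The only solution that sits on top of Regex is to read the input line by line (which means you cannot have tokens that span multiple lines). I can live with that. It is, after all, a poor man's lexer! Besides, it is usually useful to get line number information out of the Lexer in any case.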
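To make the line-by-line idea concrete, here is a minimal sketch that is not part of the original answer. `LineMatcher` and `FindAll` are hypothetical names; the sketch only relies on `TextReader.ReadLine` and `Regex.Matches`, and it reports a 1-based line number with a 0-based column.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the answer): applies a Regex to a TextReader
// by feeding it one line at a time, since Regex itself only accepts strings.
public static class LineMatcher
{
    // Returns (line number, column, matched text) for every match of pattern.
    public static List<Tuple<int, int, string>> FindAll(TextReader reader, string pattern)
    {
        var regex = new Regex(pattern);
        var results = new List<Tuple<int, int, string>>();
        int lineNumber = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            ++lineNumber; // line numbers come for free with this approach
            foreach (Match m in regex.Matches(line))
                results.Add(Tuple.Create(lineNumber, m.Index, m.Value));
        }
        return results;
    }
}
```

Because each line is matched in isolation, a pattern such as `"[^"]*"` finds nothing when the quoted text is split across two lines, which is exactly the limitation described above.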

So, here is a new version that addresses these issues. Credit also goes to this

using System;
using System.IO;
using System.Text.RegularExpressions;

public interface IMatcher
{
    /// <summary>
    /// Return the number of characters that this "regex" or equivalent
    /// matches.
    /// </summary>
    /// <param name="text">The text to be matched</param>
    /// <returns>The number of characters that matched</returns>
    int Match(string text);
}

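sealed class RegexMatcher : IMatcher
{
    private readonly Regex regex;

    // Anchor the pattern at the start of the text. The non-capturing group keeps
    // "^" applied to the whole pattern even if it contains a top-level "|".
    public RegexMatcher(string regex) => this.regex = new Regex(string.Format("^(?:{0})", regex));

    // Returns the length of the match at the start of text, or 0 if there is none.
    public int Match(string text)
    {
        var m = regex.Match(text);
        return m.Success ? m.Length : 0;
    }

    public override string ToString() => regex.ToString();
}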

public sealed class TokenDefinition
{
    public readonly IMatcher Matcher;
    public readonly object Token;

    public TokenDefinition(string regex, object token)
    {
        this.Matcher = new RegexMatcher(regex);
        this.Token = token;
    }
}

public sealed class Lexer : IDisposable
{
    private readonly TextReader reader;
    private readonly TokenDefinition[] tokenDefinitions;

    // Unconsumed remainder of the current line; null once the input is exhausted.
    private string lineRemaining;

    public Lexer(TextReader reader, TokenDefinition[] tokenDefinitions)
    {
        this.reader = reader;
        this.tokenDefinitions = tokenDefinitions;
        nextLine();
    }

    // Advance to the next non-empty line, skipping blank lines.
    private void nextLine()
    {
        do
        {
            lineRemaining = reader.ReadLine();
            ++LineNumber;
            Position = 0;
        } while (lineRemaining != null && lineRemaining.Length == 0);
    }

    // Read the next token. Returns false at end of input and throws if
    // no definition matches the remaining text.
    public bool Next()
    {
        if (lineRemaining == null)
            return false;

        // Definitions are tried in order; the first one that matches wins.
        foreach (var def in tokenDefinitions)
        {
            var matched = def.Matcher.Match(lineRemaining);
            if (matched > 0)
            {
                Position += matched;
                Token = def.Token;
                TokenContents = lineRemaining.Substring(0, matched);
                lineRemaining = lineRemaining.Substring(matched);
                if (lineRemaining.Length == 0)
                    nextLine();
                return true;
            }
        }

        throw new Exception(string.Format("Unable to match against any tokens at line {0} position {1} \"{2}\"",
            LineNumber, Position, lineRemaining));
    }

    public string TokenContents { get; private set; }
    public object Token { get; private set; }
    public int LineNumber { get; private set; }
    public int Position { get; private set; }

    public void Dispose() => reader.Dispose();
}
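`Next()` calls `Substring` twice per token, which copies the rest of the line each time. That is fine for a poor man's lexer, but if it ever matters, the `Regex.Match(string, int)` overload combined with the `\G` anchor can match at an offset without copying. A minimal sketch, assuming a hypothetical `OffsetRegexMatcher` class that is not part of the answer above:

```csharp
using System.Text.RegularExpressions;

// Hypothetical alternative to RegexMatcher (an assumption, not the answer's code):
// matches at a given offset instead of requiring the caller to cut the string.
public sealed class OffsetRegexMatcher
{
    private readonly Regex regex;

    public OffsetRegexMatcher(string pattern)
    {
        // \G is satisfied at the startat index passed to Regex.Match,
        // so the pattern cannot skip ahead to a later position.
        regex = new Regex(@"\G(?:" + pattern + ")");
    }

    // Returns the length of the match that begins exactly at start, or 0.
    public int Match(string text, int start)
    {
        Match m = regex.Match(text, start);
        return m.Success ? m.Length : 0;
    }
}
```

A lexer built on this would keep an integer offset into the current line instead of reassigning `lineRemaining` after every token.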

Sample program:

string sample = @"( one (two 456 -43.2 "" \"" quoted"" ))";

var defs = new TokenDefinition[]
{
    // Thanks to Steven Levithan for this great quoted string regex
    new TokenDefinition(@"([""'])(?:\\\1|.)*?\1", "QUOTED-STRING"),
    // Thanks to http://www.regular-expressions.info/floatingpoint.html
    new TokenDefinition(@"[-+]?\d*\.\d+([eE][-+]?\d+)?", "FLOAT"),
    new TokenDefinition(@"[-+]?\d+", "INT"),
    new TokenDefinition(@"#t", "TRUE"),
    new TokenDefinition(@"#f", "FALSE"),
    new TokenDefinition(@"[*<>\?\-+/A-Za-z->!]+", "SYMBOL"),
    new TokenDefinition(@"\.", "DOT"),
    new TokenDefinition(@"\(", "LEFT"),
    new TokenDefinition(@"\)", "RIGHT"),
    new TokenDefinition(@"\s", "SPACE")
};

TextReader r = new StringReader(sample);
Lexer l = new Lexer(r, defs);
while (l.Next())
    Console.WriteLine("Token: {0} Contents: {1}", l.Token, l.TokenContents);

Output:

Token: LEFT Contents: (
Token: SPACE Contents:
Token: SYMBOL Contents: one
Token: SPACE Contents:
Token: LEFT Contents: (
Token: SYMBOL Contents: two
Token: SPACE Contents:
Token: INT Contents: 456
Token: SPACE Contents:
Token: FLOAT Contents: -43.2
Token: SPACE Contents:
Token: QUOTED-STRING Contents: " \" quoted"
Token: SPACE Contents:
Token: RIGHT Contents: )
Token: RIGHT Contents: )
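One thing the sample relies on: `Next()` returns the first definition that matches, so FLOAT has to be listed before INT. With the two swapped, `-43.2` would come out as INT `-43` followed by FLOAT `.2`. If ordering the definitions by hand gets fragile, a longest-match rule is one alternative. The following is only a sketch under that assumption and not part of the answer; `LongestMatch` and `Pick` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (an assumption, not the answer's code): picks the
// definition that matches the longest prefix instead of the first that matches.
public static class LongestMatch
{
    // defs maps a regex pattern (Key) to its token name (Value).
    // Returns (token, matched length), or null if nothing matches.
    public static Tuple<string, int> Pick(string text, IList<KeyValuePair<string, string>> defs)
    {
        Tuple<string, int> best = null;
        foreach (var def in defs)
        {
            Match m = Regex.Match(text, "^(?:" + def.Key + ")");
            // Strict ">" means that on a tie the earlier definition still wins.
            if (m.Success && m.Length > 0 && (best == null || m.Length > best.Item2))
                best = Tuple.Create(def.Value, m.Length);
        }
        return best;
    }
}
```

With INT deliberately listed ahead of FLOAT, `Pick("-43.2 )", defs)` still selects FLOAT, because its five-character match is longer than INT's three.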

Source: utcz.com/qa/423478.html
