C#中的模糊文本(句子/标题)匹配

嘿,我正在使用Levenshteins算法来获取源字符串和目标字符串之间的距离。

我也有方法返回值从0到1:

/// <summary>

/// Gets the similarity between two strings.

/// All relation scores are in the [0, 1] range,

/// which means that if the score gets a maximum value (equal to 1)

/// then the two string are absolutely similar

/// </summary>

/// <param name="string1">The string1.</param>

/// <param name="string2">The string2.</param>

/// <returns></returns>

public static float CalculateSimilarity(String s1, String s2)

{

if ((s1 == null) || (s2 == null)) return 0.0f;

float dis = LevenshteinDistance.Compute(s1, s2);

float maxLen = s1.Length;

if (maxLen < s2.Length)

maxLen = s2.Length;

if (maxLen == 0.0F)

return 1.0F;

else return 1.0F - dis / maxLen;

}

但这对我来说还不够。因为我需要更复杂的方式来匹配两个句子。

例如,我想自动标记一些音乐,我拥有原始歌曲名称,并且我有带有垃圾邮件的歌曲,例如 super,quality, 诸如 2007、2008等

年份。等等。还有一些文件只有http:// trash。

.thash..song_name_mp3.mp3,其他都是正常的。我想创建一种算法,该算法现在比我的算法更完美。.也许有人可以帮助我吗?

这是我目前的算法:

/// <summary>

/// if we need to ignore this target.

/// </summary>

/// <param name="targetString">The target string.</param>

/// <returns></returns>

private bool doIgnore(String targetString)

{

if ((targetString != null) && (targetString != String.Empty))

{

for (int i = 0; i < ignoreWordsList.Length; ++i)

{

//* if we found ignore word or target string matching some some special cases like years (Regex).

if (targetString == ignoreWordsList[i] || (isMatchInSpecialCases(targetString))) return true;

}

}

return false;

}

/// <summary>

/// Removes the duplicates.

/// </summary>

/// <param name="list">The list.</param>

private void removeDuplicates(List<String> list)

{

if ((list != null) && (list.Count > 0))

{

for (int i = 0; i < list.Count - 1; ++i)

{

if (list[i] == list[i + 1])

{

list.RemoveAt(i);

--i;

}

}

}

}

/// <summary>

/// Does the fuzzy match.

/// </summary>

/// <param name="targetTitle">The target title.</param>

/// <returns></returns>

private TitleMatchResult doFuzzyMatch(String targetTitle)

{

TitleMatchResult matchResult = null;

if (targetTitle != null && targetTitle != String.Empty)

{

try

{

//* change target title (string) to lower case.

targetTitle = targetTitle.ToLower();

//* scores, we will select higher score at the end.

Dictionary<Title, float> scores = new Dictionary<Title, float>();

//* do split special chars: '-', ' ', '.', ',', '?', '/', ':', ';', '%', '(', ')', '#', '\"', '\'', '!', '|', '^', '*', '[', ']', '{', '}', '=', '!', '+', '_'

List<String> targetKeywords = new List<string>(targetTitle.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries));

//* remove all trash from keywords, like super, quality, etc..

targetKeywords.RemoveAll(delegate(String x) { return doIgnore(x); });

//* sort keywords.

targetKeywords.Sort();

//* remove some duplicates.

removeDuplicates(targetKeywords);

//* go through all original titles.

foreach (Title sourceTitle in titles)

{

float tempScore = 0f;

//* split orig. title to keywords list.

List<String> sourceKeywords = new List<string>(sourceTitle.Name.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries));

sourceKeywords.Sort();

removeDuplicates(sourceKeywords);

//* go through all source ttl keywords.

foreach (String keyw1 in sourceKeywords)

{

float max = float.MinValue;

foreach (String keyw2 in targetKeywords)

{

float currentScore = StringMatching.StringMatching.CalculateSimilarity(keyw1.ToLower(), keyw2);

if (currentScore > max)

{

max = currentScore;

}

}

tempScore += max;

}

//* calculate average score.

float averageScore = (tempScore / Math.Max(targetKeywords.Count, sourceKeywords.Count));

//* if average score is bigger than minimal score and target title is not in this source title ignore list.

if (averageScore >= minimalScore && !sourceTitle.doIgnore(targetTitle))

{

//* add score.

scores.Add(sourceTitle, averageScore);

}

}

//* choose biggest score.

float maxi = float.MinValue;

foreach (KeyValuePair<Title, float> kvp in scores)

{

if (kvp.Value > maxi)

{

maxi = kvp.Value;

matchResult = new TitleMatchResult(maxi, kvp.Key, MatchTechnique.FuzzyLogic);

}

}

}

catch { }

}

//* return result.

return matchResult;

}

这正常工作,但只是在某些情况下,许多标题应该匹配,不匹配…我想我需要某种公式来处理权重等,但我想不到。

有想法吗?有什么建议吗?阿尔戈斯?

顺便说一句,我已经知道了这个主题(我的同事已经发布了这个主题,但是我们不能为这个问题提供适当的解决方案。)

回答:

您的问题可能是区分干扰词和有用数据:

  • Rolling_Stones.Best_of_2003.Wild_Horses.mp3
  • Super.Quality.Wild_Horses.mp3
  • Tori_Amos.Wild_Horses.mp3

您可能需要制作一个干扰词字典来忽略。这似乎很笨拙,但是我不确定是否有一种算法可以区分乐队/专辑名称和噪音。

以上是 C#中的模糊文本(句子/标题)匹配 的全部内容, 来源链接: utcz.com/qa/415624.html

回到顶部