Forum

> > CS2D > Scripts > return string diffrence in %
Forums overviewCS2D overview Scripts overviewLog in to reply

English return string diffrence in %

14 replies
To the start Previous 1 Next To the start

old return string diffrence in %

EnderCrypt
User Off Offline

Quote
could someone help me make a function that return in percent how much difference there is at 2 strings

i could just make so it compares ever letter but if theres 1 letter wrong at beginning for example then it would be different at nearly all letters after it

it doesn't need to return EXACT difference in percent, just somewhat correct

anyone got any idea?

old Re: return string diffrence in %

archmage
User Off Offline

Quote
Why not just get the number of differences between both strings?
More >

old Re: return string diffrence in %

archmage
User Off Offline

Quote
ofc it will return 2. 1 (the difference in the length of the strings) + 1 (the number of characters that are different)

If you do not want it to add the difference in length change
1
local error = math.abs(...)
to
1
local error = 0
edited 1×, last 19.06.11 08:04:31 pm

old Re: return string diffrence in %

EnderCrypt
User Off Offline

Quote
hm i dont think it gets better...

any other ideas

only i wanna do is to compare chat (hook "say") with list, but it doesn't need to be EXACT like a word to be recognized

oh wait maybe i.. 1 sec

old Re: return string diffrence in %

Lee
Moderator Off Offline

Quote
There are two algorithms that satisfy what you're looking for. One is semantic while the other one is quantitative.

Diff:
http://luaforge.net/projects/diffmatchpatch/
Returns the semantic difference between two strings. You can compute the percent difference by dividing the total difference into the total length.

Needleman-Wunsch:
http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
Devised for genomic alignment, under an identity matrix with 0 gap penalty (assuming that the gap is part of the character set), this algorithm can also be used to align strings of text. This is different from diff in that the output is the optimization of what you're looking for. IE: It generates the complement so that the string difference % (compared character to character) is the absolute smallest.

There doesn't seem to be any current lua implementations of this algorithm, but I do have a reference copy that I wrote two years ago that performs in linear time complexity, I'll see if I can't dig it up.

old Re: return string diffrence in %

Lee
Moderator Off Offline

Quote
I couldn't find it so I rewrote it from scratch today.

https://gist.github.com/1040989

After which
1
2
3
align("I am the very model of a modern Major-General,", "I am the very model of a cartoon individual,")

align("I've information vegetable, animal, and mineral,", "My animation's comical, unusual, and whimsical,")

outputs

1
2
3
4
5
I am the very model of a -modern Major-General,
I am the very model of a cartoon ---individual,

I've information-- vegetable, an-imal, and -mi-neral,
--My an--imation's comica-l-, unusual, and whimsical,

You can just find the differences between the pair of returned string, or you can do the calculation during alignment.

https://gist.github.com/1041007

So for example, if we were to find the percent difference between two very similar strings, we would expect a low value.

1
2
3
percent_difference("I am the very model of a modern Major-General,", "I am the very model of a cartoon individual,")

returns 0.085106382978723, or only 8.51% difference

while the dissimilar strings give us larger differences

1
2
3
percent_difference("I've information vegetable, animal, and mineral,", "My animation's comical, unusual, and whimsical,")

returns 0.20754716981132, or over 20% difference

old Re: return string diffrence in %

Lee
Moderator Off Offline

Quote
http://www.lua.org/demo.html

and if you're really pressed for speed and are expecting strings of over 1000 words to be parsed in under a second, here's it in c++

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
#include <iostream>
#include <cmath>
#include <string>
#include <algorithm>
#include <lua.h>

using namespace std;

// Find the directory to the lua root and substitute for C:\lua
//g++ --shared -I"C:\lua\include" nw.cpp "C:\lua\lib\lua51.lib" -o nw.dll


enum NW_PENALTIES{
	MATCH = 1,
	NOMATCH = -1,
	GAP = -1
};

enum NW_DIR{
	mch,
	hrz,
	vrt
};

int match(char a, char b){
	return (a==b) ? MATCH : NOMATCH;
}

int* select(int o, int top, int left){
	int* _ret = (int*)malloc(2*sizeof(int));
	_ret[0] = hrz;_ret[1] = left;
	if (o>= top && o>= left){
		_ret[0] = mch;_ret[1] = o;
		return _ret;
	}else if (top >= left){
		_ret[0] = vrt;_ret[1] = top;
		return _ret;
	}
	return _ret;
}

int*** makeMatrix(string A, string B){
	int a = A.length()+1, b = B.length()+1;
	int*** _matrix = (int***)malloc(a*sizeof(int));

	for (int i=0;i<a;i++){
		_matrix[i] = (int**)malloc(b*sizeof(int));
		for (int j=0;j<b;j++){
			_matrix[i][j] = (int*)malloc(2*sizeof(int));
			if (i == 0)
				_matrix[i][j][0] = hrz,
				_matrix[i][j][1] = j*NOMATCH;
			else if (j == 0)
				_matrix[i][j][0] = vrt,
				_matrix[i][j][1] = i*NOMATCH;
			else
				_matrix[i][j][0] = 0,
				_matrix[i][j][1] = 0;
		}
	}
	for (int i=1;i<a;i++){
		for (int j = 1; j<b; j++){
			int m, v, h;
			// m = i-1, j-1
			// v = i-1, j
			// h = i,   j-1
			m = (_matrix)[i-1][j-1][1] + match(A[i-1],B[j-1]);
			v = (_matrix)[i-1][j][1] + GAP;
			h = (_matrix)[i][j-1][1] + GAP;
			_matrix[i][j] = select(m,v,h);

		}
	}
	return _matrix;
}

int numtrace(int*** _matrix, string A, string B){
	int i__ = A.length(), j__ = B.length();
	int n = 0;
	while (i__>0 && j__ >0){
		int dir = _matrix[i__][j__][0];
			switch(dir){
				case 0:
					i__--;j__--;
					break;
				case 1:
					j__--;
					break;
				case 2:
					i__--;
					break;
			}
		n++;
	}
	return n+max(i__,j__);
}

int trace(int*** _matrix, string A, string B, int* num){
	int n = numtrace(_matrix, A, B);
	*num = n;
	int i = A.length(), j = B.length();
	int mutations = 0;
	while (i>0 || j >0){
		int dir = _matrix[i][j][0];
		switch(dir){
			case 0:
				if (A[i-1] != B[j-1]) mutations++;
				i--;j--;
				break;
			case 1:
				j--;
				mutations++;
				break;
			case 2:
				i--;
				mutations++;
				break;
		}
		n--;
	}
	return mutations;
}

char** trace_align(int*** _matrix, string A, string B){

	char** sequences = (char**)malloc(2*sizeof(char));
	int n = numtrace(_matrix, A, B);
	sequences[1] = (char*)malloc(n*sizeof(char)); // +1 to allocate for string type
	sequences[0] = (char*)malloc(n*sizeof(char));

	int i = A.length(), j = B.length();
	int mutations = 0;
	while (i>0 || j >0){
		int dir = _matrix[i][j][0];
		cout << dir << " " << n << "\n";
		switch(dir){
			case 0:
				// back both ways
				//cout << i << " , " << j << "\t" << A[i-1] << " " << B[j-1] << endl;
				sequences[1][n-1] = A[i-1];
				sequences[0][n-1] = B[j-1];
				if (A[i-1] != B[j-1]) mutations++;
				i--;j--;
				break;
			case 1:
				// left
				//cout << i << " , " << j << "\t" << "-" << " " << B[j-1] << endl;
				sequences[1][n-1] = '-';
				sequences[0][n-1] = B[j-1];
				j--;
				mutations++;
				break;
			case 2:
				// up
				//cout << i << " , " << j << "\t" << A[i-1] << " " << "-" << endl;
				sequences[1][n-1] = A[i-1];
				sequences[0][n-1] = '-';
				i--;
				mutations++;
				break;
		}
		n--;
	}
	return sequences;
}

int mutate_(string A, string B, int* n){
	int*** matrix = makeMatrix(A,B);
	int seqs = trace(matrix,A,B, n);
	return seqs;
}

char** align_(string A, string B){
	int*** matrix = makeMatrix(A,B);
	char** seqs = trace_align(matrix,A,B);
	return seqs;
}

static int l_align(lua_State *l){
	char** align = align(luaL_checkstring(l, 1), luaL_checkstring(l, 2));
	lua_pushstring(align[0]);
	lua_pushstring(align[1]);
	return 2;
}

static int l_mutate(lua_State *l){
	int n = 0;
	int mutations = mutate_(luaL_checkstring(l, 1), luaL_checkstring(l, 2), &n);
	lua_pushinteger(l, mutations);
	lua_pushinteger(l, n);
	return 2
}

static const luaL_Reg new_reg[] =
{
	{ "align",	l_align},
	{ "percent_difference", l_mutate},
	{ NULL,		NULL }
};


LUALIB_API int luaopen_nw(lua_State *l){
	luaL_register(l,"nw",nw_reg);
	return 1;
}
edited 1×, last 22.06.11 11:15:06 pm

old Re: return string diffrence in %

EnderCrypt
User Off Offline

Quote
... Impressing... But, hows that gonna help , i mean c++
btw is it possible to decrease exactness* of the function to get more speed

Cause i only need to know if its over like 80-90% correct or somheting


*somheting

old Re: return string diffrence in %

Banaan
User Off Offline

Quote
C++ is a lot faster because you run the code when it's already compiled (you have to build an exe or a dll from this, so it's in a format a computer can understand). If you run Lua, or pretty much any non-C language, the code has to be passed through an interpreter first. This interpreter will parse the Lua code into a format readable for a computer during runtime, which slows the process down.

In other words, maximum speed can only be reached if you compile the code first, which can be done with C++.


I hope what I said there was a little correct; I don't know that much about how programming languages exactly work.

old Re: return string diffrence in %

Flacko
User Off Offline

Quote
Well, Lua compiles it's code into lua code (raw operations and instructions for the lua interpreter) once before running (it usually takes a second or two) and this lua code will probably never beat machine-code such as the one that C++ compilers generate.

The reason for being slower than C/C++ is probably because it's a language built over a C stack so it works with pointer operations for adding/removing values from the stack (not to mention that values are usually being copied instead of being passed by reference).

old Re: return string diffrence in %

Lee
Moderator Off Offline

Quote
The C++ code I posted, when compiled using this command

1
2
rem Find the directory to the lua root and substitute for C:\lua
g++ --shared -I"C:\lua\include" nw.cpp "C:\lua\lib\lua51.lib" -o nw.dll

will output a lua compatible extension that using

1
2
package.cpath = package.cpath .. ";lib/?.dll"
require "nw"

will create a table in the global namespace with two functions, align and percent_difference, each with functionality detailed from above. The main difference here is that we create a matrix in contiguous space, hence helping locality and reducing cache misses and speeding up the matrix operations significantly, seeing as Lua only holds pointers on the stack. At the same time we reduce the overhead associated with lua's runtime.
To the start Previous 1 Next To the start
Log in to reply Scripts overviewCS2D overviewForums overview