Java Programming [Archive] - Regular Expressions
This topic has 8 replies on 1 page.

Posts:9
Registered: 7/26/04
Regular Expressions  
Jul 28, 2004 9:54 AM



 
I'm definitely not a regexp expert! ;)

Here's my problem.

I'm working on a string that looks like :
sin(1)+12.22-x2+x3+1/x4+1*17+56*ln(32+sin(26*x1))+exp(58)+1+ln(12)+sin(26)+pow(2,3)


etc etc ....

This string is built by a user, and I have to rewrite it so that the JVM can understand it.

x1 ... x5 are arguments of type Double.

I already have some regexps that transform ln to Math.log, sin to Math.sin, etc. This is working perfectly, thanks to the help of some guys from this forum ;)

But as I'm working with java.lang.Math, I must use Double everywhere, especially with functions like pow(): indeed, if I do something like pow(2,3), Java throws an exception (normal behaviour?).

So I must transform all numbers like 2 and 12558651 into 2.0 and 12558651.0. But I mustn't change 12.36 into 12.36.0, .45 into .45.0 or x1 into x1.0.

I made this regexp :
System.out.println("x1 : " + "x1".matches("(?<!\\.|x)[0-9]+[^\\.]"));           //--> false
System.out.println("12 : " + "12".matches("(?<!\\.|x)[0-9]+[^\\.]"));           //--> true
System.out.println(".12 : " + ".12".matches("(?<!\\.|x)[0-9]+[^\\.]"));         //--> false
System.out.println("1245.12 : " + "1245.12".matches("(?<!\\.|x)[0-9]+[^\\.]")); //--> false

So I thought I had found the right regexp! Actually not! :(

I'm using code that looks like this:
Pattern p = Pattern.compile(myPattern);
Matcher m = p.matcher(a);
StringBuffer buf = new StringBuffer(myChain);
int pos = 0;
while (true) {
    if (m.find(pos)) {
        buf.insert(m.end() - 1, ".0");
        pos = m.end() + m.group().length() - 1;
        System.err.println("i : " + (++i) + " - " + buf.toString());
        System.err.println("I found the text \"" + m.group() +
                "\" starting at index " + m.start() +
                " and ending at index " + m.end() + " --> pos : " + pos);

        m = p.matcher(buf.toString());
    } else {
        a = buf.toString();
        break;
    }
}

The string I test is :
pow(2,3)+sin(1)+12.22-x2+x3+1/x4+1*17+56*ln(32+sin(26*x1))+exp(58)+1+ln(12)+sin(26)


Here's the result if I use the pattern (?<!\\.|x)[0-9]+[^\\.] :
pow(2.0,3.0)+sin(1.0)+1.02.22.0-x2+x3+1.0/x4+1.0*17.0+56.0*ln(32.0+sin(26.0*x1))+exp(58.0)+1.0+ln(12.0)+sin(26.0)

This pattern has a big problem with numbers like 78.69 or 1.8963: it transforms 12.22 into 1.02.22.0!
But the rest is "perfect".


If I use (?<!\\.|x)[0-9]+[^\\.0-9], here's the result :
pow(2.0,3.0)+sin(1.0)+12.22.0-x2+x3+1.0/x4+1.0*17.0+56.0*ln(32.0+sin(26.0*x1))+exp(58.0)+1.0+ln(12.0)+sin(26.0)

There's still a mistake, as it changes 12.22 into 12.22.0 ....

Another pattern I tried was : \\b(?<!\\.|x)[0-9]+[^\\.0-9]\\b
Here's the result :
pow(2.0,3)+sin(1)+12.22-x2+x3+1.0/x4+1.0*17.0+56.0*ln(32.0+sin(26.0*x1))+exp(58)+1.0+ln(12)+sin(26)

Here, there are no more problems with 12.22, but there are with some integers (3, 1, 58, 12 and 26 are left unchanged) ..

Can anyone help me point out my mistakes?
Maybe I shouldn't use regexp ...
Tonight, I promise, I'll buy a book for mastering regexps on Amazon ... :)
 

Posts:9
Registered: 7/26/04
Re: Regular Expressions  
Jul 28, 2004 10:14 AM (reply 1 of 8)



 
Bored, I made it that way :
Pattern p = Pattern.compile("\\b(?<!\\.)[0-9]+\\b");
Matcher m = p.matcher(myChain);
StringBuffer buf = new StringBuffer(myChain);
int pos = 0;
while (true) {
    if (m.find(pos)) {
        buf.insert(m.start(), "(double)");
        pos = m.end() + 9;
        m = p.matcher(buf);
    } else {
        myChain = buf.toString();
        m = null;
        p = null;
        buf = null;
        break;
    }
}

But if anyone has a solution for the problem I described before .... ;)
 

Posts:9
Registered: 7/26/04
Re: Regular Expressions  
Jul 28, 2004 11:36 AM (reply 2 of 8)



 
any idea ?
 

Posts:27,518
Registered: 11/3/97
Re: Regular Expressions  
Jul 28, 2004 12:02 PM (reply 3 of 8)



 
any idea ?

You can't write a single regular expression that will parse the syntax that you are suggesting. This is proven in "Mastering Regular Expressions" if you wish to verify.

You need to write a parser instead. You can use something like JavaCC (or other parser type tools) or build your own (perhaps using some of the ideas presented in the first couple of chapters of "Compilers" by Aho or some other compiler theory source documentation.)
 

Posts:2,391
Registered: 9/26/00
Re: Regular Expressions  
Jul 28, 2004 7:59 PM (reply 4 of 8)



 
  myChain = myChain.replaceAll("(?<![\\w.])\\d++(?!\\.)", "$0.0");
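A minimal sketch of how that one-liner might be exercised against the test string from the first post. The class name IntegerLiteralDemo and the method toDoubleLiterals are hypothetical names chosen for illustration; only the pattern and the "$0.0" replacement come from the post.

```java
import java.util.regex.Pattern;

final class IntegerLiteralDemo {
    // The pattern from the post: digits not preceded by a word character
    // or a dot, matched possessively, and not followed by a dot.
    private static final Pattern INTEGER =
            Pattern.compile("(?<![\\w.])\\d++(?!\\.)");

    // Hypothetical helper: appends ".0" to every integer literal found.
    static String toDoubleLiterals(String expression) {
        // $0 is the whole match; the trailing ".0" is literal text.
        return INTEGER.matcher(expression).replaceAll("$0.0");
    }

    public static void main(String[] args) {
        String input = "pow(2,3)+sin(1)+12.22-x2+x3+1/x4+1*17"
                + "+56*ln(32+sin(26*x1))+exp(58)+1+ln(12)+sin(26)";
        System.out.println(toDoubleLiterals(input));
    }
}
```

With this input, 12.22 and x1 ... x4 should come out unchanged, while 2, 3, 17, 56 and the other bare integers gain a ".0".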
 

Posts:2,391
Registered: 9/26/00
Re: Regular Expressions  
Jul 29, 2004 3:55 AM (reply 5 of 8)



 
I didn't have time for the explanation earlier, so here it is.

(BTW, don't get a regex book, get the regex book: [url=http://www.amazon.com/exec/obidos/ASIN/0596002890/masteringregu-20]MRE2[/url]. Nothing else comes close.)

When crafting a regex, it helps to state the problem as precisely as you can. You want to match all the integers in the input string, and you've defined an integer as (1) a sequence of digits that (2) isn't preceded by an 'x' or a dot and (3) isn't followed by a dot. Translating that into regex, (1) obviously becomes [0-9]+ (or its equivalent, \\d+). For (2), "isn't preceded by" suggests negative lookbehind: (?<![.x]) (a character class is more efficient than alternation, and you don't have to escape the dot).

Similarly, the "isn't followed by" in (3) suggests negative lookahead - (?!\\.) - but you used a negated character class instead ([^.], originally). Instead of just looking at the next character, you went ahead and matched it. That isn't necessarily an error, but it made things more difficult than they needed to be. It advanced the match position one space beyond the end of the integer, which complicated the replacement part of the task and screwed you up later when you tried to refine the regex (the \\b should have gone before the character class, not after -- it wouldn't have helped, but putting it where you did was definitely an error).

But the trickiest part of the problem is not (2) or (3), it's (1). When we say we want to match a sequence of digits, it goes without saying that we mean the whole sequence: if there are six digits in a row, we want to match all six of them, or nothing. But when you're writing a regex, you often do have to say that kind of thing, and it's not always easy to get the point across. Consider what happens when the regex (?<![.x])\\d+(?!\\.) is applied to "12.22". The negative lookbehind gives the thumbs-up, and \\d+ consumes the first two digits, but the next character is a dot, so the negative lookahead fails. Trying to salvage the overall match, \\d+ gives up one of its digits, and (?!\\.) tries again, this time between the '1' and the '2'. There it succeeds, and an overall match is achieved (incorrectly) on the string "1".

On the next match attempt, the regex engine tries to match at the first '2' again and fails, then tries and fails to match at the decimal point. At the second '2', the negative lookbehind correctly prevents a match, so one more bump-along occurs. At the next position, the lookbehind sees nothing to object to, so \\d+ consumes the final '2', the lookahead is also happy, and another incorrect match is achieved.
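The two incorrect matches traced above can be reproduced with a short sketch. The class BacktrackingDemo and its findAll helper are hypothetical names; the regex is the one under discussion, written as a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BacktrackingDemo {
    // Hypothetical helper: lists every match as "text@startIndex".
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group() + "@" + m.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Greedy \d+ gives back a digit so that (?!\.) can succeed,
        // which yields "1" at index 0 and "2" at index 4.
        System.out.println(findAll("(?<![.x])\\d+(?!\\.)", "12.22"));
        // prints [1@0, 2@4]
    }
}
```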

The moral is, if you want to match only whole sequences of digits, you have to ensure that what you match is neither preceded nor followed by digits. That's effectively what you did when you added the \\b to the beginning of your regex and changed the last part to [^.0-9] (in fact, I think your code would have worked if you hadn't also added that \\b at the end). But it's much neater just to include digits in the lookarounds that are already there (in my version, that is): (?<![.x\\d])\\d+(?![.\\d]).
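As a rough comparison, the sketch below applies both the earlier regex and the one with digits added to the lookarounds, each with a "$0.0" replacement. LookaroundDemo, NAIVE, STRICT and appendPointZero are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

final class LookaroundDemo {
    // Lookarounds that ignore neighbouring digits.
    static final Pattern NAIVE = Pattern.compile("(?<![.x])\\d+(?!\\.)");
    // Lookarounds that also reject neighbouring digits.
    static final Pattern STRICT = Pattern.compile("(?<![.x\\d])\\d+(?![.\\d])");

    // Hypothetical helper: appends ".0" to whatever the pattern matches.
    static String appendPointZero(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("$0.0");
    }

    public static void main(String[] args) {
        // The naive pattern splits 12.22 into partial matches.
        System.out.println(appendPointZero(NAIVE, "12.22"));           // prints 1.02.22.0
        // The strict pattern only touches the stand-alone integer 7.
        System.out.println(appendPointZero(STRICT, "12.22+x1+.45+7")); // prints 12.22+x1+.45+7.0
    }
}
```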

But wait -- that's not the regex I used in my earlier post! The difference in the lookbehind - (?<![\\w.]) - isn't really significant, but what's up with the rest of it? Well, the thing about the negative lookahead approach is that it doesn't prevent backtracking; it just prevents the incorrect matches that would be caused by backtracking. So, if the regex I just built were applied to a sequence of 100 digits followed by a dot, the main part would consume all the digits, but the negative lookahead would barf on the dot. Then it would start backtracking, giving up one character at a time and reapplying the lookahead at each position until it finally arrived back at the first digit and gave up. That's ninety-nine totally pointless operations.

This is what possessive quantifiers are for. By adding that second plus sign - \\d++ - I'm telling the regex engine to match as many digits as it can (and at least one, of course) and never give any of them back, even if that makes an overall match impossible. Since it will never backtrack, it doesn't have to save all the intermediate state information that makes backtracking possible, so it saves memory as well as time. Of course, none of this stuff really matters in this context -- negative lookahead works just fine. I just wanted to seize this opportunity to demonstrate the use of possessive quantifiers in a simple, real-world application.
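A small sketch of the difference, using a stripped-down pattern without the lookbehind so that only the quantifier changes. PossessiveDemo and firstMatch are hypothetical names; the input "123." is an assumed example, not one from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PossessiveDemo {
    // Hypothetical helper: first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "123.";
        // Greedy: \d+ takes "123", the lookahead fails on the dot,
        // so \d+ gives back the '3' and the match becomes "12".
        System.out.println(firstMatch("\\d+(?!\\.)", input));  // prints 12
        // Possessive: \d++ never gives digits back, so every attempt
        // ends at the dot and no match is found.
        System.out.println(firstMatch("\\d++(?!\\.)", input)); // prints null
    }
}
```

Note that without a lookbehind the possessive version still retries at each later digit. It is the lookbehind in the full regex that rules those starting positions out.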

Finally, there's the matter of your home-grown replacement code -- you're making things way too complicated! Your first recourse should always be replaceAll(); if you can't use that (maybe you need to do some extra processing on the replacement text), try appendReplacement()/appendTail() in the Matcher class. And if you can't accomplish your goal with those, the regex package is probably just the wrong tool for the job.
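For the case where replaceAll() isn't enough, the appendReplacement()/appendTail() loop might look roughly like the sketch below. AppendReplacementDemo and rewrite are hypothetical names, and the "extra processing" step is left as a placeholder comment because the thread doesn't say what it would be.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AppendReplacementDemo {
    private static final Pattern INTEGER =
            Pattern.compile("(?<![\\w.])\\d++(?!\\.)");

    // Hypothetical helper: rewrites each integer literal in one pass.
    static String rewrite(String expression) {
        Matcher m = INTEGER.matcher(expression);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String literal = m.group();
            // Any per-match processing would go here.
            String replacement = literal + ".0";
            // quoteReplacement guards against '$' or '\' in the text.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        // Copy whatever follows the last match.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("1/x4+1*17")); // prints 1.0/x4+1.0*17.0
    }
}
```

Unlike the hand-rolled loop from the first post, there is no index bookkeeping: the Matcher keeps track of the append position itself.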
 

Posts:9
Registered: 7/26/04
Re: Regular Expressions  
Jul 29, 2004 5:52 AM (reply 6 of 8)



 
thanks a lot for your explanation ! ;)
 

Posts:24,036
Registered: 2/3/03
Re: Regular Expressions  
Jul 29, 2004 6:10 AM (reply 7 of 8)



 
Uncle_alice is the shiznit, no doubt.

That's just awesome!
 

Posts:9
Registered: 7/26/04
Re: Regular Expressions  
Jul 29, 2004 6:24 AM (reply 8 of 8)



 
and you were right, this pattern works perfectly :
\\b(?<!\\.|x)[0-9]+[^\\.0-9]
;)
 