Java Programming [Archive] - What alternative would there be to using StreamTokenizer?
This topic has 15 replies on 2 pages.

Posts:20
Registered: 10/20/03
What alternative would there be to using StreamTokenizer?
Aug 3, 2004 10:11 AM



 
Hello,
Currently I am using a StreamTokenizer to parse data from a file. The problem is that it is far too slow, and I'd like a faster way to process the data. If there is a better alternative to StreamTokenizer, I'd like to know; I've looked around and haven't found anything. I thought the java.nio classes might help, and the raw I/O is now faster than it was with BufferedReader, but the StreamTokenizer is still the bottleneck. The files range from 9 MB to 20 MB, and I need to find values and hold the data. Below is a very generic structure for the files I'm reading. I use the headings (e.g. [Values]) to sort through the data under each heading and then move on using token.nextToken. If anyone could give me some ideas I'd appreciate it. Thank you in advance.
An Example:

[Values]
1 = "String1"
2 = "String2"

[Data Part1]
JOB_1 = 1021201212
PART_1 = 21231331

[TESTS]
T_000_= Data; Data;Data
through
T_1442 = Data; Data; Data
 

Posts:403
Registered: 9/4/03
Re: What alternative would there be to using StreamTokenizer?
Aug 3, 2004 10:18 AM (reply 1 of 15)



 
Take a look at the regex APIs in the NIO libraries. They will probably be faster.
 

Posts:19,725
Registered: 9/26/01
Re: What alternative would there be to using StreamTokenizer?
Aug 3, 2004 10:19 AM (reply 2 of 15)



 
9MB-20MB for a file that's in "human-readable" form that needs to be tokenized? Erm, thought about maybe a database instead? Something more 'machine-readable'?
 

Posts:20
Registered: 10/20/03
Re: What alternative would there be to using StreamTokenizer?
Aug 3, 2004 10:34 AM (reply 3 of 15)



 
I haven't quite reached the regex stuff yet in the book but it sounds very interesting. Using a database is not an option for the time being. There are a couple different application engines that process different data and create reports for the data. The data being passed to my engine is one of the reports and for the time being I have to parse it as a document/file. The time it takes to read and store the data can vary from 30 seconds to two minutes depending on the data. I hope that the regex stuff works better. Thanks for your help.
 

Posts:19,725
Registered: 9/26/01
Re: What alternative would there be to using StreamTokenizer?
Aug 3, 2004 10:40 AM (reply 4 of 15)



 
I love it when the scenario is basically:
Q: The design of the system is messed up, but I have to live with it. How can I live with the design and still make it run as if it were designed correctly in the first place? (In this example, a humongous text file being processed by code instead of humans is the "design")
A: The design has to be fixed (In this example, the humongous text file should probably have been a database/binary/machine-readable thingy in the first place)

Response: But I have to live with the design...

The old adage "garbage in, garbage out" applies.
 

Posts:20
Registered: 10/20/03
Re: What alternative would there be to using StreamTokenizer?
Aug 3, 2004 10:53 AM (reply 5 of 15)



 
I'm sorry you don't have a clue on how to fix the design. The design wasn't all my coding. But I guess we could send you to all of our customers and tell them why they have to invest in a database for each of the sites we have our software installed on. I'm sure they'd love that. I can think of a few ways to change the problem, but not without creating a different problem or displeasing customers. Currently there is no database option, but in the future there may be. A nice little quote for you: "If you're not part of the solution, you're part of the problem". Try actually giving advice instead of criticism, and give some structure instead of a one-liner. I'd think that most file I/O has to do with readable content, but I guess that isn't your strong point. :-)
 

Posts:19,725
Registered: 9/26/01
Re: What alternative would there be to using StreamTokenizer?
Aug 3, 2004 10:56 AM (reply 6 of 15)



 
Ok, then just another one-liner for you: buzz off
 

Posts:24,517
Registered: 98-02-27
Re: What alternative would there be to using StreamTokenizer?
Aug 3, 2004 8:11 PM (reply 7 of 15)



 
> Take a look at the regex APIs in the NIO libraries. They will probably be faster.

Regular Expressions are 4 times slower than using other methods (like String.indexOf) or classes (like StringTokenizer). On a 20M file this is probably important.

I don't see any need to use the StreamTokenizer to parse the file. Based on the sample data you provided just use a simple buffered reader.

a) read a line. If it starts with "[" you've got a header record
b) create a loop reading line by line until the next header record.
c) if you need to parse the data in the detail lines then use a StringTokenizer or String.indexOf() to parse out the data.
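
Steps a) to c) can be sketched roughly as follows. This is only a minimal illustration against the sample layout posted earlier in the thread, not code from any poster: the class name SectionParser, the parse method, and the choice of nested maps are all assumptions, and it uses generics and other language features newer than the Java available in 2004.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: groups "key = value" lines under their [Header].
final class SectionParser {

    static Map<String, Map<String, String>> parse(Reader in) throws IOException {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        Map<String, String> current = null;
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            // a) a line starting with "[" is a header record
            if (line.startsWith("[")) {
                int end = line.indexOf(']');
                String name = line.substring(1, end < 0 ? line.length() : end);
                current = new LinkedHashMap<>();
                sections.put(name, current);
                continue;
            }
            // b) every other line belongs to the most recent header
            // c) split the detail line at the first '=' with indexOf
            int eq = line.indexOf('=');
            if (current == null || eq < 0) {
                continue; // skip lines that cannot be placed, such as "through"
            }
            current.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
        }
        return sections;
    }

    public static void main(String[] args) throws IOException {
        String sample = "[Values]\n1 = \"String1\"\n2 = \"String2\"\n\n"
                + "[Data Part1]\nJOB_1 = 1021201212\nPART_1 = 21231331\n";
        System.out.println(parse(new StringReader(sample)));
    }
}
```

For a real file, the StringReader would be replaced by a FileReader; whether this is actually faster than StreamTokenizer on 9 MB to 20 MB inputs would have to be measured.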
 

Posts:2,909
Registered: 13.8.2003
Re: What alternative would there be to using StreamTokenizer?
Aug 3, 2004 10:35 PM (reply 8 of 15)



 
> Regular Expressions are 4 times slower than using other methods (like String.indexOf) or classes (like StringTokenizer). On a 20M file this is probably important.

I'd like to know where you got that "4 times slower".
 

Posts:24,517
Registered: 98-02-27
Re: What alternative would there be to using StreamTokenizer?
Aug 4, 2004 11:31 AM (reply 9 of 15)



 
> I'd like to know where you got that "4 times slower".

By writing test programs.
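
No test program was posted in the thread, so the following is only a rough sketch of what such a comparison might look like. The class name SplitTiming and both method names are hypothetical, the timing loop is deliberately naive (no warm-up control, no statistics), and it makes no claim about which approach wins or by how much; a harness such as JMH would give more trustworthy numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical comparison of a hand-written indexOf split and a regex split.
final class SplitTiming {

    // Compiled once so the regex path does not pay the compile cost per line.
    private static final Pattern SEMICOLON = Pattern.compile(";");

    static List<String> splitWithIndexOf(String s, char delimiter) {
        List<String> out = new ArrayList<>();
        int start = 0;
        int end = s.indexOf(delimiter);
        while (end >= 0) {
            out.add(s.substring(start, end));
            start = end + 1;
            end = s.indexOf(delimiter, start);
        }
        out.add(s.substring(start)); // last token
        return out;
    }

    static List<String> splitWithRegex(String s) {
        // A negative limit keeps trailing empty tokens, matching the loop above.
        return Arrays.asList(SEMICOLON.split(s, -1));
    }

    public static void main(String[] args) {
        String line = "T_1442 = Data; Data; Data";
        int rounds = 1_000_000;
        long sink = 0; // consume the results so the work is not optimized away

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += splitWithIndexOf(line, ';').size();
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += splitWithRegex(line).size();
        }
        long t2 = System.nanoTime();

        System.out.println("indexOf: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("regex:   " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("(checksum " + sink + ")");
    }
}
```

Results from a loop like this vary with the JVM version, the input, and the order in which the two paths run, so any ratio it prints should be treated as indicative at best.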
 

Posts:27,518
Registered: 11/3/97
Re: What alternative would there be to using StreamTokenizer?
Aug 4, 2004 11:49 AM (reply 10 of 15)



 
> > Regular Expressions are 4 times slower than using other methods (like String.indexOf) or classes (like StringTokenizer). On a 20M file this is probably important.

> I'd like to know where you got that "4 times slower".

I haven't tested it myself, but there are other threads that have posted tests that demonstrated a significant speed difference.

I have written a regex library myself, and I have also written parsers. Someone who makes a regex library as fast as or faster than a custom parser is going to be a very smart person (presuming, of course, that it isn't possible to prove that doing so is impossible in the first place).
 

Posts:27,518
Registered: 11/3/97
Re: What alternative would there be to using StreamTokenizer?
Aug 4, 2004 11:50 AM (reply 11 of 15)



 
> Currently I am using a StreamTokenizer to parse data from a file.

Write a custom parser.

And if you are parsing the same data more than once, then put the data into another format on the first parse so it is faster on the subsequent tries.
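
One way to read "another format" is a simple binary dump that is written after the first parse and read back on later runs. The sketch below is an assumption about what that could look like, not something posted in the thread: the class name ParsedCache, its write and read methods, and the flat key/value layout are all hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical cache: stores already-parsed key/value pairs in binary form.
final class ParsedCache {

    // Layout: entry count, then each key and value as length-prefixed UTF strings.
    // Note that writeUTF rejects strings whose encoded form exceeds 65535 bytes.
    static void write(Map<String, String> entries, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(new BufferedOutputStream(out));
        data.writeInt(entries.size());
        for (Map.Entry<String, String> e : entries.entrySet()) {
            data.writeUTF(e.getKey());
            data.writeUTF(e.getValue());
        }
        data.flush();
    }

    static Map<String, String> read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in));
        int count = data.readInt();
        Map<String, String> entries = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            String key = data.readUTF();
            entries.put(key, data.readUTF());
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("JOB_1", "1021201212");
        parsed.put("PART_1", "21231331");

        // In-memory round trip; a real cache would use file streams instead.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        write(parsed, buffer);
        System.out.println(read(new ByteArrayInputStream(buffer.toByteArray())));
    }
}
```

Reading the cache skips tokenizing entirely, but it only pays off if the same report really is loaded more than once and the cache is invalidated whenever the source file changes.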
 

Posts:3,258
Registered: 00-08-28
Re: What alternative would there be to using StreamTokenizer?
Aug 4, 2004 12:28 PM (reply 12 of 15)



 
> > Currently I am using a StreamTokenizer to parse data from a file.

> Write a custom parser.

> And if you are parsing the same data more than once, then put the data into another format on the first parse so it is faster on the subsequent tries.

I think this would be your best bet right now, because a custom parser can be as simple as this:

public static void mySimpleTokenizer(String s, String delimiter) {
    String sub = null;
    int i = 0;
    int j = s.indexOf(delimiter);      // position of the first delimiter
    while (j >= 0) {
        sub = s.substring(i, j);       // token between two delimiters
        i = j + delimiter.length();    // skip past the whole delimiter
        j = s.indexOf(delimiter, i);   // position of the next delimiter
    }
    sub = s.substring(i);              // last token
}

I read somewhere that the above method works almost 4 times faster than StringTokenizer, because of less overhead.

Also just wondering if the following links could be of any help
http://ostermiller.org/utils/StringTokenizer.html
http://www.javaperformancetuning.com/news/roundup032.shtml
http://www.ftponline.com/javapro/2002_08/online/servletsjsp_08_06_02/Java%20Servlets%20Ch16.pdf

 

Posts:3,258
Registered: 00-08-28
Re: What alternative would there be to using StreamTokenizer?
Aug 4, 2004 12:36 PM (reply 13 of 15)



 
Sorry, disregard my previous post. I guess I had too much lunch and mistook stream for string. TOOO BAD
 

Posts:403
Registered: 9/4/03
Re: What alternative would there be to using StreamTokenizer?
Aug 5, 2004 9:59 AM (reply 14 of 15)



 
Interesting. I had no idea that regex was slower than indexOf or tokenizers. Learn something new every day.
 