Registered: 2/12/03
Looking For an HTML Parser  
Aug 5, 2004 10:13 AM


I'm trying to parse an HTML document for the attributes of the following tags: <IMG> and <A>

I thought of using an XML parser (SAXParser), but there are some problems:

1. The HTML document is not well-formed XML (some end tags are missing).
2. Scripts (such as JavaScript, VBScript): would these affect the SAX parser?
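
Both concerns can be checked directly with the JDK's own SAX parser. The following is only a minimal sketch: the class name `WellFormedCheck` and the sample strings are made up for illustration, and it assumes the default `SAXParserFactory` shipped with the JDK. The first two samples (a missing end tag, and a script body with unescaped `<` and `&&`) are rejected; the third, written the way XHTML requires, is accepted.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the JDK SAX parser accepts the text as well-formed XML.
    static boolean isWellFormed(String text) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(text)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser hit a well-formedness error
        } catch (Exception e) {
            throw new RuntimeException(e); // configuration or I/O trouble, not expected here
        }
    }

    public static void main(String[] args) {
        // Unclosed <img> and <p>: ordinary HTML, but not XML.
        System.out.println(isWellFormed("<html><body><p>hi<img src=\"a.gif\"></body></html>"));
        // Unescaped '<' and '&&' inside a script body also break the XML rules.
        System.out.println(isWellFormed("<html><body><script>if (a < b && c) {}</script></body></html>"));
        // The same kind of markup, closed the way XHTML requires.
        System.out.println(isWellFormed("<html><body><p>hi<img src=\"a.gif\"/></p></body></html>"));
    }
}
```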

I thought of converting the HTML document to DHTML so that it would be well-formed enough for an XML parser,
but I don't know of any converter. I'm not good with Google, so if someone can help me with keywords to search for, I would appreciate it.

I also thought of using an HTML parser, but ran into a problem:
1. I searched Google and so many results came up that I don't know which HTML parser is good
to use, so I'm hoping someone in the forum has used an HTML parser before and can recommend one.
What I'm looking for is:
a) a very simple API to parse the HTML document and extract data from it
b) won't crash on JavaScript, VBScript, other scripts, etc.
c) supports at least HTML 4
d) can handle missing end tags

Or if you have another solution, I'm all ears.
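
One possible "other solution" that needs no download is the HTML parser bundled with Swing. The sketch below is an illustration under assumptions, not a recommendation: the class name `SwingTagScan` and the sample markup are invented, and the bundled parser is built on an HTML 3.2 DTD, so it does not strictly meet requirement (c); how it copes with script-heavy pages is worth testing on real input. It does tolerate missing end tags.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingTagScan {
    // Collects "tag attr=value" strings for every <a> and <img> in the input.
    static List<String> scan(String html) {
        final List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                record(tag, attrs); // tags with content, such as <a>
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                record(tag, attrs); // empty tags, such as <img>
            }

            private void record(HTML.Tag tag, MutableAttributeSet attrs) {
                if (tag != HTML.Tag.A && tag != HTML.Tag.IMG) {
                    return;
                }
                Enumeration<?> names = attrs.getAttributeNames();
                while (names.hasMoreElements()) {
                    Object name = names.nextElement();
                    found.add(tag + " " + name + "=" + attrs.getAttribute(name));
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<html><body><A HREF=\"x.html\">link</A>"
                    + "<IMG SRC=\"a.gif\"><p>unclosed paragraph</body></html>";
        for (String line : scan(html)) {
            System.out.println(line);
        }
    }
}
```

For a file on disk, a `java.io.FileReader` could be passed to `parse` in place of the `StringReader`.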

Thanks in advance.


I'm looking for an HTML parser that will parse an HTML document (the document is on the desktop).
I was looking for a parser that can handle all HTML tags (including JavaScript, VBScript, etc.),

although I only need to extract the attributes of the <IMG> and <A> tags.

I googled and came up with so many parsers.
I'm asking because I'm looking for a good parser with a simple API,
hoping that some of you who have worked with an HTML parser can guide me to one.

I'd rather use an XML parser, but the page is not well-formed XML: some tags are missing.


Registered: 4/30/99
Re: Looking For an HTML Parser  
Aug 5, 2004 11:30 AM (reply 1 of 2)

Use HTMLTidy to make your rubbishy HTML into well-formed XHTML. Then it's XML and away you go.

If you have to do this many times programmatically, then there's JTidy.
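
Once the page has been tidied into XHTML, the extraction itself is a small SAX handler. This is a rough sketch only: it assumes the tidying step has already happened (the sample string stands in for Tidy's output, and no Tidy or JTidy calls are shown), and the class name `XhtmlAttrDump` is made up. If the tidied file carries a DOCTYPE, the parser may try to fetch the DTD, which an `EntityResolver` can prevent.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XhtmlAttrDump {
    // Returns one "tag attr=value" string per attribute of each <a> and <img> element.
    static List<String> extract(String xhtml) {
        final List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // The default factory is not namespace-aware, so qName is the plain tag name.
                if (qName.equalsIgnoreCase("a") || qName.equalsIgnoreCase("img")) {
                    for (int i = 0; i < atts.getLength(); i++) {
                        out.add(qName + " " + atts.getQName(i) + "=" + atts.getValue(i));
                    }
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new RuntimeException("input was not well-formed XML", e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for tidied output: every tag closed, attribute values quoted.
        String xhtml = "<html><body><a href=\"x.html\">x</a>"
                     + "<img src=\"a.gif\" alt=\"pic\"/></body></html>";
        for (String line : extract(xhtml)) {
            System.out.println(line);
        }
    }
}
```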

Registered: 2/12/03
Re: Looking For an HTML Parser  
Aug 5, 2004 11:55 AM (reply 2 of 2)

No wonder I didn't get results from Google: I was searching for "DHTML" instead of "XHTML".

Thanks, Dr Clap.

HTMLTidy is the sort of tool I needed.
From the HTMLTidy page I also got the keyword, which was "HTML Validate", and Google gave me lots of products that validate HTML and convert it to XHTML (or at least help with the conversion).
