python - Remove everything between two tags that span branches of an XML tree
I am trying to remove everything between two tags in an XML document, using Python and lxml. The problem is that the tags can be in different branches of the tree (but always at the same depth), as in a document such as this:

    <root><p>Hello world <start/>This is a paragraph</p><p>Goodbye world. <end/>I'm leaving now</p></root>

I want to remove everything between the start and end tags, which would leave a single <p> element:

    <root><p>Hello world I'm leaving now</p></root>

Does anyone know how this can be done in Python?
You can try a SAX-like approach using a parser target:

    from lxml import etree

    class SkipStartEndTarget:
        def __init__(self, *args, **kwargs):
            # The builder assembles the output tree from the events we forward.
            self.builder = etree.TreeBuilder()
            self.skip = False

        def start(self, tag, attrib, nsmap=None):
            # Start skipping as soon as <start/> opens.
            if tag == 'start':
                self.skip = True
            if not self.skip:
                self.builder.start(tag, attrib, nsmap)

        def data(self, data):
            if not self.skip:
                self.builder.data(data)

        def comment(self, comment):
            if not self.skip:
                self.builder.comment(comment)

        def pi(self, target, data):
            if not self.skip:
                self.builder.pi(target, data)

        def end(self, tag):
            if not self.skip:
                self.builder.end(tag)
            # Stop skipping once <end/> has closed.
            if tag == 'end':
                self.skip = False

        def close(self):
            self.skip = False
            return self.builder.close()

With the SkipStartEndTarget class you can create a parser target, and with that a custom XMLParser, like this:

    parser = etree.XMLParser(target=SkipStartEndTarget())
You can still pass other parser options to the parser if you need them. You can then hand this parser to whichever parsing function you are using, for example:

    elem = etree.fromstring(xml_str, parser=parser)

This also works with etree.XML() and etree.parse(), and you can even make it the default parser by passing it to etree.set_default_parser() (which is probably not a good idea). One thing that might trip you up: even with etree.parse(), this will not return an ElementTree but always an Element (as etree.XML() and etree.fromstring() do). I don't think that can be changed (yet), so if it is an issue for you, you will have to work around it somehow.
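To make the usage concrete, here is a minimal end-to-end sketch run against the document from the question. It uses a trimmed-down variant of the SkipStartEndTarget idea (comment and PI handling left out for brevity). Two things are my own assumptions rather than part of the answer: the fallback to the standard library's xml.etree.ElementTree when lxml is not installed, and wrapping the returned Element with etree.ElementTree() as one possible way to get a tree object.

```python
try:
    from lxml import etree  # the library the answer is written for
except ImportError:
    # Assumption: fall back to the standard library, whose XMLParser
    # accepts a target object with the same start/data/end/close methods.
    import xml.etree.ElementTree as etree


class SkipStartEndTarget:
    """Forwards parse events to a TreeBuilder, except between <start/> and <end/>."""

    def __init__(self):
        self.builder = etree.TreeBuilder()
        self.skip = False

    def start(self, tag, attrib):
        if tag == 'start':
            self.skip = True
        if not self.skip:
            self.builder.start(tag, attrib)

    def data(self, data):
        if not self.skip:
            self.builder.data(data)

    def end(self, tag):
        if not self.skip:
            self.builder.end(tag)
        if tag == 'end':
            self.skip = False

    def close(self):
        self.skip = False
        return self.builder.close()


xml_str = (
    "<root><p>Hello world <start/>This is a paragraph</p>"
    "<p>Goodbye world. <end/>I'm leaving now</p></root>"
)

parser = etree.XMLParser(target=SkipStartEndTarget())
elem = etree.fromstring(xml_str, parser=parser)  # an Element, not an ElementTree

# Possible workaround (my suggestion): wrap the Element if a tree object is needed.
tree = etree.ElementTree(elem)

print(etree.tostring(elem))
```

The two <p> elements collapse into one because the end of the first <p> and the start of the second both fall inside the skipped region, so the builder never sees them.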
Note that it is also possible to build an ElementTree from SAX events using lxml.sax, which is probably somewhat more difficult and slower. Unlike the example above, it will return an ElementTree, but I think it does not provide the .docinfo that you get when using etree.parse(). I also believe that it (currently) does not support comments and PIs. (I have not used it myself yet, so I cannot be more precise at this time.)
Also keep in mind that any SAX-like approach requires that the document left over after ignoring everything between <start/> and <end/> is still well-formed. That is the case in your example, but it would not be if the second <p> were, for example, a <p2>, as you would end up with <p>....</p2>.
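The well-formedness caveat can be checked before building anything. The sketch below is my own illustration, not part of the answer: BalanceCheckTarget and stays_well_formed are hypothetical names, and the fallback to xml.etree.ElementTree is an assumption for environments without lxml. The target builds no tree; it only tracks which tags would survive the skipping and reports whether they still pair up.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library parser accepts the same kind of target.
    import xml.etree.ElementTree as etree


class BalanceCheckTarget:
    """Hypothetical helper: checks that tags outside <start/>..<end/> still match."""

    def __init__(self):
        self.skip = False
        self.open_tags = []   # tags that would reach the tree builder
        self.balanced = True

    def start(self, tag, attrib):
        if tag == 'start':
            self.skip = True
        if not self.skip:
            self.open_tags.append(tag)

    def end(self, tag):
        if not self.skip:
            # The closing tag must match the most recently kept opening tag.
            if not self.open_tags or self.open_tags.pop() != tag:
                self.balanced = False
        if tag == 'end':
            self.skip = False

    def close(self):
        # Whatever close() returns is what the parsing function hands back.
        return self.balanced and not self.open_tags


def stays_well_formed(xml_text):
    parser = etree.XMLParser(target=BalanceCheckTarget())
    return etree.fromstring(xml_text, parser=parser)


good = ("<root><p>Hello world <start/>This is a paragraph</p>"
        "<p>Goodbye world. <end/>I'm leaving now</p></root>")
bad = ("<root><p>Hello world <start/>This is a paragraph</p>"
       "<p2>Goodbye world. <end/>I'm leaving now</p2></root>")

print(stays_well_formed(good))  # True
print(stays_well_formed(bad))   # False: <p> would be closed by </p2>
```

Running a check like this first avoids relying on how a particular TreeBuilder implementation reacts to a mismatched end tag.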