Mike Taulty's Blog
Bits and Bytes from Microsoft UK
LINQ to XML and Larger Files


LINQ to XML, and the XML API that underpins it in the System.Xml.Linq namespace, is essentially a DOM-like API.

This means that a document is loaded into memory to form a "tree" representation, so the bigger the document you have, the more memory you're going to need.

That's not always practical, and the existing XML classes in the .NET Framework V2.0 have dealt with this by providing both DOM-like functionality (i.e. XmlDocument, XPathDocument and so on) and streaming functionality via XmlReader and XmlWriter. The essential idea of an XmlReader is that it provides a forward-only, read-only cursor over an XML document.
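As a minimal sketch of that forward-only cursor, the snippet below walks a small in-memory document with XmlReader and counts matching elements without ever building a tree. The class name ReaderSketch, the method CountElements and the sample XML are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

static class ReaderSketch
{
    // Counts elements with the given name by walking the reader forwards once.
    public static int CountElements(TextReader input, string name)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(input))
        {
            // Read() advances the cursor one node at a time; there is no way to move back.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == name)
                {
                    count++;
                }
            }
        }
        return count;
    }

    static void Main()
    {
        // Hypothetical sample data, shaped like the file used later in this post.
        string xml = "<data><customers Country=\"UK\" /><customers Country=\"Germany\" /></data>";
        Console.WriteLine(CountElements(new StringReader(xml), "customers")); // prints 2
    }
}
```

Only the node currently under the cursor is available at any moment, which is why memory use stays flat however large the input is.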

LINQ to XML does not have a general purpose way of working with large documents (read the discussion here and here) but it does propose a pattern for doing it based on using an XmlReader.

I created a relatively large XML file by selecting data from Northwind using a FOR XML query and then used a quick PowerShell loop to concatenate the file onto itself until it got to just under 100MB. So, it looks like this;

<data>
<customers CustomerID="ABCDE" CompanyName="Microsoft" recordVersion="AAAAAAAAJxE=" />
<customers CustomerID="ALFKI" CompanyName="Alfreds Futterkiste" ContactName="Maria Anders" ContactTitle="Sales Representative" Address="Obere Str. 57" City="Berlin" PostalCode="12209" Country="Germany" Phone="030-0074321" Fax="030-0076545" recordVersion="AAAAAAAAB/g=" />

and it goes on for 100MB.

Reading XML

Say, I want to run a simple query against this to find the number of customers who are living in the UK. I can write;

    StartCounters();

    XElement cont = XElement.Load("Customers_100MB.xml");

    PrintCounters("Loading file");

    StartCounters();

    var query = from c in cont.Descendants("customers")
                where (string)c.Attribute("Country") == "UK"
                select c;

    int x = query.Count();
    
    PrintCounters("Executing query");

    Console.ForegroundColor = ConsoleColor.White;
    Console.WriteLine("Total UK customers {0}", x);

Note that StartCounters() and PrintCounters() are just little functions that print out the elapsed time (perhaps not so useful) and the growth in the Private Bytes of the process between a call to StartCounters() and a call to PrintCounters().
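The two helpers aren't listed in this post, so here is a rough sketch of what they might look like. It assumes Stopwatch for the timing and Process.PrivateMemorySize64 for Private Bytes; the class name Counters and the output format are guesses modelled on the output shown in this post.

```csharp
using System;
using System.Diagnostics;

static class Counters
{
    static readonly Stopwatch watch = new Stopwatch();
    static long startPrivateBytes;

    // Records the starting Private Bytes and (re)starts the timer.
    public static void StartCounters()
    {
        startPrivateBytes = CurrentPrivateBytes();
        watch.Reset();
        watch.Start();
    }

    // Prints elapsed time and Private Bytes growth since the last StartCounters().
    public static void PrintCounters(string label)
    {
        watch.Stop();
        long growthKB = (CurrentPrivateBytes() - startPrivateBytes) / 1024;
        Console.WriteLine("After {0}", label);
        Console.WriteLine("\tElapsed time {0}ms", watch.ElapsedMilliseconds);
        Console.WriteLine("\tAdditional memory used {0}KB", growthKB);
    }

    static long CurrentPrivateBytes()
    {
        // A fresh Process object is fetched each time so the value is not a stale snapshot.
        using (Process process = Process.GetCurrentProcess())
        {
            return process.PrivateMemorySize64;
        }
    }

    static void Main()
    {
        StartCounters();
        byte[] buffer = new byte[10 * 1024 * 1024]; // something to measure
        buffer[buffer.Length - 1] = 1;
        PrintCounters("Allocating buffer");
    }
}
```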

When I run this on the 100MB file, I see that running the program gives me;

After Loading file
        Elapsed time 5257ms
        Additional memory used 189988KB
After Executing query
        Elapsed time 368ms
        Additional memory used 260KB
Total UK customers 11634
Press any key to continue . . .

So, loading the file cost me 189MB, which is approximately twice the size of the file (as an aside, I view that as not so bad for a DOM-like API as I always used to assume 4x the document size in the past).

Now, for this particular query (and, generally, for queries that can be satisfied by going forwards through an XML file rather than jumping around in it) we can write what the LINQ to XML documentation calls a "custom axis" function that uses an XmlReader to project XML pretty much "a row at a time" or, perhaps, a "significant fragment at a time" where "significant fragment" means something that helps evaluate the query.

For my query, I can write something like;

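  static IEnumerable<XElement> StreamElements(string uri, string name)
  {
    using (XmlReader reader = XmlReader.Create(uri))
    {
      reader.MoveToContent();

      // Step off the root element so that only its descendants are tested.
      reader.Read();

      while (!reader.EOF)
      {
        if ((reader.NodeType == XmlNodeType.Element) &&
          (reader.Name == name))
        {
          // ReadFrom consumes the element and leaves the reader on the node
          // that follows it, so we must not call Read() again here or we
          // would skip a node (and miss adjacent elements).
          XElement element = (XElement)XElement.ReadFrom(reader);
          yield return element;
        }
        else
        {
          reader.Read();
        }
      }
      // No explicit Close() needed; the using block disposes the reader.
    }
  }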

Now, this is a bit of a trick. I say "trick" because I'm using C#'s iterators feature to return an enumeration of XElement one element at a time, which means that we use the XmlReader to "cursor" our way through the XML file as the iterator is enumerated. Again, this wouldn't work for lots of queries that "jump around" in the XML content but it works here so I can now rewrite my query as;

    StartCounters();

    var query = from c in StreamElements("Customers_100MB.xml", "customers")
                where (string)c.Attribute("Country") == "UK"
                select c;

    int x = query.Count();

    PrintCounters("Executing query");

    Console.ForegroundColor = ConsoleColor.White;
    Console.WriteLine("Total UK customers {0}", x);

Now, if we run this then I see the result;

After Executing query
        Elapsed time 3345ms
        Additional memory used 2416KB
Total UK customers 11634

And it looks like I've used an extra ~2MB rather than an extra ~190MB in order to achieve the same thing.

Note: My counting of growth in "Private Bytes" here is a little unscientific because all kinds of things might contribute to that byte count (e.g. JITted code and so on) but it's a good enough metric for what I'm doing here.
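To see the element-at-a-time behaviour of the iterator in isolation, here is a small sketch that runs the same kind of query over an in-memory document. The TextReader-based StreamElements variant, the ElementsRead counter and the sample data are my own additions for illustration; they are not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class LazyStreamingSketch
{
    // How many elements the iterator has materialised so far.
    public static int ElementsRead;

    public static IEnumerable<XElement> StreamElements(TextReader input, string name)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            reader.Read(); // step off the root element
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == name)
                {
                    // ReadFrom leaves the reader on the node after the element.
                    XElement element = (XElement)XNode.ReadFrom(reader);
                    ElementsRead++;
                    yield return element;
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    static void Main()
    {
        // Hypothetical data with no whitespace between the elements.
        string xml = "<data>" +
            "<customers CustomerID=\"A\" Country=\"UK\" />" +
            "<customers CustomerID=\"B\" Country=\"Germany\" />" +
            "<customers CustomerID=\"C\" Country=\"UK\" />" +
            "</data>";

        var query = StreamElements(new StringReader(xml), "customers")
            .Where(c => (string)c.Attribute("Country") == "UK");

        Console.WriteLine(ElementsRead);                          // 0: nothing has been read yet
        XElement first = query.First();
        Console.WriteLine((string)first.Attribute("CustomerID")); // A
        Console.WriteLine(ElementsRead);                          // 1: reading stopped after the first match
    }
}
```

Nothing is read until the query is enumerated, and First() stops the reader as soon as it has a match, so at most one XElement needs to be alive at a time.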

Writing XML

Now, imagine that I want to take the Customers_100MB.xml file and transform it in some fashion. I've already got a reasonably nice way of streaming the file into my program so what happens if I transform it with code such as;

    StartCounters();

    XElement newXML =
      new XElement("customers",
        from element in StreamElements("Customers_100MB.xml", "customers")
        where (string)element.Attribute("Country") == "UK"
        select new XElement("customer",
          new XElement("contactDetails",
            new XAttribute("contactName", (string)element.Attribute("ContactName")),
            new XAttribute("contactTitle", (string)element.Attribute("ContactTitle")))));

    PrintCounters("Creating new XML tree");

    newXML.Save("c:\\temp\\newXml.xml");

 

So, here we're taking the <customers/> elements that come back from the input file and we're writing a new file from them that looks like;

<customers><customer><contactDetails contactName="..." contactTitle="..."/>

and the size of this output file is going to be smaller than 100MB because we're not repeating the entire input file but it's still going to be quite a big file. If I run the code I see;

After Creating new XML tree
        Elapsed time 3977ms
        Additional memory used 7224KB

Now, remembering that I used ~2MB to stream in the input file in the first case this looks like writing out the file has cost an additional ~5MB.

But, in this case, it wasn't really necessary to create the tree referenced by the variable newXML at all. That tree is only used to persist the XML to disk and so it wasn't really necessary to bring the tree into memory (even though I bring it in via streaming).

So, it'd be better if we could stream the XML out in the same way that we're streaming it in. There's a class, XStreamingElement, that helps with that and I can use it as below;

    StartCounters();

    XStreamingElement newXML =
      new XStreamingElement("customers",
        from element in StreamElements("Customers_100MB.xml", "customers")
        where (string)element.Attribute("Country") == "UK"
        select new XElement("customer",
          new XElement("contactDetails",
            new XAttribute("contactName", (string)element.Attribute("ContactName")),
            new XAttribute("contactTitle", (string)element.Attribute("ContactTitle")))));

    PrintCounters("Creating new XML tree");

    newXML.Save("c:\\temp\\newXml.xml");

and the only change from the former code is to switch XElement to XStreamingElement. That second class doesn't build a tree in memory here; it runs the query at the point where we hit Save, and so it streams data both in and out.

Running this code gives me;

After Creating new XML tree
        Elapsed time 15ms
        Additional memory used 428KB

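Now, the reason I end up with only ~400KB (and 15ms) here, when it cost me 2MB just to stream the thing into memory and execute query.Count(), is that PrintCounters() is called before Save(). XStreamingElement defers the query until Save, so these figures only cover constructing the deferred query and none of the actual reading or writing; a fair comparison would need PrintCounters() to move after Save(). Even so, "stream in and stream out" should mean that only one output element at a time has to exist in memory, rather than the fully built tree.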

Again, this wouldn't work for every query by any means because it depends on reading the XML in a forward-only fashion.
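The deferred behaviour of XStreamingElement can be observed directly with a small sketch. Here an in-memory Source() method stands in for StreamElements, and the Projected counter, class name and sample data are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

static class StreamingOutputSketch
{
    // Counts how many output elements the query has created so far.
    public static int Projected;

    // Stand-in for StreamElements: yields a few hypothetical input elements.
    static IEnumerable<XElement> Source()
    {
        yield return new XElement("customers",
            new XAttribute("Country", "UK"), new XAttribute("ContactName", "Ann"));
        yield return new XElement("customers",
            new XAttribute("Country", "Germany"), new XAttribute("ContactName", "Bert"));
        yield return new XElement("customers",
            new XAttribute("Country", "UK"), new XAttribute("ContactName", "Cath"));
    }

    static XElement CreateCustomer(XElement element)
    {
        Projected++;
        return new XElement("customer",
            new XElement("contactDetails",
                new XAttribute("contactName", (string)element.Attribute("ContactName"))));
    }

    // Builds the streaming element; the query inside is not run here.
    public static XStreamingElement Build()
    {
        return new XStreamingElement("customers",
            from element in Source()
            where (string)element.Attribute("Country") == "UK"
            select CreateCustomer(element));
    }

    static void Main()
    {
        XStreamingElement newXML = Build();
        Console.WriteLine(Projected);   // 0: constructing the element ran nothing

        StringWriter writer = new StringWriter();
        newXML.Save(writer);            // the query is enumerated here
        Console.WriteLine(Projected);   // 2: both UK customers were projected during Save
    }
}
```

Swapping XStreamingElement for XElement in Build() would make the first number 2 as well, because XElement enumerates its content in the constructor.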

What About VB?

Using C# iterators was a nice way to do this but VB (as of VB 9) doesn't have iterators, so what do you do? Here's a bit of VB doing the same thing as the querying code I originally wrote;

    StartCounters()

    Dim data As XElement = XElement.Load("Customers_100MB.xml")

    Dim query = From c In data...<customers> _
                Where c.@Country = "UK" _
                Select c

    Dim count As Integer = query.Count()

    PrintCounters("After querying")

    Console.WriteLine("Result is {0}", count)

Now, I need a streaming function similar to the StreamElements that we wrote in C# previously so I end up having to build an enumerator;

 Private Function StreamElements(ByVal file As String, ByVal element As String) As IEnumerable(Of XElement)

    Return (New XElementEnumerable(file, element))

  End Function

And then I just wrote (along with the VB editor which wrote some of this code) a quick enumerable/enumerator (not claiming that this is 100% correct but it works in this case);

Imports System.Collections.Generic
Imports System.Xml
Imports System.Xml.Linq

Public Class XElementEnumerable
  Implements IEnumerable(Of XElement)

  Private Class ElementEnumerator
    Implements IEnumerator(Of XElement)

    Private Sub New()

    End Sub

    Public Sub New(ByVal file As String, ByVal elementName As String)

      Me.elementName = elementName

      reader = XmlReader.Create(file)
      reader.MoveToContent()

    End Sub

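    ' The element found by the most recent successful call to MoveNext().
    Private currentElement As XElement
    ' True once the reader has been moved off the root element.
    Private started As Boolean = False

    Public ReadOnly Property Current() As System.Xml.Linq.XElement Implements System.Collections.Generic.IEnumerator(Of System.Xml.Linq.XElement).Current
      Get
        ' No reading here: Current can be called repeatedly without side effects.
        Return (currentElement)
      End Get
    End Property

    Public ReadOnly Property Current1() As Object Implements System.Collections.IEnumerator.Current
      Get
        Return (currentElement)
      End Get
    End Property

    Public Function MoveNext() As Boolean Implements System.Collections.IEnumerator.MoveNext

      ' On the first call, step off the root element that MoveToContent() left us on.
      If Not started Then
        started = True
        reader.Read()
      End If

      While Not reader.EOF

        If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = elementName Then
          ' ReadFrom consumes the element and leaves the reader on the node after it,
          ' so we return without calling Read() again (which would skip a node).
          currentElement = CType(XElement.ReadFrom(reader), XElement)
          Return (True)
        End If

        reader.Read()
      End While

      Return (False)

    End Function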

    Public Sub Reset() Implements System.Collections.IEnumerator.Reset

      Throw New InvalidOperationException("Can not reset this enumerator")

    End Sub

    Private disposedValue As Boolean = False    ' To detect redundant calls

    ' IDisposable
    Protected Overridable Sub Dispose(ByVal disposing As Boolean)
      If Not Me.disposedValue Then
        If disposing Then
          ' Release the XmlReader (and the file handle that it holds open).
          If reader IsNot Nothing Then
            reader.Close()
          End If
        End If
      End If
      Me.disposedValue = True
    End Sub

    Private elementName As String
    Private reader As XmlReader

#Region " IDisposable Support "
    ' This code added by Visual Basic to correctly implement the disposable pattern.
    Public Sub Dispose() Implements IDisposable.Dispose
      ' Do not change this code.  Put cleanup code in Dispose(ByVal disposing As Boolean) above.
      Dispose(True)
      GC.SuppressFinalize(Me)
    End Sub
#End Region

  End Class

  Private Sub New()

  End Sub

  Public Sub New(ByVal file As String, ByVal elementName As String)
    Me.file = file
    Me.elementName = elementName
  End Sub
  Public Function GetEnumerator() As System.Collections.Generic.IEnumerator(Of System.Xml.Linq.XElement) _
    Implements System.Collections.Generic.IEnumerable(Of System.Xml.Linq.XElement).GetEnumerator

    Return New ElementEnumerator(file, elementName)

  End Function

  Public Function GetEnumerator1() As System.Collections.IEnumerator _
    Implements System.Collections.IEnumerable.GetEnumerator

    ' Forward to the generic version so that non-generic callers also work.
    Return Me.GetEnumerator()

  End Function
  Private file As String
  Private elementName As String

End Class

and that does the trick in VB and gets me back to streaming my input XML into my program.


Posted Sat, Sep 8 2007 6:20 AM by mtaulty