LINQ to XML, and the XML API that underpins it in the System.Xml.Linq namespace, is essentially a DOM-like API.
This means that documents are loaded into memory to form a "tree" representation, so the bigger the document you have, the more memory you're going to need.
That's not always practical, and the existing XML classes in the .NET Framework V2.0 have dealt with this by providing both DOM-like functionality (XmlDocument, XPathDocument and so on) and streaming functionality via XmlReader and XmlWriter. The essential idea of an XmlReader is that it provides a forward-only, read-only cursor over an XML document.
LINQ to XML does not have a general purpose way of working with large documents (read the discussion here and here) but it does propose a pattern for doing it based on using an XmlReader.
I created a relatively large XML file from selecting data from Northwind using a FOR XML query and then I used a quick PowerShell loop to concat the file onto itself until it got to just less than 100MB. So, it looks like this;
<data>
<customers CustomerID="ABCDE" CompanyName="Microsoft" recordVersion="AAAAAAAAJxE=" />
<customers CustomerID="ALFKI" CompanyName="Alfreds Futterkiste" ContactName="Maria Anders" ContactTitle="Sales Representative" Address="Obere Str. 57" City="Berlin" PostalCode="12209" Country="Germany" Phone="030-0074321" Fax="030-0076545" recordVersion="AAAAAAAAB/g=" />
and it goes on for 100MB.
Reading XML
Say, I want to run a simple query against this to find the number of customers who are living in the UK. I can write;
StartCounters();
XElement cont = XElement.Load("Customers_100MB.xml");
PrintCounters("Loading file");

StartCounters();
var query = from c in cont.Descendants("customers")
            where (string)c.Attribute("Country") == "UK"
            select c;
int x = query.Count();
PrintCounters("Executing query");

Console.ForegroundColor = ConsoleColor.White;
Console.WriteLine("Total UK customers {0}", x);
Note, that StartCounters() and PrintCounters() are just little functions that try to print out the elapsed time (not perhaps so useful) and the growth in the Private Bytes of the process between a call to StartCounters() and PrintCounters().
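For reference, here's a minimal sketch of what helpers like these might look like. It's an assumption rather than the code that produced the numbers in this post: the class name Counters, the use of Stopwatch and the choice of Process.PrivateMemorySize64 to read Private Bytes are picked purely for illustration, and the output format just mimics the listings shown here.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper class; the post only describes what these functions do.
static class Counters
{
    static readonly Stopwatch watch = new Stopwatch();
    static long startBytes;

    // Remember the current Private Bytes of the process and start timing.
    public static void StartCounters()
    {
        startBytes = PrivateBytes();
        watch.Reset();
        watch.Start();
    }

    // Print the elapsed time and the growth in Private Bytes since StartCounters().
    public static void PrintCounters(string label)
    {
        watch.Stop();
        long growthKB = (PrivateBytes() - startBytes) / 1024;
        Console.WriteLine("After {0}", label);
        Console.WriteLine("Elapsed time {0}ms", watch.ElapsedMilliseconds);
        Console.WriteLine("Additional memory used {0}KB", growthKB);
    }

    // A fresh Process object is used each time because its property values are cached.
    static long PrivateBytes()
    {
        using (Process current = Process.GetCurrentProcess())
        {
            return current.PrivateMemorySize64;
        }
    }
}
```

With something like this in place, the calls in the listing above would be written as Counters.StartCounters() and Counters.PrintCounters("Loading file"), or the two methods could simply be moved into the Program class.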
When I run this on the 100MB file, I see that running the program gives me;
After Loading file
Elapsed time 5257ms
Additional memory used 189988KB
After Executing query
Elapsed time 368ms
Additional memory used 260KB
Total UK customers 11634
Press any key to continue . . .
So, loading the file cost me ~190MB, which is roughly twice the size of the file (as an aside, I don't view that as bad for a DOM-like API; I always used to assume 4x the document size in the past).
Now, for this particular query (and, generally, for queries that can be satisfied by going forwards through an XML file rather than jumping around in it) we can write what the LINQ to XML documentation calls a "custom axis" function that uses an XmlReader to project XML pretty much "a row at a time" or, perhaps, a "significant fragment at a time" where "significant fragment" means something that helps evaluate the query.
For my query, I can write something like;
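static IEnumerable<XElement> StreamElements(string uri, string name)
{
    using (XmlReader reader = XmlReader.Create(uri))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if ((reader.NodeType == XmlNodeType.Element) &&
                (reader.Name == name))
            {
                // ReadFrom() consumes the element and leaves the reader on the
                // node that follows it, so don't call Read() again here or an
                // immediately adjacent element would be skipped.
                XElement element = (XElement)XElement.ReadFrom(reader);
                yield return element;
            }
            else
            {
                reader.Read();
            }
        }
        // No explicit Close() needed: the using block disposes the reader.
    }
}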
Now, this is a bit of a trick. I say "trick" because I'm using C#'s iterators feature to return an enumeration of XElement one element at a time, which means that we use the XmlReader to "cursor" our way through the XML file as the iterator is enumerated. Again, this wouldn't work for lots of queries that "jump around" in the XML content but it works here, so I can now rewrite my query as;
StartCounters();
var query = from c in StreamElements("Customers_100MB.xml", "customers")
            where (string)c.Attribute("Country") == "UK"
            select c;
int x = query.Count();
PrintCounters("Executing query");

Console.ForegroundColor = ConsoleColor.White;
Console.WriteLine("Total UK customers {0}", x);
Now, if we run this then I see the result;
After Executing query
Elapsed time 3345ms
Additional memory used 2416KB
Total UK customers 11634
And it looks like I've used an extra ~2MB rather than an extra ~190MB in order to achieve the same thing.
Note: My counting of growth in "Private Bytes" here is a little unscientific because all kinds of things might contribute to that byte count (e.g. JITted code and so on) but it's a good enough metric for what I'm doing here.
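To see the "one element at a time" behaviour without needing a 100MB file, here's a small sketch that runs the same pattern over an in-memory document. The StreamElements variant below is hypothetical (it takes the XML as a string rather than a URI), and the materialised counter and the sample data are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class LazyStreamingDemo
{
    static int materialised; // how many XElements have actually been built

    // Hypothetical variant of StreamElements that reads from an XML string.
    static IEnumerable<XElement> StreamElements(string xml, string name)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == name)
                {
                    materialised++;
                    // ReadFrom leaves the reader on the node after the element.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    static void Main()
    {
        string xml =
            "<data>" +
            "<customers CustomerID=\"A\" Country=\"UK\" />" +
            "<customers CustomerID=\"B\" Country=\"Germany\" />" +
            "<customers CustomerID=\"C\" Country=\"UK\" />" +
            "</data>";

        var query = from c in StreamElements(xml, "customers")
                    where (string)c.Attribute("Country") == "UK"
                    select c;

        Console.WriteLine(materialised);                          // 0, nothing read yet
        XElement first = query.First();
        Console.WriteLine((string)first.Attribute("CustomerID")); // A
        Console.WriteLine(materialised);                          // 1, the reader stopped early
        Console.WriteLine(first.Parent == null);                  // True, no surrounding tree
    }
}
```

The last line hints at why this only suits forward-only queries: each streamed element is a detached fragment, so anything that navigates to a parent or a sibling has nothing to navigate to.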
Writing XML
Now, imagine that I want to take the Customers_100MB.xml file and transform it in some fashion. I've already got a reasonably nice way of streaming the file into my program so what happens if I transform it with code such as;
StartCounters();
XElement newXML =
    new XElement("customers",
        from element in StreamElements("Customers_100MB.xml", "customers")
        where (string)element.Attribute("Country") == "UK"
        select new XElement("customer",
            new XElement("contactDetails",
                new XAttribute("contactName", (string)element.Attribute("ContactName")),
                new XAttribute("contactTitle", (string)element.Attribute("ContactTitle")))));
PrintCounters("Creating new XML tree");
newXML.Save("c:\\temp\\newXml.xml");
So, here we're taking the <customers/> elements that come back from the input file, keeping the ones where Country is "UK", and writing a new file from them that looks like;
<customers><customer><contactDetails contactName="..." contactTitle="..."/>
and the size of this output file is going to be smaller than 100MB because we're not repeating the entire input file but it's still going to be quite a big file. If I run the code I see;
After Creating new XML tree
Elapsed time 3977ms
Additional memory used 7224KB
Now, remembering that I used ~2MB to stream in the input file in the first case this looks like writing out the file has cost an additional ~5MB.
But, in this case, it wasn't really necessary to create the tree referenced by the variable newXML at all. That tree is only used to persist the XML to disk and so it wasn't really necessary to bring the tree into memory (even though I bring it in via streaming).
So, it'd be better if we could stream the XML out in the same way that we're streaming it in. There's a class, XStreamingElement that helps with that and I can use it as below;
StartCounters();
XStreamingElement newXML =
    new XStreamingElement("customers",
        from element in StreamElements("Customers_100MB.xml", "customers")
        where (string)element.Attribute("Country") == "UK"
        select new XElement("customer",
            new XElement("contactDetails",
                new XAttribute("contactName", (string)element.Attribute("ContactName")),
                new XAttribute("contactTitle", (string)element.Attribute("ContactTitle")))));
PrintCounters("Creating new XML tree");
newXML.Save("c:\\temp\\newXml.xml");
and the change from the former code is to switch XElement to XStreamingElement and that second class doesn't build a tree in memory here - it runs the query at the point where we hit Save and so it both streams data in and out.
Running this code gives me;
After Creating new XML tree
Elapsed time 15ms
Additional memory used 428KB
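Now, these numbers flatter the streaming version. PrintCounters() is called before Save() and, because XStreamingElement doesn't run the query until Save(), the figures above only cover constructing the XStreamingElement; no XML has been read or written at that point. That's how I end up with ~400KB here when it cost me ~2MB just to stream the thing into memory and execute query.Count(). A like-for-like comparison would need PrintCounters() moved to after Save() in both versions, but the design point still stands: "stream in and stream out" avoids fully building the tree in memory.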
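The deferred behaviour of XStreamingElement is easy to check on a tiny example. The sketch below is not the code from this post: the ContactNames() source and the produced counter are hypothetical stand-ins for the streamed input, used only to show when the query actually gets enumerated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class DeferredSaveDemo
{
    static int produced; // how many source items have been pulled so far

    // Hypothetical stand-in for a streamed input such as StreamElements().
    static IEnumerable<string> ContactNames()
    {
        foreach (string name in new[] { "Maria Anders", "Thomas Hardy" })
        {
            produced++;
            yield return name;
        }
    }

    static void Main()
    {
        var streaming = new XStreamingElement("customers",
            from name in ContactNames()
            select new XElement("customer",
                new XAttribute("contactName", name)));

        Console.WriteLine(produced); // 0, constructing it did not run the query

        // Serialising (ToString here, Save in the post) is what enumerates the query.
        string xml = streaming.ToString(SaveOptions.DisableFormatting);
        Console.WriteLine(produced); // 2
        Console.WriteLine(xml);

        // By contrast, the XElement constructor enumerates its content immediately.
        var tree = new XElement("customers",
            from name in ContactNames()
            select new XElement("customer",
                new XAttribute("contactName", name)));
        Console.WriteLine(produced); // 4
    }
}
```

This is also why any timing or memory measurement for the XStreamingElement version needs to be taken after the call that serialises it.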
Again, this wouldn't work for every query by any means because it depends on reading the XML in a forward-only fashion.
What About VB?
Using C# iterators was a nice way to do this but VB 9 doesn't have iterators, so what do you do? Here's a bit of VB doing the same thing as the (non-streaming) querying code I originally wrote;
StartCounters()
Dim data As XElement = XElement.Load("Customers_100MB.xml")
Dim query = From c In data...<customers> _
            Where c.@Country = "UK" _
            Select c
Dim count As Integer = query.Count()
PrintCounters("After querying")
Console.WriteLine("Result is {0}", count)
Now, I need a streaming function similar to the StreamElements that we wrote in C# previously so I end up having to build an enumerator;
Private Function StreamElements(ByVal file As String, ByVal element As String) As IEnumerable(Of XElement)
Return (New XElementEnumerable(file, element))
End Function
And then I just wrote (along with the VB editor which wrote some of this code) a quick enumerable/enumerator (not claiming that this is 100% correct but it works in this case);
Imports System.Xml
Imports System.Xml.Linq

Public Class XElementEnumerable
    Implements IEnumerable(Of XElement)

    Private Class ElementEnumerator
        Implements IEnumerator(Of XElement)

        Private elementName As String
        Private reader As XmlReader
        Private currentElement As XElement
        Private disposedValue As Boolean = False ' To detect redundant calls

        Public Sub New(ByVal file As String, ByVal elementName As String)
            Me.elementName = elementName
            reader = XmlReader.Create(file)
            reader.MoveToContent()
        End Sub

        Public ReadOnly Property Current() As System.Xml.Linq.XElement Implements System.Collections.Generic.IEnumerator(Of System.Xml.Linq.XElement).Current
            Get
                ' Return the element read by the last MoveNext(). Reading from the
                ' XmlReader here would advance it every time Current was accessed.
                Return currentElement
            End Get
        End Property

        Public ReadOnly Property Current1() As Object Implements System.Collections.IEnumerator.Current
            Get
                Return currentElement
            End Get
        End Property

        Public Function MoveNext() As Boolean Implements System.Collections.IEnumerator.MoveNext
            While Not reader.EOF
                If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = elementName Then
                    ' ReadFrom leaves the reader on the node after the element,
                    ' so no further Read() is needed before the next test.
                    currentElement = CType(XElement.ReadFrom(reader), XElement)
                    Return True
                End If
                reader.Read()
            End While
            currentElement = Nothing
            Return False
        End Function

        Public Sub Reset() Implements System.Collections.IEnumerator.Reset
            Throw New NotSupportedException("Can not reset this enumerator")
        End Sub

        ' IDisposable
        Protected Overridable Sub Dispose(ByVal disposing As Boolean)
            If Not Me.disposedValue Then
                If disposing Then
                    ' Close the underlying reader (and the file it has open).
                    reader.Close()
                End If
            End If
            Me.disposedValue = True
        End Sub

        Public Sub Dispose() Implements IDisposable.Dispose
            Dispose(True)
            GC.SuppressFinalize(Me)
        End Sub
    End Class

    Private file As String
    Private elementName As String

    Public Sub New(ByVal file As String, ByVal elementName As String)
        Me.file = file
        Me.elementName = elementName
    End Sub

    Public Function GetEnumerator() As System.Collections.Generic.IEnumerator(Of System.Xml.Linq.XElement) _
        Implements System.Collections.Generic.IEnumerable(Of System.Xml.Linq.XElement).GetEnumerator
        Return New ElementEnumerator(file, elementName)
    End Function

    Public Function GetEnumerator1() As System.Collections.IEnumerator _
        Implements System.Collections.IEnumerable.GetEnumerator
        ' The non-generic version just hands back the generic enumerator.
        Return GetEnumerator()
    End Function
End Class
and that does the trick in VB and gets me back to streaming my input XML into my program.
Posted Sat, Sep 8 2007 6:20 AM by mtaulty