Splitting a Large XML File with XmlTextReader

Tuesday, July 13, 2010

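       Not long ago I downloaded the main data dump of the Chinese Wikipedia. After decompressing it, I found it was a full 900 MB. An XML file that large simply cannot be opened unless you have an enormous amount of memory. So I decided to split it into a number of smaller XML files, each of which keeps the original XML structure.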

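       First, analyze the structure of the XML. Then read the file with XmlTextReader, C#'s efficient forward-only XML reader, and write each chunk to a new, smaller file with XmlTextWriter. After that, it is just a matter of writing the code.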
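The structure-analysis step can be done without loading the whole file. The following is a minimal sketch, not the author's code: it streams through the file and prints the names of the first few elements, indented by nesting depth. The names `XmlOutline`, `Print`, and `maxElements` are made up for this illustration.

```csharp
using System;
using System.IO;
using System.Xml;

static class XmlOutline
{
    // Writes the names of the first maxElements start tags to output,
    // indented two spaces per nesting level. Only the part of the file
    // that is actually read is ever held in memory.
    public static void Print(string inputPath, int maxElements, TextWriter output)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };
        using (XmlReader reader = XmlReader.Create(inputPath, settings))
        {
            int seen = 0;
            while (seen < maxElements && reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    output.WriteLine(new string(' ', reader.Depth * 2) + reader.Name);
                    seen++;
                }
            }
        }
    }
}
```

Calling `XmlOutline.Print("pages-articles.xml", 50, Console.Out)` would list the first 50 elements, which should be enough to see how one page entry is laid out before hard-coding the element order in the splitter.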

using System;
using System.Xml;
using System.IO;
using System.Text;

namespace data
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Splitting the XML file, please wait...");
            DateTime starttime = DateTime.Now;
            XmlTextReader xr = new XmlTextReader("pages-articles.xml");
            xr.WhitespaceHandling = WhitespaceHandling.None;
            xr.MoveToContent();
            //StreamWriter sw = File.CreateText("out.txt");
            xr.Read();
            xr.Skip();// skip the unneeded node
            int n=1;
            StreamWriter indexer = File.CreateText("indexer.txt");// build the text index
            while (!xr.EOF)
            //for (int j = 1; j <= 20; j++)
            {
                XmlTextWriter xw = new XmlTextWriter("../db/" + XmlConvert.ToString(n) + ".xml", xr.Encoding);// create the next output file (the ../db/ directory must already exist)
                xw.WriteStartDocument();
                xw.WriteStartElement("file");
                xw.WriteAttributeString("id", XmlConvert.ToString(n));

                for (int i = 0; i < 200; i++)
                {
                    if (!xr.EOF)
                    {
                        xr.Read();
                        xw.WriteStartElement("page");
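                        // Note: ReadInnerXml returns raw markup, so an entity such as &amp; in the
                        // source is escaped a second time by WriteElementString.
                        string temp1 = xr.ReadInnerXml();// record the title (keyword), used to build the index
                        xw.WriteElementString("title", temp1);
                        string temp2 = xr.ReadInnerXml();// record the page ID, used to build the index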
                        xw.WriteElementString("id", temp2);
                        if (temp1 != "")
                        {
                            indexer.WriteLine("{0}|{1}|{2}", temp1, n, temp2);
                        }

                        if (xr.Name == "restrictions")
                        {
                            xw.WriteElementString("restrictions", xr.ReadInnerXml());
                        }

                        xr.Read();
                        xw.WriteStartElement("revision");
                        xw.WriteElementString("id", xr.ReadInnerXml());
                        xw.WriteElementString("timestamp", xr.ReadInnerXml());
                        xr.Read();
                        xw.WriteStartElement("contributor");
                        xw.WriteElementString("username", xr.ReadInnerXml());
                        xw.WriteElementString("id", xr.ReadInnerXml());
                        xr.Read();
                        xw.WriteEndElement();
                        xw.WriteElementString("minor", xr.ReadInnerXml());
                        xw.WriteElementString("comment", xr.ReadInnerXml());
                        xw.WriteElementString("text", xr.ReadInnerXml());
                        xw.WriteEndElement();
                        xr.Read();
                        xr.Skip();
                        xw.WriteEndElement();
                    }
                    else
                    {
                        break;
                    }
                }
                //xw.WriteStartElement("page");
                //xw.WriteString(xr.ReadInnerXml());
                //xw.WriteEndElement();
                xw.WriteEndDocument();
                xw.Flush();
                xw.Close();

                n++;
            }
            /*for (int i = 0; i < 200; i++)
            {
                Console.WriteLine(xr.Name);
                //if (xr.NodeType == XmlNodeType.Text)
                //{
                //   sw.WriteLine(xr.Value);
                //}
                xr.Read();
            }*/
            xr.Close();
            indexer.Close();// flush and close the index file so its last lines are not lost
            DateTime endtime = DateTime.Now;
            //sw.Close();
            Console.Write("Split complete, elapsed time: {0}", endtime - starttime);
            Console.Read();
        }
    }
}

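        I did not expect that splitting such a large file into more than 1,500 files would take only a little over 40 seconds, which shows how efficient XmlTextReader is.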
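The program above copies each page field by field, so it depends on the exact order of elements in the dump. As an alternative, here is a hedged sketch of the same streaming idea that copies every `page` element wholesale with `XmlWriter.WriteNode`. This is a different technique from the one in the post. The names `XmlSplitter`, `Split`, and `pagesPerFile` are hypothetical, and the output is written with the writer's default settings (UTF-8) rather than the reader's encoding.

```csharp
using System;
using System.IO;
using System.Xml;

static class XmlSplitter
{
    // Streams inputPath and copies every <page> element into numbered files
    // (1.xml, 2.xml, ...) under outputDir, at most pagesPerFile pages each.
    // Each file gets a <file id="n"> root, as in the post. Returns the file count.
    public static int Split(string inputPath, string outputDir, int pagesPerFile)
    {
        Directory.CreateDirectory(outputDir);
        var readerSettings = new XmlReaderSettings { IgnoreWhitespace = true };

        int fileCount = 0;
        int pagesInFile = 0;
        XmlWriter writer = null;
        try
        {
            using (XmlReader reader = XmlReader.Create(inputPath, readerSettings))
            {
                reader.MoveToContent();
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "page")
                    {
                        if (writer == null)
                        {
                            // Start the next output file lazily, so no empty file is created.
                            fileCount++;
                            string path = Path.Combine(outputDir, fileCount + ".xml");
                            writer = XmlWriter.Create(path);
                            writer.WriteStartDocument();
                            writer.WriteStartElement("file");
                            writer.WriteAttributeString("id", XmlConvert.ToString(fileCount));
                        }

                        // WriteNode copies the whole element and leaves the reader
                        // on the node that follows it, so no Read() is needed here.
                        writer.WriteNode(reader, false);
                        pagesInFile++;

                        if (pagesInFile == pagesPerFile)
                        {
                            writer.WriteEndDocument();
                            writer.Close();
                            writer = null;
                            pagesInFile = 0;
                        }
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
            if (writer != null) writer.WriteEndDocument();
        }
        finally
        {
            if (writer != null) writer.Close();
        }
        return fileCount;
    }
}
```

With the values used in the post, the call would be `XmlSplitter.Split("pages-articles.xml", "../db", 200)`. This version does not build the index file, and its speed on the full dump has not been measured here.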

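I don't have VS2010 or a similar version installed, so I haven't tested this. It should work. I'm saving a copy here and will test it once I have it installed.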
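The program also writes indexer.txt, with one line per page in the form title|file number|page ID. A lookup over that index could be sketched as follows. This is only an illustration under that assumed line format. The names `IndexLookup` and `FindFile` are hypothetical, and the line is parsed from the right because a title may itself contain a `|` character.

```csharp
using System;
using System.IO;

static class IndexLookup
{
    // Scans an index file of "title|fileNumber|pageId" lines and returns the
    // file number recorded for the given title, or -1 if the title is absent.
    public static int FindFile(string indexPath, string title)
    {
        using (var reader = new StreamReader(indexPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Locate the last two separators; everything before them is the title.
                int last = line.LastIndexOf('|');
                if (last <= 0) continue;
                int middle = line.LastIndexOf('|', last - 1);
                if (middle < 0) continue;

                if (line.Substring(0, middle) == title)
                {
                    return int.Parse(line.Substring(middle + 1, last - middle - 1));
                }
            }
        }
        return -1;
    }
}
```

For example, `IndexLookup.FindFile("indexer.txt", someTitle)` returns the number n of the ../db/n.xml file that holds that page. A linear scan is acceptable for occasional lookups. For frequent ones, loading the index into a dictionary once would be the obvious next step.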

