XPath Predicates and Functions

Chapter focus: master the filtering power of XPath predicates ([]), the common functions (position()/last()/count()/contains()/starts-with()/not()), and the union operator (|).

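Definition and Purpose

A predicate is a filter expression written inside square brackets [] that further narrows a node set that has already been selected.

XPath functions provide built-in computations: counting, string comparison, arithmetic, boolean tests, and more. Combined with predicates, they let you build very precise node-selection conditions.

If a path expression is "walking the route", a predicate is "screening the people" and a function is "the measuring stick".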
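To make the three roles concrete, here is a minimal sketch using lxml. The two-book catalog is a hypothetical stand-in, inlined so the snippet runs on its own.

```python
from lxml import etree

# Hypothetical two-book catalog, inlined so the snippet is self-contained.
root = etree.fromstring(
    "<bookstore>"
    "<book category='web'><title>Learning XML</title><price>39.95</price></book>"
    "<book category='cooking'><title>Everyday Italian</title><price>30.00</price></book>"
    "</bookstore>"
)

# Path expression alone: every book title.
all_titles = root.xpath("//book/title/text()")

# Path + predicate: only books whose category attribute equals 'web'.
web_titles = root.xpath("//book[@category='web']/title/text()")

# Function: count() returns a number, which lxml hands back as a Python float.
book_count = root.xpath("count(//book)")

print(all_titles)   # ['Learning XML', 'Everyday Italian']
print(web_titles)   # ['Learning XML']
print(book_count)   # 2.0
```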

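Core Principle: Step-by-Step Refinement of a Node Set

Diagram note: predicates filter a node set layer by layer. Different predicates applied to the same node set yield different subsets, which can then be merged with the | operator.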
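The layered filtering can be sketched as follows. The three-book document and its single-letter titles are made up for illustration: chained predicates narrow the set (a logical AND), while | merges two subsets (a logical OR).

```python
from lxml import etree

# Hypothetical data: titles A, B, C are placeholders.
root = etree.fromstring("""
<bookstore>
  <book category="web"><title>A</title><price>20</price></book>
  <book category="web"><title>B</title><price>60</price></book>
  <book category="cooking"><title>C</title><price>70</price></book>
</bookstore>
""")

# Predicates chain left to right; each one filters the output of the previous one.
web_and_pricey = root.xpath("//book[@category='web'][price>50]/title/text()")

# Two different subsets of the same node set, merged with the union operator.
web_or_pricey = root.xpath(
    "(//book[@category='web'] | //book[price>50])/title/text()"
)

print(web_and_pricey)  # ['B']
print(web_or_pricey)   # ['A', 'B', 'C']
```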

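Syntax and Structure Essentials

Common Predicate Patterns

Predicate	Meaning	Example
`[n]`	The n-th node (1-based)	`//book[1]`
`[last()]`	The last node	`//book[last()]`
`[last()-1]`	The second-to-last node	`//book[last()-1]`
`[position()<3]`	The first 2 nodes	`//book[position()<3]`
`[@attr]`	Nodes that have the attribute	`//book[@category]`
`[@attr='val']`	Nodes whose attribute equals a value	`//book[@category='web']`
`[element>val]`	Comparison on a child element's value	`//book[price>35]`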
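As a quick check of each pattern in the table, the sketch below runs them against a hypothetical four-book catalog (titles T1 to T4 are placeholders, and the fourth book deliberately has no category attribute). The titles() helper is an illustrative convenience, not part of lxml.

```python
from lxml import etree

# Hypothetical catalog; all four books are siblings under one parent.
root = etree.fromstring("""
<bookstore>
  <book category="web"><title>T1</title><price>39.95</price></book>
  <book category="cooking"><title>T2</title><price>30.00</price></book>
  <book category="web"><title>T3</title><price>55.00</price></book>
  <book><title>T4</title><price>12.50</price></book>
</bookstore>
""")

def titles(predicate):
    """Return the titles of the books selected by //book plus the given predicate."""
    return root.xpath(f"//book{predicate}/title/text()")

first = titles("[1]")                 # ['T1']
last = titles("[last()]")             # ['T4']
second_last = titles("[last()-1]")    # ['T3']
first_two = titles("[position()<3]")  # ['T1', 'T2']
has_category = titles("[@category]")  # ['T1', 'T2', 'T3']
web = titles("[@category='web']")     # ['T1', 'T3']
over_35 = titles("[price>35]")        # ['T1', 'T3']

for name, value in [("first", first), ("last", last), ("over_35", over_35)]:
    print(name, value)
```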

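Common Functions

Function	Return type	Purpose	Example
`position()`	number	Position of the current node within the context node set	`[position()=1]`
`last()`	number	Position of the last node in the set (the context size)	`[last()]`
`count()`	number	Counts the nodes in a node set	`count(//book)`
`contains()`	boolean	Tests whether a string contains a substring	`[contains(title,'XML')]`
`starts-with()`	boolean	Tests whether a string starts with a prefix	`[starts-with(@id,'S')]`
`not()`	boolean	Logical negation	`[not(@category)]`
`string-length()`	number	Length of a string	`[string-length(title)>5]`
`normalize-space()`	string	Strips leading and trailing whitespace and collapses internal whitespace runs into single spaces	`normalize-space(title)`
`sum()`	number	Sums the numeric values of a node set	`sum(//price)`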
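The return types in the table map directly onto Python values in lxml: numbers arrive as float, strings as str, and node sets as lists. A minimal sketch, assuming a made-up two-book document whose ids and padded title exist only to exercise the functions:

```python
from lxml import etree

# Hypothetical data: the first title carries extra whitespace on purpose.
root = etree.fromstring("""
<library>
  <book id="S01"><title>  Learning   XML </title><price>10</price></book>
  <book id="T02" category="web"><title>XSLT</title><price>20.5</price></book>
</library>
""")

book_count = root.xpath("count(//book)")                        # 2.0
total = root.xpath("sum(//price)")                              # 30.5
has_xml = root.xpath("//book[contains(title,'XML')]/@id")       # ['S01']
s_ids = root.xpath("//book[starts-with(@id,'S')]/@id")          # ['S01']
uncategorized = root.xpath("//book[not(@category)]/@id")        # ['S01']
clean_title = root.xpath("normalize-space(//book[1]/title)")    # 'Learning XML'
clean_length = root.xpath(
    "string-length(normalize-space(//book[1]/title))"
)                                                               # 12.0

print(book_count, total, has_xml, s_ids, uncategorized, clean_title, clean_length)
```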

Union Operator `|`

//book[@category='web'] | //book[price>35]

Merges two node sets. Duplicates are removed automatically, and the result is returned in document order.
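A small sketch of both properties, using a hypothetical three-book document with made-up ids. Book b3 matches both operands but appears once, and the result follows document order even though the "price" operand is written first.

```python
from lxml import etree

# Hypothetical data with explicit ids so the result order is easy to read.
root = etree.fromstring("""
<bookstore>
  <book id="b1" category="web"><price>20</price></book>
  <book id="b2" category="cooking"><price>60</price></book>
  <book id="b3" category="web"><price>70</price></book>
</bookstore>
""")

web = root.xpath("//book[@category='web']")   # b1, b3
pricey = root.xpath("//book[price>50]")       # b2, b3

# Operand order is reversed on purpose; the result is still in document order.
union = root.xpath("//book[price>50] | //book[@category='web']")
union_ids = [book.get("id") for book in union]

print(len(web), len(pricey), union_ids)  # 2 2 ['b1', 'b2', 'b3']
```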

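Complete Example: Bai Ge Builds a Data Report with Predicates

Scenario

Bai Ge, an architect at Feixiang Technology (飞翔科技), needs business statistics from the book catalog: "Which three books have the highest price?", "What is the average price of all web books?", and "How many books have a title longer than 10 characters?"

XML Data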

<bookstore>
  <book category="web">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
  <book category="web">
    <title lang="en">XSLT Programmer's Reference</title>
    <price>49.99</price>
  </book>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter and the Philosopher's Stone</title>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">Web Services Architecture</title>
    <price>55.00</price>
  </book>
</bookstore>

XPath Queries

from lxml import etree

tree = etree.parse("bookstore.xml")

# First 3 books in document order (XPath 1.0 cannot sort by price)
top3 = tree.xpath("//book[position()<=3]/title/text()")
print(f"First 3 books: {top3}")

# Books priced above 50 (a threshold filter, not a true maximum)
most_expensive = tree.xpath("//book[price>50]/title/text()")
print(f"Price > 50: {most_expensive}")

# Number of web books (count() returns a float)
web_count = tree.xpath("count(//book[@category='web'])")
print(f"Web book count: {web_count}")

# Titles longer than 10 characters
long_titles = tree.xpath(
    "//book[string-length(normalize-space(title))>10]/title/text()"
)
print(f"Titles longer than 10 characters: {long_titles}")

# Total price of all books, formatted to 2 decimals to hide float rounding noise
total = tree.xpath("sum(//book/price)")
print(f"Total price: {total:.2f}")

# Search with contains()
xml_books = tree.xpath("//book[contains(title,'XML')]/title/text()")
print(f"Books containing XML: {xml_books}")

# Union: web books or books priced above 40
combined = tree.xpath("//book[@category='web'] | //book[price>40]")
print(f"Web or expensive: {len(combined)} books")

Output

First 3 books: ['Learning XML', "XSLT Programmer's Reference", 'Everyday Italian']
Price > 50: ['Web Services Architecture']
Web book count: 3.0
Titles longer than 10 characters: ['Learning XML', "XSLT Programmer's Reference", 'Everyday Italian', "Harry Potter and the Philosopher's Stone", 'Web Services Architecture']
Total price: 204.93
Books containing XML: ['Learning XML']
Web or expensive: 3 books

With 7 XPath queries, Bai Ge completed an analysis that would otherwise have required around 50 lines of DOM traversal code.
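The three questions in the scenario can also be answered directly. The following is one possible sketch, not the only approach: XPath 1.0 has no sorting, so the price ranking is done in Python, while the average and the count stay in XPath. The catalog is inlined in compact form (without the lang attributes) so the snippet runs without bookstore.xml.

```python
from lxml import etree

# Compact inline copy of the five-book catalog from this example.
root = etree.fromstring("""
<bookstore>
  <book category="web"><title>Learning XML</title><price>39.95</price></book>
  <book category="web"><title>XSLT Programmer's Reference</title><price>49.99</price></book>
  <book category="cooking"><title>Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title>Harry Potter and the Philosopher's Stone</title><price>29.99</price></book>
  <book category="web"><title>Web Services Architecture</title><price>55.00</price></book>
</bookstore>
""")

# Q1: three highest-priced books. Select with XPath, rank in Python.
books = root.xpath("//book")
by_price = sorted(books, key=lambda b: float(b.findtext("price")), reverse=True)
top3 = [b.findtext("title") for b in by_price[:3]]

# Q2: average price of web books = sum div count over the same predicate.
web_avg = root.xpath(
    "sum(//book[@category='web']/price) div count(//book[@category='web'])"
)

# Q3: number of books whose title is longer than 10 characters.
long_title_count = root.xpath(
    "count(//book[string-length(normalize-space(title))>10])"
)

print(top3)
print(f"{web_avg:.2f}")   # 48.31
print(long_title_count)   # 5.0
```

All five titles are longer than 10 characters ("Learning XML" alone has 12), which is why the count is 5.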

Common Pitfalls

Pitfall 1: Predicate indexes start at 1 (not 0)

//book[1]   ← selects the first book, not the "zeroth"
//book[0]   ← selects nothing (XPath positions start at 1)
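The off-by-one trap is easy to hit from Python, because the list that lxml returns is 0-based while the predicate inside the expression is 1-based. A minimal sketch on a made-up two-book document:

```python
from lxml import etree

# Hypothetical minimal document.
root = etree.fromstring("<r><book>A</book><book>B</book></r>")

first = root.xpath("//book[1]/text()")    # XPath position 1 -> ['A']
zeroth = root.xpath("//book[0]/text()")   # no node has position 0 -> []

# The Python list returned by xpath() is indexed from 0 as usual.
also_first = root.xpath("//book/text()")[0]

print(first, zeroth, also_first)  # ['A'] [] A
```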

Pitfall 2: A bare name inside `[]` refers to a child element

//book[price>35]

Here price is a child element of book, not an attribute. To compare an attribute instead, write [@price>35].
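The difference shows up clearly on a hypothetical document in which one book stores its price as an attribute and the other as a child element (the titles "attr" and "child" are just labels for this sketch):

```python
from lxml import etree

# Hypothetical data: same price, stored two different ways.
root = etree.fromstring(
    "<r>"
    "<book price='40'><title>attr</title></book>"
    "<book><title>child</title><price>40</price></book>"
    "</r>"
)

by_child = root.xpath("//book[price>35]/title/text()")   # matches the child element
by_attr = root.xpath("//book[@price>35]/title/text()")   # matches the attribute

print(by_child)  # ['child']
print(by_attr)   # ['attr']
```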

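Pitfall 3: `[1]` and `position()` depend on the context in which the position is counted

A numeric predicate [1] is shorthand for [position()=1], so the two forms always select the same nodes. What changes the result is the context: //book[1] selects every book that is the first book child of its parent, whereas (//book)[1] selects only the first book in the whole document. Likewise, inside xsl:for-each a bare position() refers to the current node's position in the list being iterated, not to its position among its siblings.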
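A short sketch of how context affects positional predicates, assuming a hypothetical store with two shelves of two books each. The short form [1] and the long form [position()=1] return the same nodes, while wrapping the path in parentheses changes what "first" means.

```python
from lxml import etree

# Hypothetical data: two parents, each with its own "first" book.
root = etree.fromstring(
    "<store>"
    "<shelf><book>A1</book><book>A2</book></shelf>"
    "<shelf><book>B1</book><book>B2</book></shelf>"
    "</store>"
)

# Position is counted per parent: the first book on every shelf.
short_form = root.xpath("//book[1]/text()")
long_form = root.xpath("//book[position()=1]/text()")

# Parentheses build the full node set first, then pick its first member.
doc_first = root.xpath("(//book)[1]/text()")

print(short_form)  # ['A1', 'B1']
print(long_form)   # ['A1', 'B1']
print(doc_first)   # ['A1']
```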

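Interview Topics

Topic	Key points for the answer
What is the purpose of an XPath predicate?	It filters a node set. The syntax is `[expression]`, and positional indexes start at 1. Predicates support attribute tests, comparisons on child element values, and function calls
Typical uses of contains() and starts-with()?	contains() is used for fuzzy search (for example, finding books whose title contains "XML"); starts-with() is used for prefix filtering (for example, all students whose ID starts with S)
Purpose and caveats of the `|` operator?	It merges two node sets; duplicates are removed automatically and the result is in document order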