博客 Pyspark DataFrame中Column使用

Pyspark DataFrame中Column使用

   数栈君   发表于 2024-11-28 10:30  336  0

pyspark dataframe Column

alias 重命名列(name)

df = spark.createDataFrame(
[(2, "Alice"), (5, "Bob")], ["age", "name"])
df.select(df.age.alias("age2")).show()
+----+
|age2|
+----+
| 2|
| 5|
+----+

astype alias cast 修改列类型

data.schema
StructType([StructField('name', StringType(), True), StructField('age', StringType(), True), StructField('id', StringType(), True), StructField('gender', StringType(), True)])
data.printSchema()
root
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)

# 增加一列使用cast修改类型
from pyspark.sql.types import LongType
data.withColumn('age2',data['age'].cast(LongType())).show()
+-----+---+---+------+----+
| name|age| id|gender|age2|
+-----+---+---+------+----+
| ldsx| 12| 1| 男| 12|
|test1| 20| 1| 女| 20|
|test2| 26| 1| 男| 26|
|test3| 19| 1| 女| 19|
|test4| 51| 1| 女| 51|
|test5| 13| 1| 男| 13|
+-----+---+---+------+----+
# 查看scheam
data.withColumn('age2',data['age'].cast(LongType())).schema
StructType([StructField('name', StringType(), True), StructField('age', StringType(), True), StructField('id', StringType(), True), StructField('gender', StringType(), True), StructField('age2', LongType(), True)])
data.withColumn('age2',data['age'].cast(LongType())).printSchema()
root
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
|-- age2: long (nullable = true)

# 结合别名使用
data.select(df.age.cast("string").alias('ages')).collect()

between 筛选

df.select('name',df.age.between(1,5)).show()
+-----+---------------------------+
| name|((age >= 1) AND (age <= 5))|
+-----+---------------------------+
|Alice| true|
| Bob| true|
+-----+---------------------------+
df.select('name',df.age.between(1,5).alias('ldsx')).show()
+-----+----+
| name|ldsx|
+-----+----+
|Alice|true|
| Bob|true|
+-----+----+

# filter 通过bool值过滤
df.filter(df.age.between(1,5)).show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+

bitwiseAND 按位运

df = spark.createDataFrame([Row(a=170, b=75)])
df.show()
+---+---+
| a| b|
+---+---+
|170| 75|
+---+---+
# 列于列按位计算
df.select(df.a.bitwiseAND(df.b)).show()
+-------+
|(a & b)|
+-------+
| 10|
+-------+

contains 包含元素

包含其他元素。根据字符串匹配返回一个布尔列。

data.select(data.name.contains('test')).show()
+--------------------+
|contains(name, test)|
+--------------------+
| false|
| true|
| true|
| true|
| true|
| true|
+--------------------+

data.filter(data.name.contains('test')).show()
+-----+---+---+------+
| name|age| id|gender|
+-----+---+---+------+
|test1| 20| 1| 女|
|test2| 26| 1| 男|
|test3| 19| 1| 女|
|test4| 51| 1| 女|
|test5| 13| 1| 男|
+-----+---+---+------+

withField 字段表达式操作

按名称添加/替换StructType中字段的表达式。按名称添加/替换StructType中字段的表达式。

df = spark.createDataFrame([Row(a=Row(b=1, c=2))])
df.show()
+------+
| a|
+------+
|{1, 2}|
+------+

# 替换df中a类中row对象b的值
df.withColumn('a',df['a'].withField('b',lit(10086))).show()
+----------+
| a|
+----------+
|{10086, 2}|
+----------+

dropFields 删除字段

StructType中字段的表达式。如果架构不包含字段名,则此操作无效。

df = spark.createDataFrame([Row(a=Row(b=1, c=2, d=3, e=Row(f=4, g=5, h=6)))])
df.show()
+--------------------+
| a|
+--------------------+
|{1, 2, 3, {4, 5, 6}}|
+--------------------+

# 修改a列 处理row对象中 b,c
df.withColumn('a', df['a'].dropFields('b', 'c')).show()
+--------------+
| a|
+--------------+
|{3, {4, 5, 6}}|
+--------------+

# 处理row对象中的嵌套内容
df.withColumn("a", col("a").dropFields("e.g", "e.h")).show()
+--------------+
| a|
+--------------+
|{1, 2, 3, {4}}|
+--------------+

getField 获取列

在StructType中按名称获取字段的表达式

df = spark.createDataFrame([Row(r=Row(a=1, b="b"))])
df.show()
+------+
| r|
+------+
|{1, b}|
+------+
df.select(df.r.getField("b")).show()
+---+
|r.b|
+---+
| b|
+---+
df.select(df.r.b).show()
+---+
|r.b|
+---+
| b|
+---+

startswitch 字符串开头

字符串以开头。根据字符串匹配返回一个布尔列。

endswith 字符串结尾

字符串以结尾。根据字符串匹配返回一个布尔列。

df = spark.createDataFrame(
[(2, "Alice"), (5, "Bob")], ["age", "name"])
PyDev console: starting.
df.show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+
df.filter(df.name.endswith('ice')).collect()
[Row(age=2, name='Alice')]


df.select(df.name.endswith('ice')).show()
+-------------------+
|endswith(name, ice)|
+-------------------+
| true|
| false|
+-------------------+

like 模糊匹配

df.filter(df.name.ilike('ali%')).show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
+---+-----+

df.filter(df.name.like('ali%')).show()
+---+----+
|age|name|
+---+----+
+---+----+

ilike 模糊匹配不分大小写

不分大小写

df = spark.createDataFrame(
[(2, "Alice"), (5, "Bob")], ["age", "name"])
df.filter(df.name.ilike('%Ice')).show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
+---+-----+

df.select(df.name.ilike('%Ice')).show()
+-----------------+
|ilike(name, %Ice)|
+-----------------+
| true|
| false|
+-----------------+

rlike 正则匹配

df = spark.createDataFrame(
[(2, "Alice"), (5, "Bob")], ["age", "name"])
df.filter(df.name.rlike('ice$')).collect()
[Row(age=2, name='Alice')]

isNotNull 列不为控制

df = spark.createDataFrame([Row(name='Tom', height=80), Row(name='Alice', height=None)])

df.filter(df.height.isNotNull()).show()
+----+------+
|name|height|
+----+------+
| Tom| 80|
+----+------+

df.select(df.height.isNotNull()).show()
+--------------------+
|(height IS NOT NULL)|
+--------------------+
| true|
| false|
+--------------------+

isNull 列为空

df = spark.createDataFrame([Row(name='Tom', height=80), Row(name='Alice', height=None)])

df.filter(df.height.isNull()).show()
+-----+------+
| name|height|
+-----+------+
|Alice| null|
+-----+------+
df.select(df.height.isNull()).show()
+----------------+
|(height IS NULL)|
+----------------+
| false|
| true|
+----------------+

isin 根据范围筛选

df.select( df.age.isin([1,2,3])).show()
+------------------+
|(age IN (1, 2, 3))|
+------------------+
| true|
| false|
+------------------+
df.filter( df.age.isin([1,2,3])).show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
+---+-----+

when 与 otherwise 配合使用

如果未调用Column.otherwise(),则对于不匹配的条件将返回None

df = spark.createDataFrame(
[(2, "Alice"), (5, "Bob")], ["age", "name"])

df.show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+

# 查询条件进行筛选,当when不配合otherwise 默认使用null代替
df.select(df.name, when(df.age > 3, 1)).show()
+-----+------------------------------+
| name|CASE WHEN (age > 3) THEN 1 END|
+-----+------------------------------+
|Alice| null|
| Bob| 1|
+-----+------------------------------+

# 使用otherwise 的条件代替null
df.select(df.name, when(df.age > 3, 1).otherwise(0)).show()
+-----+-------------------------------------+
| name|CASE WHEN (age > 3) THEN 1 ELSE 0 END|
+-----+-------------------------------------+
|Alice| 0|
| Bob| 1|
+-----+-------------------------------------+

substr 返回列的字串

索引0,1都为从首位开始 (start,end)

df.show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+

# 返回
df.select(df.name.substr(1, 2)).show()
+---------------------+
|substring(name, 1, 2)|
+---------------------+
| Al|
| Bo|
+---------------------+

df.select(df.name.substr(2, 10).alias("col")).show()
+----+
| col|
+----+
|lice|
| ob|
+----+

本文系转载,版权归原作者所有,如若侵权请联系我们进行删除!

《数据资产管理白皮书》下载地址:
《行业指标体系白皮书》下载地址:
《数据治理行业实践白皮书》下载地址:
《数栈V6.0产品白皮书》下载地址:
想了解或咨询更多有关袋鼠云大数据产品、行业解决方案、客户案例的朋友,浏览袋鼠云官网:
同时,欢迎对大数据开源项目有兴趣的同学加入「袋鼠云开源框架钉钉技术群」,交流最新开源技术信息,群号码:30537511,项目地址:
0条评论
社区公告
  • 大数据领域最专业的产品&技术交流社区,专注于探讨与分享大数据领域有趣又火热的信息,专业又专注的数据人园地

最新活动更多
微信扫码获取数字化转型资料
钉钉扫码加入技术交流群