AllenNLP Study Notes

Notes on things I discovered while using AllenNLP and while reading its source code, in no particular order.

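Avoiding circular imports

Placing some imports inside a function or method body, rather than at module level, avoids circular imports: the import statement then runs only when the function is called, by which time the modules involved have finished loading.

For example, the construct_arg function in ~allennlp.common.from_params.py: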

def construct_arg(cls: Type[T],
                  param_name: str,
                  annotation: Type,
                  default: Any,
                  params: Params,
                  **extras) -> Any:

    from allennlp.models.archival import load_archive  # import here to avoid circular imports
    ...  # other code
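The effect can be reproduced outside AllenNLP with two small modules that depend on each other. The following is only a sketch: the module names demo_a and demo_b and their functions are hypothetical, and the files are written to a temporary directory so that the example is self-contained.

```python
import importlib
import os
import sys
import tempfile
import textwrap

# Write two hypothetical modules that depend on each other.
module_dir = tempfile.mkdtemp()

with open(os.path.join(module_dir, "demo_a.py"), "w") as f:
    f.write(textwrap.dedent('''
        from demo_b import helper_b  # module-level import of demo_b

        def helper_a():
            return "a"

        def use_b():
            return helper_b()
    '''))

with open(os.path.join(module_dir, "demo_b.py"), "w") as f:
    f.write(textwrap.dedent('''
        def helper_b():
            # Deferred import: demo_a is looked up only when helper_b is
            # called, after demo_a has finished loading.
            from demo_a import helper_a
            return helper_a() + "b"
    '''))

sys.path.insert(0, module_dir)
demo_a = importlib.import_module("demo_a")
result = demo_a.use_b()
print(result)
```

If demo_b imported helper_a at module level instead, importing demo_a would fail with an ImportError, because demo_b would ask for helper_a before demo_a had defined it.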

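Another way to format strings

Until now I usually formatted strings with the % operator or with .format(...).

It turns out that the form f"...{var1}...{var2}..." (an f-string, available since Python 3.6) also produces a formatted string, for example: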

logger.info(f"instantiating class {cls} from params {getattr(params, 'params', params)} "
            f"and extras {set(extras.keys())}")
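For comparison, here is a small sketch that builds the same message in all three styles. The variable names and values are made up for illustration.

```python
name = "Tqdm"
interval = 0.1

# Older styles: the % operator and str.format.
percent_style = "class %s uses interval %.1f" % (name, interval)
format_style = "class {} uses interval {:.1f}".format(name, interval)

# f-string: expressions and format specs go directly inside the braces.
f_string_style = f"class {name} uses interval {interval:.1f}"

# Adjacent f-string literals are concatenated, which is how the
# logger.info call above splits one message across two lines.
split_style = (f"class {name} "
               f"uses interval {interval:.1f}")

print(f_string_style)
```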

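Python decorators

To understand Registrable in AllenNLP, I also studied Python decorators, which I had not understood before.

A decorator is the @ annotation placed on a function, method, or class. It is essentially a function that receives a function, method, or class and returns a function or anything else; the @ notation is syntactic sugar that Python provides.

In form, a decorator is usually defined as two nested functions, like this: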

def decorator(func):
    def inner(args):
        ...  # some code
    return inner

@decorator
def some_function(args):
    ...  # some code

A decorator receives the function, method, or class it annotates and replaces (or "decorates") it with whatever the decorator returns.
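As a concrete illustration, here is a minimal runnable sketch. The decorator count_calls and the function add are hypothetical names chosen for this example; functools.wraps is used so that the replacement function keeps the original's name and docstring.

```python
import functools


def count_calls(func):
    """Decorator that counts how many times func has been called."""
    @functools.wraps(func)
    def inner(*args, **kwargs):
        inner.call_count += 1
        return func(*args, **kwargs)
    inner.call_count = 0
    return inner


@count_calls
def add(x, y):
    return x + y


# The @ line above is shorthand for: add = count_calls(add)
total = add(1, 2)
print(total, add.call_count)
```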

For a decorator tutorial, see https://www.cnblogs.com/Jerry-Chou/archive/2012/05/23/python-decorator-explain.html; for examples, see https://wiki.python.org/moin/PythonDecoratorLibrary

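Wrapping third-party packages

For example, AllenNLP wants to add some global configuration options on top of tqdm.tqdm. It does so as follows (see the ~allennlp.common.tqdm.Tqdm class):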

"""
:class:`~allennlp.common.tqdm.Tqdm` wraps tqdm so we can add configurable
global defaults for certain tqdm parameters.
"""

from tqdm import tqdm as _tqdm
# This is necessary to stop tqdm from hanging
# when exceptions are raised inside iterators.
# It should have been fixed in 4.2.1, but it still
# occurs.
# TODO(Mark): Remove this once tqdm cleans up after itself properly.
# https://github.com/tqdm/tqdm/issues/469
_tqdm.monitor_interval = 0

class Tqdm:
    # These defaults are the same as the argument defaults in tqdm.
    default_mininterval: float = 0.1

    @staticmethod
    def set_default_mininterval(value: float) -> None:
        Tqdm.default_mininterval = value

    @staticmethod
    def set_slower_interval(use_slower_interval: bool) -> None:
        """
        If ``use_slower_interval`` is ``True``, we will dramatically slow down ``tqdm's`` default
        output rate. ``tqdm's`` default output rate is great for interactively watching progress,
        but it is not great for log files. You might want to set this if you are primarily going
        to be looking at output through log files, not the terminal.
        """
        if use_slower_interval:
            Tqdm.default_mininterval = 10.0
        else:
            Tqdm.default_mininterval = 0.1

    @staticmethod
    def tqdm(*args, **kwargs):
        new_kwargs = {
                'mininterval': Tqdm.default_mininterval,
                **kwargs
        }

        return _tqdm(*args, **new_kwargs)
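The key step is the dictionary merge in the last method: the class-level default comes first and **kwargs comes second, so a value passed by the caller overrides the global default. The sketch below isolates that pattern without depending on tqdm; _render and Renderer are hypothetical stand-ins for the wrapped function and the wrapper class.

```python
from typing import Any, Dict


def _render(*args: Any, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for the wrapped third-party function.

    It simply returns the keyword arguments it received.
    """
    return kwargs


class Renderer:
    # Global default shared by every call made through the wrapper.
    default_mininterval: float = 0.1

    @staticmethod
    def set_default_mininterval(value: float) -> None:
        Renderer.default_mininterval = value

    @staticmethod
    def render(*args: Any, **kwargs: Any) -> Dict[str, Any]:
        # Later keys win in a dict literal, so explicit kwargs
        # override the global default.
        new_kwargs = {"mininterval": Renderer.default_mininterval, **kwargs}
        return _render(*args, **new_kwargs)


with_default = Renderer.render()
Renderer.set_default_mininterval(10.0)
with_new_default = Renderer.render()
with_override = Renderer.render(mininterval=1.0)
print(with_default, with_new_default, with_override)
```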

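The register decorator in AllenNLP

From the Python decorators section, we know that an @ decorator is typically a function that receives a function and returns a function (when decorating a function), or one that receives a class (Type) and returns a class (Type) (when decorating a class).

AllenNLP's register method is implemented in the ~allennlp.common.registrable.Registrable class, as follows: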

@classmethod
def register(cls: Type[T], name: str):
    registry = Registrable._registry[cls]
    def add_subclass_to_registry(subclass: Type[T]):
        # Add to registry, raise an error if key has already been used.
        if name in registry:
            message = "Cannot register %s as %s; name already in use for %s" % (
                    name, cls.__name__, registry[name].__name__)
            raise ConfigurationError(message)
        registry[name] = subclass
        return subclass
    return add_subclass_to_registry

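Note that register(name) is not itself the decorator: it returns add_subclass_to_registry, which is the function that actually receives the class. As you can see, this decorator does not change what is returned (it is still subclass), but before returning it records the subclass under the given name (via registry[name] = subclass), after first checking that the name is not already in use.

With this registry in place, a specific implementation (subclass) of a base class can be looked up by its registered name. This is another class method of Registrable, as follows: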

@classmethod
def by_name(cls: Type[T], name: str) -> Type[T]:
    logger.debug(f"instantiating registered subclass {name} of {cls}")
    if name not in Registrable._registry[cls]:
        raise ConfigurationError("%s is not a registered name for %s" % (name, cls.__name__))
    return Registrable._registry[cls].get(name)

There is also a class method that lists the names of all registered subclasses.
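To see the pieces working together, here is a simplified, self-contained sketch of the registry pattern. It is not AllenNLP's code: SimpleRegistrable, Tokenizer, and WhitespaceTokenizer are hypothetical names, ValueError stands in for ConfigurationError, and the method name list_available is an assumption made for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Type, TypeVar

T = TypeVar("T")


class SimpleRegistrable:
    # Maps each base class to its own {name: subclass} dictionary.
    _registry: Dict[type, Dict[str, type]] = defaultdict(dict)

    @classmethod
    def register(cls: Type[T], name: str) -> Callable[[Type[T]], Type[T]]:
        registry = SimpleRegistrable._registry[cls]

        def add_subclass_to_registry(subclass: Type[T]) -> Type[T]:
            if name in registry:
                raise ValueError(f"{name} is already registered for {cls.__name__}")
            registry[name] = subclass
            return subclass  # the class itself is returned unchanged

        return add_subclass_to_registry

    @classmethod
    def by_name(cls: Type[T], name: str) -> Type[T]:
        if name not in SimpleRegistrable._registry[cls]:
            raise ValueError(f"{name} is not a registered name for {cls.__name__}")
        return SimpleRegistrable._registry[cls][name]

    @classmethod
    def list_available(cls) -> List[str]:
        return list(SimpleRegistrable._registry[cls].keys())


class Tokenizer(SimpleRegistrable):
    pass


@Tokenizer.register("whitespace")
class WhitespaceTokenizer(Tokenizer):
    def tokenize(self, text: str) -> List[str]:
        return text.split()


tokenizer_class = Tokenizer.by_name("whitespace")
tokens = tokenizer_class().tokenize("hello registry world")
print(Tokenizer.list_available(), tokens)
```

Because the registry is keyed by the class on which register is called, names registered on Tokenizer do not collide with names registered on some other base class.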

FromParams in AllenNLP

FromParams is the foundation of every configurable class in AllenNLP (for example, Model). This base class contains just one class method, as follows:

class FromParams:
    """
    Mixin to give a from_params method to classes. We create a distinct base class for this
    because sometimes we want non-Registrable classes to be instantiatable from_params.
    """
    @classmethod
    def from_params(cls: Type[T], params: Params, **extras) -> T:
        """
        This is the automatic implementation of `from_params`. Any class that subclasses `FromParams`
        (or `Registrable`, which itself subclasses `FromParams`) gets this implementation for free.
        If you want your class to be instantiated from params in the "obvious" way -- pop off parameters
        and hand them to your constructor with the same names -- this provides that functionality.

        If you need more complex logic in your `from_params` method, you'll have to implement
        your own method that overrides this one.
        """
        ...  # implementation code

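This method takes the configuration parameters params and returns an instance of the class. FromParams supplies a default implementation of from_params, which subclasses inherit unless they override it.

AllenNLP's Registrable class is implemented with FromParams as its base class. Therefore, any class that can be registered with .register() can also be instantiated from configuration via .from_params(), but the converse does not necessarily hold.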
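The "obvious" behaviour described in the docstring (pop parameters off and hand them to the constructor under the same names) can be sketched with inspect.signature. This is an illustration under simplifying assumptions, not AllenNLP's implementation: a plain dict replaces Params, nested objects and type annotations are ignored, and SimpleFromParams and Encoder are hypothetical names.

```python
import inspect
from typing import Any, Dict, Type, TypeVar

T = TypeVar("T")


class SimpleFromParams:
    @classmethod
    def from_params(cls: Type[T], params: Dict[str, Any], **extras: Any) -> T:
        kwargs: Dict[str, Any] = {}
        for name, parameter in inspect.signature(cls.__init__).parameters.items():
            if name == "self":
                continue
            if parameter.kind in (inspect.Parameter.VAR_POSITIONAL,
                                  inspect.Parameter.VAR_KEYWORD):
                continue
            if name in params:
                kwargs[name] = params.pop(name)    # taken from the config
            elif name in extras:
                kwargs[name] = extras[name]        # supplied by the caller
            elif parameter.default is not inspect.Parameter.empty:
                kwargs[name] = parameter.default   # constructor default
            else:
                raise ValueError(f"missing required parameter: {name}")
        if params:
            # Anything left over was not consumed by the constructor.
            raise ValueError(f"unused parameters: {sorted(params)}")
        return cls(**kwargs)


class Encoder(SimpleFromParams):
    def __init__(self, hidden_size: int, dropout: float = 0.0) -> None:
        self.hidden_size = hidden_size
        self.dropout = dropout


encoder = Encoder.from_params({"hidden_size": 128})
print(encoder.hidden_size, encoder.dropout)
```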

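Creating a temporary directory

Sometimes files need to be extracted into a temporary directory. In AllenNLP this happens when loading an archive, which requires unpacking a tar.gz file.

AllenNLP uses Python's built-in tempfile module; see ~allennlp.models.archival.py:180:189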

# Extract archive to temp dir
tempdir = tempfile.mkdtemp()
logger.info(f"extracting archive file {resolved_archive_file} to temp dir {tempdir}")
with tarfile.open(resolved_archive_file, 'r:gz') as archive:
    archive.extractall(tempdir)
# Postpone cleanup until exit in case the unarchived contents are needed outside
# this function.
atexit.register(_cleanup_archive_dir, tempdir)

serialization_dir = tempdir
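The same steps can be tried in isolation with only the standard library. In the sketch below, the helper extract_to_tempdir and the file names are hypothetical; a tiny archive is built first so that there is something to extract.

```python
import os
import shutil
import tarfile
import tempfile


def extract_to_tempdir(archive_path: str) -> str:
    """Extract a .tar.gz archive into a fresh temporary directory."""
    tempdir = tempfile.mkdtemp()  # the caller is responsible for deleting it
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(tempdir)
    return tempdir


# Build a tiny archive to demonstrate with.
workdir = tempfile.mkdtemp()
source_file = os.path.join(workdir, "config.json")
with open(source_file, "w") as f:
    f.write("{}")
archive_path = os.path.join(workdir, "model.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(source_file, arcname="config.json")

extracted_dir = extract_to_tempdir(archive_path)
extracted_files = os.listdir(extracted_dir)
print(extracted_files)

# mkdtemp never deletes the directory on its own, so clean up explicitly.
shutil.rmtree(extracted_dir)
shutil.rmtree(workdir)
```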

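Registering an action to run at program exit

In the example above, the temporary directory that was created must be deleted when the program exits. This is done with the atexit module; see ~allennlp.models.archival.py:185:187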

# Postpone cleanup until exit in case the unarchived contents are needed outside
# this function.
atexit.register(_cleanup_archive_dir, tempdir)

Here _cleanup_archive_dir is the function to run, and tempdir is its argument.

Author: yym6472
Link: https://yym6472.github.io/2019/06/29/AllenNLP学习笔记/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.