From survey participants to agents

In the social sciences, truly relational datasets are relatively scarce, while individual-centered survey data are far more common. Pop2net provides methods for generating networks from such survey data by first creating actors from survey participants and then connecting them based on their empirical attributes. This approach is particularly effective with datasets that include quasi-network information—for example, household membership data found in surveys like the German Socio-Economic Panel (SOEP).

The following basic example illustrates how Pop2net can be used to construct networks from survey data.

Let’s begin by creating an artificial example dataset:

[1]:
import pop2net as p2n
from pop2net.data_fakers.soep import soep_faker

df_soep = soep_faker.soep(size=100, seed=1)
df_soep.head(20)
[1]:
     age  gender  work_hours_day  nace2_division   hid  pid
 0  20.0    male        0.000000              -2  2201  583
 1  44.0  female        0.000000              -2  2201  868
 2  32.0  female        0.000000              -2  1033  262
 3  98.0  female        0.000000              -2  1033  121
 4  95.0  female       12.613089              86  8117  780
 5  38.0    male        0.000000              -2  8117  461
 6  52.0    male        6.528070              64  8117  484
 7  80.0  female        0.000000              -2  6219  808
 8  44.0    male        1.889185              88  6219  215
 9  81.0  female        0.000000              -2  6219   97
10   4.0    male        0.000000              -2  6219  500
11  48.0    male        0.000000              -2   464  915
12  20.0  female        5.524712              86   464  856
13  65.0  female        0.000000              -2   464  400
14  66.0    male        0.000000              -2   464  444
15  85.0    male        0.000000              -2   464  623
16  32.0  female        0.000000              -2    34  713
17  49.0  female        7.113968              84    34  457
18  42.0    male        0.000000              -2    34  273
19  97.0  female        0.000000              -2  3748  606

Next, we set up the environment and the creator:

[2]:
env = p2n.Environment()
creator = p2n.Creator(env)

Using the creator, we can sample from the dataset above. While doing so, we can specify a column to serve as the sampling unit. In this example, we use the household identifier “hid” as the sample_level, so households are drawn as a whole. It is also possible to oversample the data, that is, to generate more sampled actors than there are original rows. Here, we set the number of samples to 500.

[3]:
df_sample = creator.draw_sample(
    df=df_soep,  # set the dataset to be sampled from
    n=500,  # set the number of rows to be sampled
    sample_level="hid",  # set the unit of sampling
    sample_weight=None,  # if needed, set a column as the sampling weight
)

df_sample.head(20)
[3]:
     age  gender  work_hours_day  nace2_division  hid  pid  hid_original
 0  95.0  female       12.613089              86    1  780          8117
 1  38.0    male        0.000000              -2    1  461          8117
 2  52.0    male        6.528070              64    1  484          8117
 3  36.0  female        0.000000              -2    2  562          6014
 4  65.0  female        9.102887              99    2  720          6014
 5  38.0   other        6.834517              86    2  795          6014
 6  51.0    male        8.000000              69    2  691          6014
 7  53.0  female        0.000000              -2    2  756          6014
 8  28.0  female        3.963627              84    2  384          6014
 9  97.0  female        0.000000              -2    3  606          3748
10   2.0  female        0.000000              -2    3  968          3748
11  32.0    male        0.000000              -2    4  450          1416
12  37.0  female        0.000000              -2    4  680          1416
13  61.0    male        5.049022              64    4  521          1416
14  53.0    male        8.000000              99    5  945          9116
15   5.0    male        0.000000              -2    5  658          9116
16  53.0    male        0.000000              -2    6  924          1674
17  90.0    male        0.000000              -2    6  326          1674
18  67.0    male        0.000000              -2    7  520          6915
19  49.0  female        0.000000              -2    7  850          6915

Note that draw_sample overwrites the column specified as the sample level (here “hid”) with new, unique values and preserves the original values in a separate column (here “hid_original”). A household that is drawn more than once therefore receives a distinct identifier each time, so the sample level can still be used to connect actors without introducing artifacts caused by duplicated samples. If desired, this behavior can be disabled by setting the replace_sample_level_column argument to False.
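
To make the relabeling idea concrete, here is a minimal pure-Python sketch of household-level sampling with replacement. It is not Pop2net's implementation: the function name draw_household_sample, the rule of stopping once at least n rows are collected, and the tiny input list are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def draw_household_sample(rows, n, sample_level="hid", seed=None):
    """Draw whole households with replacement until at least n rows exist.

    Hypothetical helper for illustration only; Pop2net's draw_sample
    may use a different stopping rule and numbering scheme.
    """
    rng = random.Random(seed)

    # Group the rows by their original household identifier.
    households = defaultdict(list)
    for row in rows:
        households[row[sample_level]].append(row)
    original_ids = list(households)

    sample = []
    new_id = 0
    while len(sample) < n:
        original_id = rng.choice(original_ids)
        new_id += 1  # every draw gets a fresh identifier
        for row in households[original_id]:
            copy = dict(row)
            copy[sample_level + "_original"] = original_id
            copy[sample_level] = new_id
            sample.append(copy)
    return sample


rows = [
    {"pid": 583, "hid": 2201},
    {"pid": 868, "hid": 2201},
    {"pid": 262, "hid": 1033},
    {"pid": 121, "hid": 1033},
    {"pid": 780, "hid": 8117},
]
sample = draw_household_sample(rows, n=12, seed=1)
```

Because every draw receives a fresh identifier, two copies of the same original household never end up merged into one oversized household when “hid” is later used to connect actors.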

Now we create the actors from the sampled data:

[4]:
_ = creator.create_actors(df=df_sample)

Now let’s use two types of locations (households and workplaces) to connect the created actors based on their empirical attributes:

[5]:
class Household(p2n.LocationDesigner):
    def split(self, actor):
        """Build one household location for each hid."""
        return actor.hid


class Work(p2n.LocationDesigner):
    n_actors = 10  # set the number of actors per work place to 10

    def split(self, actor):
        """Build work places separated by industry."""
        return actor.nace2_division

    def filter(self, actor):
        """Connect only actors with valid industry codes."""
        return actor.nace2_division > 0

    def weight(self, actor):
        """Weight the connection between the actor and the location
        by the empirical working time."""
        return actor.work_hours_day
[6]:
_ = creator.create_locations(location_designers=[Household, Work])
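
The roles of split, filter and n_actors can be pictured with a small pure-Python sketch. This is a conceptual illustration under simplifying assumptions, not Pop2net's internal algorithm: the helper build_locations is hypothetical, actors are plain dictionaries, and groups larger than n_actors are simply cut into consecutive chunks, which may differ from how Pop2net handles leftover actors.

```python
from collections import defaultdict


def build_locations(actors, split, keep=lambda actor: True, n_actors=None):
    """Group actors into locations (hypothetical helper, illustration only).

    keep     -- plays the role of filter(): which actors may join at all
    split    -- plays the role of split(): one group per returned value
    n_actors -- maximum group size; larger groups are cut into chunks
    """
    groups = defaultdict(list)
    for actor in actors:
        if keep(actor):
            groups[split(actor)].append(actor)

    locations = []
    for key, members in groups.items():
        size = n_actors or len(members)
        for start in range(0, len(members), size):
            locations.append((key, members[start:start + size]))
    return locations


actors = [
    {"pid": 780, "hid": 1, "nace2_division": 86},
    {"pid": 461, "hid": 1, "nace2_division": -2},
    {"pid": 484, "hid": 1, "nace2_division": 64},
    {"pid": 795, "hid": 2, "nace2_division": 86},
    {"pid": 521, "hid": 4, "nace2_division": 64},
]

# One household per hid, no filter, no size limit.
households = build_locations(actors, split=lambda a: a["hid"])

# Workplaces per industry, only for actors with a valid industry code.
workplaces = build_locations(
    actors,
    split=lambda a: a["nace2_division"],
    keep=lambda a: a["nace2_division"] > 0,
    n_actors=10,
)
```

In this toy input, the actor with pid 461 has the industry code -2 and therefore belongs to a household but to no workplace.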
[7]:
inspector = p2n.NetworkInspector(env)
inspector.plot_networks(location_color="label")
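
Conceptually, the resulting actor network is a projection of the actor-location memberships: two actors are linked whenever they share a location. The following sketch illustrates that idea in plain Python. The membership dictionary, the helper project_edges, the household weight of 1.0, and the rules for edge weights (minimum of the two actor-location weights, summed over shared locations) are assumptions chosen for illustration and may not match Pop2net's defaults.

```python
from itertools import combinations

# Hypothetical memberships: location label -> {pid: actor-location weight}
memberships = {
    "household_1": {780: 1.0, 461: 1.0, 484: 1.0},
    "work_86": {780: 12.6, 795: 6.8},
}


def project_edges(memberships):
    """Project actor-location memberships onto actor-actor edges."""
    edges = {}
    for members in memberships.values():
        for a, b in combinations(sorted(members), 2):
            # Assumed rule: the weaker of the two memberships limits the tie.
            weight = min(members[a], members[b])
            edges[(a, b)] = edges.get((a, b), 0.0) + weight
    return edges


edges = project_edges(memberships)
```

Here actor 780 acts as a bridge: it is tied to 461 and 484 through the household and to 795 through the workplace.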